Bug #11067

Strip meta-data of images etc. in PDFs

Added by htgoebel about 3 years ago. Updated almost 3 years ago.

Status: Confirmed
Priority: Elevated
Assignee:
Category: File format support
Target version:
Start date: 02/06/2016
Due date:
% Done: 0%
Estimated time: 8.00 h
QA Check: Dev Needed
Feature Branch:
Starter: No

Description

Meta-data of images embedded in PDFs should be stripped.

Enclosed please find an example you can use for testing.

address-book-new.pdf (27 KB) htgoebel, 02/06/2016 11:35 AM

History

#1 Updated by htgoebel about 3 years ago

Addendum:

You can extract the image from the PDF as jpeg using pdfimages version 0.26.5:

pdfimage -all

#2 Updated by htgoebel about 3 years ago

Update: The correct, complete command is

pdfimages -all address-book-new.pdf img

You can then check the result with exiftool or (more basically) with strings:

strings -n 10 img-000.jpg | less

Note: exiv2 does not show this meta-data.
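For anyone who wants to script the strings check above, here is a minimal Python sketch that approximates `strings -n 10`. The function name `printable_runs` is made up for this example, and the character set (printable ASCII plus tab) is an assumption about what counts as a printable run, so results may differ slightly from the real tool.

```python
import re


def printable_runs(data: bytes, min_len: int = 10):
    """Return runs of printable ASCII (plus tab) at least min_len bytes long.

    Rough approximation of `strings -n <min_len>`; hypothetical helper,
    not part of any existing tool.
    """
    pattern = re.compile(rb"[\x20-\x7e\t]{%d,}" % min_len)
    return [m.group().decode("ascii") for m in pattern.finditer(data)]


if __name__ == "__main__":
    # img-000.jpg is the file written by the pdfimages command above.
    with open("img-000.jpg", "rb") as fh:
        for run in printable_runs(fh.read()):
            print(run)
```

Any leftover comment or software tag in the extracted image should show up in the printed runs, which is a quick way to confirm whether a cleaned PDF still leaks something.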

#3 Updated by jvoisin almost 3 years ago

Unfortunately, MAT doesn't (and won't) parse PDF on its own. It uses Cairo and Poppler instead. Extracting media isn't a practical approach.

A hacky solution would be to allow the user to choose a lossy method: render the PDF onto a PNG surface, then print that to a PDF surface. But this would greatly reduce the quality of the cleaned file, both visually (you can't have real fonts in an image) and in terms of accessibility (you can't select text in an image).
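To make the trade-off concrete, the lossy method described above could be sketched roughly as follows with the Python bindings for Cairo and Poppler. This is not MAT's code: the names `raster_size` and `rasterize_pdf`, the default of 150 dpi, and the white page background are all assumptions made for illustration.

```python
import math
import pathlib


def raster_size(width_pt, height_pt, dpi=150):
    """Pixel dimensions for a page given in PDF points (1 pt = 1/72 inch)."""
    scale = dpi / 72.0
    return math.ceil(width_pt * scale), math.ceil(height_pt * scale)


def rasterize_pdf(src_path, dst_path, dpi=150):
    """Render each page to a bitmap, then paint that bitmap into a new PDF.

    Hypothetical helper: the output contains only pixels, so embedded image
    meta-data is gone, but so are real fonts and selectable text.
    """
    # Imported lazily so raster_size() works without the bindings installed.
    import cairo
    import gi
    gi.require_version("Poppler", "0.18")
    from gi.repository import Poppler

    uri = pathlib.Path(src_path).resolve().as_uri()
    doc = Poppler.Document.new_from_file(uri, None)
    out = cairo.PDFSurface(dst_path, 1, 1)  # resized per page below
    out_ctx = cairo.Context(out)
    scale = dpi / 72.0

    for index in range(doc.get_n_pages()):
        page = doc.get_page(index)
        width_pt, height_pt = page.get_size()
        width_px, height_px = raster_size(width_pt, height_pt, dpi)

        # Step 1: render the page onto an image surface (white background).
        image = cairo.ImageSurface(cairo.FORMAT_ARGB32, width_px, height_px)
        image_ctx = cairo.Context(image)
        image_ctx.set_source_rgb(1, 1, 1)
        image_ctx.paint()
        image_ctx.scale(scale, scale)
        page.render(image_ctx)

        # Step 2: paint the bitmap onto a PDF page of the original size.
        out.set_size(width_pt, height_pt)
        out_ctx.save()
        out_ctx.scale(1 / scale, 1 / scale)
        out_ctx.set_source_surface(image, 0, 0)
        out_ctx.paint()
        out_ctx.restore()
        out_ctx.show_page()

    out.finish()
```

Raising the dpi improves sharpness at the cost of file size, but no setting brings back fonts or text selection, which is the quality loss mentioned above.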

What do you think?

#4 Updated by htgoebel almost 3 years ago

The Python package PyPDF2 can be used to manipulate the PDF:
- stripping meta-data
- extracting content elements (e.g. images) and replacing them with stripped ones
- removing JavaScript and other active content
- removing attachments (or replacing them with stripped ones)
- even removing hyperlinks

All you need to do is traverse the PDF structure in memory and strip off unwanted items.
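The traversal idea can be illustrated with a toy model in which PDF dictionaries and arrays are plain Python dicts and lists. This is only a sketch: `strip_unwanted` and `UNWANTED_KEYS` are names invented here, the key list is an assumed starting point rather than a complete one, and with a real library indirect references would have to be resolved before descending into them.

```python
# PDF dictionary keys that commonly carry meta-data or active content.
# Assumed, non-exhaustive list for illustration.
UNWANTED_KEYS = frozenset({
    "/Metadata",       # XMP meta-data streams
    "/JS",             # JavaScript action bodies
    "/JavaScript",     # document-level JavaScript name tree
    "/OpenAction",     # action run when the document is opened
    "/AA",             # additional (trigger) actions
    "/EmbeddedFiles",  # attachments
})


def strip_unwanted(node, unwanted=UNWANTED_KEYS, _seen=None):
    """Recursively delete unwanted keys in place; return how many were removed.

    Hypothetical helper operating on nested dicts/lists. Containers already
    visited are skipped, since PDF structures contain cycles (e.g. /Parent).
    """
    if _seen is None:
        _seen = set()
    if not isinstance(node, (dict, list)) or id(node) in _seen:
        return 0
    _seen.add(id(node))

    removed = 0
    if isinstance(node, dict):
        for key in [k for k in node if k in unwanted]:
            del node[key]
            removed += 1
        children = list(node.values())
    else:
        children = list(node)
    for child in children:
        removed += strip_unwanted(child, unwanted, _seen)
    return removed
```

Removing hyperlinks or swapping images for stripped copies would need more than key deletion (for example, filtering `/Annots` entries by subtype or rewriting image streams), so those cases are left out of this sketch.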

Another Python package is pdfminer, but I have not used it. (I also just discovered pdfparanoia, which may be of interest, too.)

#5 Updated by jvoisin almost 3 years ago

  • Category set to File format support
  • Status changed from New to Confirmed
  • Assignee set to jvoisin
  • Priority changed from Normal to Elevated
  • Target version set to 0.7
  • Estimated time set to 8.00 h
  • QA Check set to Dev Needed
  • Starter set to No

#6 Updated by intrigeri almost 3 years ago

  • Tracker changed from Feature to Bug
