Strip meta-data of images etc. in PDFs
Meta-data of images in PDFs should be stripped.
Enclosed please find an example you can use for testing.
#3 Updated by jvoisin almost 3 years ago
Unfortunately, MAT doesn't (and won't) parse PDF on its own. It uses Cairo and Poppler instead, so extracting media isn't a practical approach.
A quick-and-dirty solution would be to let the user choose a lossy method, which consists of rendering the PDF onto a PNG surface and then printing that to a PDF one. But this would greatly reduce the quality of the cleaned file, both visually (you can't have real fonts in an image) and in terms of accessibility (you can't select text in an image).
What do you think?
#4 Updated by htgoebel almost 3 years ago
The Python package PyPDF2 can be used to manipulate the PDF:
- stripping meta-data
- extracting content elements (e.g. images) and replacing them with stripped ones
- removing attachments (or replacing them with stripped ones)
- even removing hyper-links
All you need to do is traverse the PDF structure in memory and strip off unwanted items.
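A minimal sketch of such a traversal, to illustrate the idea only. It is deliberately not tied to PyPDF2's API: plain dicts and lists stand in for the parsed PDF objects, and the names `strip_unwanted`, `is_link` and `UNWANTED_KEYS` are made up for this example. The keys themselves (/Info, /Metadata, /EmbeddedFiles, and /Annots entries with /Subtype /Link) are the standard PDF ones.

```python
# Hypothetical sketch: dicts and lists stand in for a parsed PDF object tree.
# UNWANTED_KEYS, is_link and strip_unwanted are illustrative names, not a library API.
UNWANTED_KEYS = {"/Info", "/Metadata", "/EmbeddedFiles"}


def is_link(annot):
    """Return True if the annotation dictionary describes a hyperlink."""
    return isinstance(annot, dict) and annot.get("/Subtype") == "/Link"


def strip_unwanted(node, unwanted=UNWANTED_KEYS, _seen=None):
    """Strip unwanted entries from a nested dict/list structure, in place.

    Returns the number of entries that were removed.
    """
    if _seen is None:
        _seen = set()
    # PDF object graphs contain cycles (e.g. /Parent), so visit each container once.
    if id(node) in _seen:
        return 0
    removed = 0
    if isinstance(node, dict):
        _seen.add(id(node))
        for key in list(node):
            if key in unwanted:
                del node[key]
                removed += 1
                continue
            value = node[key]
            if key == "/Annots" and isinstance(value, list):
                # Drop link annotations, keep all other annotation types.
                kept = [annot for annot in value if not is_link(annot)]
                removed += len(value) - len(kept)
                value[:] = kept
            removed += strip_unwanted(value, unwanted, _seen)
    elif isinstance(node, list):
        _seen.add(id(node))
        for item in node:
            removed += strip_unwanted(item, unwanted, _seen)
    return removed
```

With a real PDF library the objects would be that library's dictionary and array types, and the cleaned structure would still have to be written back out so that the orphaned objects are not carried over into the new file.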
Another Python package is pdfminer, but I have not used it. (I also just discovered pdfparanoia, which may be of interest, too.)