OPSEC

Metadata Hygiene: Stripping EXIF, PDF and Office Before You Publish

By the Basilisk Team

How to remove metadata that leaks identity, GPS and authorship from images, PDFs and Office documents before publishing online.

An anonymous researcher published a 12-page PDF exposing corporate fraud. Within 48 hours, his name was on Twitter. It was not social engineering: it was the PDF Author field, auto-populated by LibreOffice, plus a modification timestamp that lined up with his working hours. Metadata is the Achilles heel of any sensitive publication, and most office suites embed identifying information by default without warning the user. The Basilisk team treats metadata sanitization as a mandatory control before any public drop, on the same level as reviewing IPs and browser fingerprints. If you came from the practices in "OPSEC for Security Researchers: Building a Personal Threat Model", you already know the technical part is the easy half.

Start with the basics: EXIF in images. A modern smartphone JPEG can carry GPS coordinates accurate to about 5 meters, the device model, lens details, and even the device's orientation at the moment of capture. The standard tool remains Phil Harvey's ExifTool. For auditing, run exiftool -a -G1 -s file.jpg and you will see the EXIF, XMP, IPTC, MakerNotes and ICC_Profile groups. For aggressive cleanup, exiftool -all= -overwrite_original *.jpg wipes everything, and because -overwrite_original leaves no backup copy, keep an unedited original outside your publishing directory. With PNG, the trouble is the tEXt, zTXt and iTXt chunks left by Photoshop or GIMP, which often contain the username and the full path of the source file.
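To make the PNG point concrete, here is a minimal sketch that walks the chunk structure of a PNG and reports the keyword of every textual chunk. It is an illustration only, not a replacement for exiftool: the function names (build_chunk, list_png_text_chunks) and the sample "Author" value are made up for this example, and the parser does not verify CRCs or decompress zTXt payloads.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
TEXT_CHUNK_TYPES = {b"tEXt", b"zTXt", b"iTXt"}


def build_chunk(chunk_type: bytes, payload: bytes) -> bytes:
    """Assemble one PNG chunk (length, type, payload, CRC) for the demo below."""
    crc = zlib.crc32(chunk_type + payload) & 0xFFFFFFFF
    return struct.pack(">I", len(payload)) + chunk_type + payload + struct.pack(">I", crc)


def list_png_text_chunks(data: bytes) -> list[tuple[str, str]]:
    """Return (chunk_type, keyword) for every textual chunk in a PNG byte string."""
    if not data.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG file")
    found = []
    offset = len(PNG_SIGNATURE)
    while offset + 8 <= len(data):
        # Each chunk is: 4-byte big-endian length, 4-byte type, payload, 4-byte CRC.
        length, chunk_type = struct.unpack(">I4s", data[offset:offset + 8])
        payload = data[offset + 8:offset + 8 + length]
        if chunk_type in TEXT_CHUNK_TYPES:
            # In all three text chunk types the keyword precedes the first NUL byte.
            keyword = payload.split(b"\x00", 1)[0].decode("latin-1")
            found.append((chunk_type.decode("ascii"), keyword))
        if chunk_type == b"IEND":
            break
        offset += 12 + length
    return found


if __name__ == "__main__":
    # Hypothetical in-memory sample: a 1x1 header plus one tEXt chunk.
    sample = (
        PNG_SIGNATURE
        + build_chunk(b"IHDR", struct.pack(">IIBBBBB", 1, 1, 8, 0, 0, 0, 0))
        + build_chunk(b"tEXt", b"Author\x00alice")
        + build_chunk(b"IEND", b"")
    )
    print(list_png_text_chunks(sample))
```

An empty result on your final file is what you want to see; anything else is worth reading by hand before publishing.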

PDF is the trickiest format because it carries metadata in three layers: the /Info dictionary, the XMP stream, and objects from earlier revisions that incremental saves leave behind in the file. exiftool handles the first two, but its PDF edits are themselves written as an incremental update, so the values it deletes remain recoverable until the file is rewritten. For that you need qpdf --linearize --object-streams=generate in.pdf out.pdf (input first, output second), which rewrites the entire structure and drops revision history. There is also the case of PDFs produced by corporate scanners: many embed the multifunction printer's serial number in the Producer field. If your document was printed and scanned, also consider the yellow tracking dots that many color laser printers add to every page, documented by the EFF and critical for anyone working in scenarios like "Personal Security for High-Visibility Targets: Journalists, Activists, and Executives".
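The revision problem can be illustrated with a rough heuristic scan of the raw bytes. The sketch below counts %%EOF markers (each incremental save appends one) and collects every literal /Author, /Creator, /Producer and /Title string it can see, including stale ones from earlier revisions. The function name scan_pdf and the sample bytes are invented for this example. It is deliberately crude: it misses hex-encoded strings, strings with nested parentheses and anything inside compressed object streams, and a linearized file legitimately contains two %%EOF markers, so treat it as a smoke test rather than proof of cleanliness.

```python
import re

# Matches e.g. "/Author (alice)"; handles backslash escapes but not nested parentheses.
INFO_STRING = re.compile(rb"/(Author|Creator|Producer|Title)\s*\(((?:\\.|[^\\)])*)\)")


def scan_pdf(data: bytes) -> dict:
    """Report %%EOF count and every literal Info-style string found in raw PDF bytes."""
    values: dict[str, list[str]] = {}
    for key, value in INFO_STRING.findall(data):
        values.setdefault(key.decode("ascii"), []).append(value.decode("latin-1"))
    return {"eof_markers": data.count(b"%%EOF"), "info_strings": values}


if __name__ == "__main__":
    # Hypothetical file: the second revision blanks the author, the first still holds it.
    sample = (
        b"%PDF-1.4\n"
        b"1 0 obj\n<< /Author (alice) /Producer (LibreOffice 7.4) >>\nendobj\n%%EOF\n"
        b"1 0 obj\n<< /Author () >>\nendobj\n%%EOF\n"
    )
    print(scan_pdf(sample))
```

In the sample, the blanked author from the latest revision sits next to the original value from the first one, which is exactly what a full rewrite of the file is meant to eliminate.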

Office documents are a zoo. DOCX, XLSX and PPTX are actually ZIP archives containing XML: docProps/core.xml lists author, last modified by and revision, while docProps/app.xml adds the company and the producing application. Word also maintains rsid values (revision save IDs) that allow correlating text fragments across different documents from the same author, an attack known as rsid fingerprinting. For reliable cleanup, use Word's Inspect Document followed by Remove All, or LibreOffice File > Properties > Reset Properties combined with Tools > Options > Security > Remove personal information on saving. For batch automation, oxml-document-cleaner in Python or mat2 (Metadata Anonymisation Toolkit) cover most cases without opening a GUI.
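Because these files are plain ZIP archives, the leak is easy to inspect with the standard library alone. The following sketch reads docProps/core.xml and collects the distinct rsid values from word/document.xml. The helper names (inspect_ooxml, make_sample_docx) and the sample values are assumptions for illustration; the sample archive is a stripped-down stand-in, not a document Word would open.

```python
import io
import re
import xml.etree.ElementTree as ET
import zipfile

CORE_XML = (
    '<cp:coreProperties '
    'xmlns:cp="http://schemas.openxmlformats.org/package/2006/metadata/core-properties" '
    'xmlns:dc="http://purl.org/dc/elements/1.1/">'
    "<dc:creator>alice</dc:creator>"
    "<cp:lastModifiedBy>alice</cp:lastModifiedBy>"
    "<cp:revision>7</cp:revision>"
    "</cp:coreProperties>"
)
DOCUMENT_XML = (
    '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    '<w:body><w:p w:rsidR="00A1B2C3" w:rsidRDefault="00A1B2C3">'
    '<w:r w:rsidRPr="00D4E5F6"/></w:p></w:body></w:document>'
)


def make_sample_docx() -> io.BytesIO:
    """Build a tiny DOCX-like archive in memory (hypothetical demo data)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("docProps/core.xml", CORE_XML)
        archive.writestr("word/document.xml", DOCUMENT_XML)
    buffer.seek(0)
    return buffer


def inspect_ooxml(source) -> dict:
    """Return core properties and distinct rsid values from a path or file object."""
    report = {"core": {}, "rsids": set()}
    with zipfile.ZipFile(source) as archive:
        names = set(archive.namelist())
        if "docProps/core.xml" in names:
            root = ET.fromstring(archive.read("docProps/core.xml"))
            for element in root:
                tag = element.tag.rsplit("}", 1)[-1]  # drop the XML namespace
                if element.text:
                    report["core"][tag] = element.text
        if "word/document.xml" in names:
            body = archive.read("word/document.xml").decode("utf-8", "replace")
            # rsid attributes look like w:rsidR="00A1B2C3" (eight hex digits).
            report["rsids"] = set(re.findall(r'w:rsid\w*="([0-9A-Fa-f]{8})"', body))
    return report


if __name__ == "__main__":
    print(inspect_ooxml(make_sample_docx()))
```

After a successful cleanup, the core dictionary should hold no personal values and the rsid set should be empty.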

mat2 deserves its own spotlight because it was built by people who think about real threat models, and it ships with Tails. mat2 --inplace document.pdf works on more than 30 formats including SVG, MP4, FLAC and EPUB. The --show mode lists the metadata it detects without removing anything, which makes it useful for checking what is still left. If you upload via Tor, wrap the upload tool in torsocks, following the practice described in "Real Anonymity with Tor: What Works and What is Myth in 2026". For images straight from a phone, consider how the cleanup tool treats the image data itself: passing a JPEG through ImageMagick with -strip removes ICC profiles and EXIF, but ImageMagick also decodes and re-encodes the image, writing new quantization tables that can identify the encoder. For serious cases prefer jpegtran -copy none, which drops the metadata markers losslessly and preserves the original tables.

There are traps that automated tools do not reliably catch. Screenshots from variable-refresh-rate monitors can leave microartifacts that identify the GPU model. LaTeX-generated PDFs embed the producer and hyperref signature along with a compilation date that carries the local UTC offset, leaking the timezone. MP4 videos carry a moov atom whose movie header stores creation and modification timestamps. For complete cleanup the Basilisk flow is: produce content in a disposable VM with the timezone set to UTC, export via clipboard or isolated network share, sanitize with mat2 on the host, validate with exiftool -a -G1 -s and only then publish. The same pattern appears in "Digital Compartmentalization: Separate Identities Without Leaking Metadata" and in "Tails, Whonix or Qubes OS: Which to Pick for Each OPSEC Scenario".

Adversarial validation closes the loop. Before publishing, upload the final file to a public inspection service like metadata2go or run Didier Stevens' pdfid.py and pdf-parser.py. Compare the output with what you expect: ideally no authorship fields, timestamps zeroed or set to 1970-01-01, and no embedded streams beyond what is strictly necessary. Document the checklist and version it alongside the content. Practical takeaway: build a shell function called clean-doc (a plain alias cannot pass the same filename to two commands) that runs mat2 --inplace followed by exiftool -a -G1 -s on the result, and never publish anything without running it and reading the output. Three seconds of discipline beat three years of litigation.
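The clean-doc helper can also be sketched as a small Python wrapper instead of a shell construct, which makes the missing-tool and failure cases explicit. This is one possible shape, not the Basilisk team's actual script: the function names are invented here, and it assumes mat2 and exiftool are installed and on PATH. The two commands are exactly the ones named above.

```python
import shutil
import subprocess
import sys


def clean_doc_commands(path: str) -> list[list[str]]:
    """Return the two commands to run, in order: sanitize, then audit."""
    return [
        ["mat2", "--inplace", path],
        ["exiftool", "-a", "-G1", "-s", path],
    ]


def clean_doc(path: str) -> int:
    """Run mat2 then exiftool on path; stop at the first problem.

    Returns 0 on success, 127 if a tool is missing, or the failing
    command's exit code. The exiftool output goes to the terminal so
    that a human reads it before publishing.
    """
    for command in clean_doc_commands(path):
        if shutil.which(command[0]) is None:
            print(f"missing tool: {command[0]}", file=sys.stderr)
            return 127
        result = subprocess.run(command)
        if result.returncode != 0:
            print(f"{command[0]} failed with exit code {result.returncode}", file=sys.stderr)
            return result.returncode
    return 0
```

Calling clean_doc("report.pdf") from a script or a REPL runs both steps. Stopping when mat2 fails matters: auditing a file that was never sanitized can give a false sense of having completed the checklist.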

Finally, remember metadata is not only a publishing problem: emails, Slack attachments and even uploads to public S3 buckets carry the same risk. Mature teams treat sanitization as a pipeline, not a manual step. Integrate mat2 into pre-commit hooks for repos that accept anonymous contributions and into upload gateways for whistleblowing platforms. Metadata hygiene will not protect you against a state-resourced adversary, but it eliminates the entire class of self-inflicted error that brings researchers down before the investigation even begins. Apply once, automate forever.
