Obfuscate PDF Text: Scramble Copied Text with Crafted CMap
- This does not prevent your PDF from being copied. OCR nowadays is pretty advanced.
- Back up your document before you do anything.
- The author does not encourage plagiarism in any form.
In a PDF file with embedded fonts, the glyphs are stored in a table, usually following the encoding of the source font. To reduce file size, some files embed only a subset of the font. For example, to store a font that is only used by the phrase THE CAT SAT ON THE MAT, the glyphs T, H, E, C, A, S, O, N, M would be enough.
However, glyphs and code points do not always have a one-to-one relationship, especially when ligatures come into play. (Yes, I know ﬂ has its own code point, U+FB02, but you surely would not want to type it every time you search for words like “fly” or “flight”.) Also, glyphs are sometimes not stored at the positions of their code points, but are simply numbered 0, 1, 2, … Thus, to recognise every glyph precisely, we need a map that links each glyph to the code point(s) it represents.
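To make the idea concrete, here is a minimal Python sketch of a subset font whose glyphs are simply numbered 0, 1, 2, … in order of first appearance. The function name `build_subset` and the numbering scheme are my own assumptions for illustration (a real subsetter may order glyphs differently and usually reserves a slot for a missing-glyph entry); the space is skipped to match the list above.

```python
def build_subset(text):
    """Assign hypothetical glyph IDs 0, 1, 2, ... in order of first appearance."""
    glyph_ids = {}
    for ch in text:
        # Assumption: the space is not stored as a glyph in this toy subset.
        if ch != " " and ch not in glyph_ids:
            glyph_ids[ch] = len(glyph_ids)
    # The reverse table plays the role of the CMap: glyph ID -> character.
    to_unicode = {gid: ch for ch, gid in glyph_ids.items()}
    return glyph_ids, to_unicode


glyph_ids, to_unicode = build_subset("THE CAT SAT ON THE MAT")
print(glyph_ids)
print(to_unicode)
```

Without `to_unicode`, glyph 3 is just "the fourth shape in the table"; with it, a PDF reader knows that copying that shape should yield the letter C.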
Here is an example of CMap:
There are 2 main sections in a CMap, beginbfchar and beginbfrange. Each of them starts with a number, representing the number of entries it has, followed by the begin keyword, and ends with the matching endbfchar or endbfrange.
bfchar is usually used for one-to-one relationships. The first code is the glyph ID, and the second is the Unicode code point. For example,
<07> <0035> means glyph 07 is mapped to code point 0035.
bfrange is used for batch mapping in order: it maps a consecutive range in the glyph table to consecutive Unicode code points. For example:
<31> <39> <00F2> means:
- Map 31 to 00F2
- Map 32 to 00F3
- Map 33 to 00F4
- … and so on, up to mapping 39 to 00FA
bfrange can also assign a glyph to multiple Unicode characters, which is useful in processing ligatures.
<02> <02> [<0066006C>] means that glyph 02 is mapped to the sequence (0066, 006C), i.e. “fl”.
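The three entry forms described above can be sketched in Python as follows. This is only an illustration under a few assumptions: the helper names (`expand_bfchar`, `expand_bfrange`) are mine, each entry sits on a single line, destinations are UTF-16BE hex strings, and a range never overflows its last code unit.

```python
import re

HEX = re.compile(r"<([0-9A-Fa-f]+)>")


def hex_to_text(hex_digits):
    """Decode a UTF-16BE hex string such as '0066006C' into text."""
    return bytes.fromhex(hex_digits).decode("utf-16-be")


def expand_bfchar(line):
    """'<07> <0035>' -> {0x07: '5'}"""
    src, dst = HEX.findall(line)
    return {int(src, 16): hex_to_text(dst)}


def expand_bfrange(line):
    """Expand one bfrange entry into a glyph ID -> text dictionary."""
    head, bracket, tail = line.partition("[")
    codes = HEX.findall(head)
    lo, hi = int(codes[0], 16), int(codes[1], 16)
    if bracket:
        # Array form: one explicit destination per glyph in the range.
        targets = HEX.findall(tail)[: hi - lo + 1]
        return {lo + i: hex_to_text(t) for i, t in enumerate(targets)}
    # Plain form: the last code unit counts up across the range.
    start = codes[2]
    prefix, last = start[:-4], int(start[-4:], 16)
    return {
        lo + i: hex_to_text(f"{prefix}{last + i:04X}")
        for i in range(hi - lo + 1)
    }


print(expand_bfchar("<07> <0035>"))
print(expand_bfrange("<31> <39> <00F2>"))
print(expand_bfrange("<02> <02> [<0066006C>]"))
```

The last call shows why the array form matters for ligatures: a single glyph comes back as the two-character string "fl".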
The following steps use a PDF generated with Xe(La)TeX as an example.
- Decompress the PDF document using
qpdf --qdf --object-streams=disable document.pdf decompressed.pdf
- Open the file with a suitable tool. I’d recommend using a binary/hex editor.
- Locate the CMap of the target font.
- Edit the bfrange and bfchar entries to scramble a range of characters, e.g. map a specific set of characters to code point 0 (each bfrange entry counts up from 0 across its range):

    8 beginbfrange
    <00> <01> <0000>
    <09> <0A> <0000>
    <23> <26> <0000>
    <28> <3B> <0000>
    <3F> <5B> <0000>
    <5D> <5E> <0000>
    <61> <7A> <0000>
    <7B> <7C> <0000>
    endbfrange
    40 beginbfchar
    <02> <0000>
    <03> <0000>
    <04> <0000>
    <05> <0000>
    <06> <0000>
    <07> <0000>
    <08> <0000>
    <0B> <0000>
    <0C> <0000>
    <0D> <0000>
    <0E> <0000>
    <0F> <0000>
    <10> <0000>
    <11> <0000>
    <12> <0000>
    <13> <0000>
    <14> <0000>
    <15> <0000>
    <16> <0000>
    <17> <0000>
    <18> <0000>
    <19> <0000>
    <1A> <0000>
    <1B> <0000>
    <1C> <0000>
    <1D> <0000>
    <1E> <0000>
    <1F> <0000>
    <21> <0000>
    <22> <0000>
    <27> <0000>
    <3C> <0000>
    <3D> <0000>
    <3E> <0000>
    <5C> <0000>
    <5F> <0000>
    <60> <0000>
    <7D> <0000>
    <7E> <0000>
    <7F> <0000>
    endbfchar
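If you would rather not retype every entry by hand, the rewrite in the last step can be sketched in Python. This is a rough illustration, not a finished tool: `scramble_cmap` is a hypothetical helper of mine, it works on the CMap text only (you still have to paste the result back into the decompressed PDF), and it assumes one entry per line.

```python
import re

HEX = re.compile(r"<[0-9A-Fa-f]+>")


def scramble_cmap(cmap_text, target="<0000>"):
    """Point every bfchar/bfrange entry in cmap_text at `target`."""
    out = []
    section = None  # None, "bfchar" or "bfrange"
    for line in cmap_text.splitlines():
        stripped = line.strip()
        codes = HEX.findall(stripped)
        if stripped.endswith("beginbfchar"):
            section = "bfchar"
        elif stripped.endswith("beginbfrange"):
            section = "bfrange"
        elif stripped in ("endbfchar", "endbfrange"):
            section = None
        elif section == "bfchar" and len(codes) >= 2:
            # Keep the glyph ID, replace the code point.
            line = f"{codes[0]} {target}"
        elif section == "bfrange" and len(codes) >= 3:
            # Keep the range bounds; a plain target counts up across the range.
            line = f"{codes[0]} {codes[1]} {target}"
        out.append(line)
    return "\n".join(out)


sample = """2 beginbfchar
<07> <0035>
<08> <0036>
endbfchar
2 beginbfrange
<31> <39> <00F2>
<02> <02> [<0066006C>]
endbfrange"""

print(scramble_cmap(sample))
```

The entry counts in front of beginbfchar and beginbfrange stay valid because no line is added or removed. The stream length may change, though, so the edited file probably needs repairing afterwards; as far as I know qpdf ships a fix-qdf helper for exactly this, but I have not verified it on this workflow.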
- This should be technically possible for all PDFs, but it has not been tested thoroughly.
- If you are using LuaTeX, you may want to try some of the methods here.
- This method still retains text as text; it is only scrambled when copied. Other methods, such as converting the document to vector shapes or raster images, can also make the text “uncopyable”, but then no text can be detected directly, unlike in the original document.