Obfuscate PDF Text: Scramble Copied Text with Crafted CMap

Notice

  • This does not prevent your PDF from being copied. OCR nowadays is pretty advanced.
  • Back up your document before you do anything.
  • The author does not encourage plagiarism in any form.

Background: What is CMap

In a PDF file with embedded fonts, the glyphs are stored in a table that usually follows the encoding of the source font. To reduce file size, some files embed only a subset of the font. For example, a font used only for the phrase THE CAT SAT ON THE MAT needs just the glyphs T, H, E, C, A, S, O, N and M.
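To make the subsetting idea concrete, here is a minimal sketch that collects the distinct letters a phrase needs. The helper name subset_glyphs is made up for illustration, and real subsetters work on glyphs rather than characters, so treat this as a rough model only.

```python
def subset_glyphs(text: str) -> list[str]:
    """Return the distinct non-space characters of text, in first-seen order."""
    seen = []
    for ch in text:
        if not ch.isspace() and ch not in seen:
            seen.append(ch)
    return seen


glyphs = subset_glyphs("THE CAT SAT ON THE MAT")
print(glyphs)       # ['T', 'H', 'E', 'C', 'A', 'S', 'O', 'N', 'M']
print(len(glyphs))  # 9
```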

However, glyphs and code points do not always have a one-to-one relationship, especially when ligatures come into play. (Yes, I know fl has its own code point (U+FB02), but you would never want to type it every time you search for words like “fly” or “flight”.) Also, glyphs are sometimes not stored at the position of their code point but simply numbered 0, 1, 2, … Thus, to recognise every glyph precisely, we need a map that links each glyph to the code point(s) it represents. That map is the CMap.

Here is an example of a CMap:

/CIDInit /ProcSet findresource begin 12 dict begin begincmap /CIDSystemInfo <<
/Registry (F6+0) /Ordering (T1UV) /Supplement 0 >> def
/CMapName /F6+0 def
/CMapType 2 def
1 begincodespacerange <02> <b7> endcodespacerange
19 beginbfchar
<07> <03C0>
<09> <0061>
<0a> <006D>
<0b> <0070>
<1e> <02DA>
<20> <0020>
<22> <0022>
<3d> <003D>
<3f> <003F>
<59> <0059>
<5b> <005B>
<5d> <005D>
<5f> <005F>
<7d> <007D>
<84> <2014>
<85> <2013>
<90> <2019>
<b0> <00B0>
<b7> <00B7>
endbfchar
8 beginbfrange
<24> <25> <0024>
<27> <29> <0027>
<2b> <2e> <002B>
<30> <3b> <0030>
<41> <50> <0041>
<52> <57> <0052>
<61> <7b> <0061>
<8d> <8e> <201C>
endbfrange
6 beginbfrange
<02> <02> [<0066006C>]
<03> <03> [<00540068>]
<04> <04> [<00660069>]
<05> <05> [<00660074>]
<06> <06> [<00660066>]
<08> <08> [<006600660069>]
endbfrange
endcmap CMapName currentdict /CMap defineresource pop end end

A CMap has two main kinds of mapping sections, bfchar and bfrange. Each section opens with the number of entries it contains followed by beginbfchar or beginbfrange, and closes with the matching endbfchar or endbfrange.
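Since every section declares its own entry count, it is easy to get that number wrong when editing by hand. The sketch below compares the declared count with the number of entries actually present. The function name check_counts is hypothetical, and it assumes one entry per line, as in the example above.

```python
import re


def check_counts(cmap: str) -> list[tuple[str, int, int]]:
    """Return (kind, declared, actual) for every bfchar/bfrange section."""
    pattern = re.compile(r"(\d+)\s+begin(bfchar|bfrange)(.*?)end\2", re.S)
    results = []
    for declared, kind, body in pattern.findall(cmap):
        # Assumption: each non-blank line inside the section is one entry.
        entries = [line for line in body.splitlines() if line.strip()]
        results.append((kind, int(declared), len(entries)))
    return results


sample = """2 beginbfchar
<07> <03C0>
<09> <0061>
endbfchar
1 beginbfrange
<24> <25> <0024>
endbfrange"""

for kind, declared, actual in check_counts(sample):
    print(kind, declared, actual, "ok" if declared == actual else "MISMATCH")
```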

bfchar

This is usually used for one-to-one mappings. The first code is the glyph ID and the second is the Unicode code point. For example, <07> <0035> means glyph 07 is mapped to code point 0035.
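As a rough illustration, the body of a bfchar section can be read into a lookup table like this. parse_bfchar is a name invented for this sketch. It assumes the text passed in contains only the entry lines, and it treats each destination as UTF-16BE hex.

```python
import re


def parse_bfchar(section: str) -> dict[int, str]:
    """Parse the lines between beginbfchar and endbfchar into {code: text}."""
    mapping = {}
    for src, dst in re.findall(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>", section):
        # The destination is decoded as UTF-16BE, so one entry may hold
        # more than one code unit.
        mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return mapping


sample = """
<07> <03C0>
<09> <0061>
<0a> <006D>
"""
table = parse_bfchar(sample)
print(table[0x09], table[0x0A])  # a m
```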

bfrange

This is used for batch mapping in order: a range in the glyph table is mapped to consecutive Unicode code points. For example, <31> <39> <00F2> means:

  • Map 31 to 00F2
  • Map 32 to 00F3
  • Map 33 to 00F4
  • …
  • Map 39 to 00FA

bfrange can also assign a glyph to multiple Unicode characters, which is useful for ligatures. <02> <02> [<0066006C>] maps glyph 02 to the sequence (0066, 006C), i.e. the letters “fl”.
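Both forms of bfrange can be expanded into individual mappings with a few lines of code. The following is only a sketch: expand_bfrange and utf16 are hypothetical helpers, and the counting is simplified to plain integer addition on the destination, which is fine for small examples like the ones here but may not cover every edge case.

```python
import re

HEX = r"<([0-9A-Fa-f]+)>"
ENTRY = re.compile(HEX + r"\s*" + HEX + r"\s*(?:" + HEX + r"|\[([^\]]*)\])")


def utf16(hex_string: str) -> str:
    """Decode a hex string such as 0066006C as UTF-16BE text."""
    return bytes.fromhex(hex_string).decode("utf-16-be")


def expand_bfrange(section: str) -> dict[int, str]:
    """Expand bfrange entries into a {code: text} table."""
    mapping = {}
    for lo, hi, start, array in ENTRY.findall(section):
        codes = range(int(lo, 16), int(hi, 16) + 1)
        if start:
            # Single destination: count upward from it, one step per code.
            base = int(start, 16)
            for offset, code in enumerate(codes):
                mapping[code] = utf16(f"{base + offset:0{len(start)}X}")
        else:
            # Array form: one destination string per code in the range.
            for code, dst in zip(codes, re.findall(HEX, array)):
                mapping[code] = utf16(dst)
    return mapping


table = expand_bfrange("<31> <39> <00F2>\n<02> <02> [<0066006C>]")
print(hex(ord(table[0x39])))  # 0xfa
print(table[0x02])            # fl
```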

Procedure

In this example, we use a PDF generated with Xe(La)TeX.

  1. Decompress the PDF document using qpdf
    qpdf --qdf --object-streams=disable document.pdf decompressed.pdf
  2. Open the file with a suitable tool. I’d recommend using a binary/hex editor.
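  3. Locate the CMap of the target font.
  4. Use bfrange and/or bfchar to scramble a range of characters, e.g. remap a set of characters to code point 0 and its neighbours (a bfrange counts upward from its starting code point, so <61> <7A> <0000> yields 0000, 0001, 0002, …):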
8 beginbfrange
<00> <01> <0000>
<09> <0A> <0000>
<23> <26> <0000>
<28> <3B> <0000>
<3F> <5B> <0000>
<5D> <5E> <0000>
<61> <7A> <0000>
<7B> <7C> <0000>
endbfrange
40 beginbfchar
<02> <0000>
<03> <0000>
<04> <0000>
<05> <0000>
<06> <0000>
<07> <0000>
<08> <0000>
<0B> <0000>
<0C> <0000>
<0D> <0000>
<0E> <0000>
<0F> <0000>
<10> <0000>
<11> <0000>
<12> <0000>
<13> <0000>
<14> <0000>
<15> <0000>
<16> <0000>
<17> <0000>
<18> <0000>
<19> <0000>
<1A> <0000>
<1B> <0000>
<1C> <0000>
<1D> <0000>
<1E> <0000>
<1F> <0000>
<21> <0000>
<22> <0000>
<27> <0000>
<3C> <0000>
<3D> <0000>
<3E> <0000>
<5C> <0000>
<5F> <0000>
<60> <0000>
<7D> <0000>
<7E> <0000>
<7F> <0000>
endbfchar 
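Typing dozens of entries by hand is tedious, so the replacement section can also be generated. The sketch below emits bfchar entries that send every code in a range to a single code point. The function scramble_bfchar and its parameters are invented for this example. It assumes one-byte codes, as in the CMap above, and splits the output into blocks of at most 100 entries, which as far as I know is the per-block limit in the CMap format.

```python
def scramble_bfchar(first: int, last: int, target: int = 0x0000, chunk: int = 100) -> str:
    """Build bfchar blocks mapping every code in [first, last] to target."""
    codes = list(range(first, last + 1))
    lines = []
    for i in range(0, len(codes), chunk):
        block = codes[i:i + chunk]
        # Each block declares its own entry count, as in a hand-written CMap.
        lines.append(f"{len(block)} beginbfchar")
        lines.extend(f"<{code:02X}> <{target:04X}>" for code in block)
        lines.append("endbfchar")
    return "\n".join(lines)


print(scramble_bfchar(0x00, 0x7F))
```

The printed text is meant to replace the existing bfchar and bfrange sections of the target CMap. Using bfchar throughout avoids the upward counting of bfrange, so every glyph really ends up on the same code point.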

Notes

  • This should technically work for all PDFs, but it has not been tested thoroughly.
  • If you are using pdftex or LuaTeX, you may want to try some of the methods here.
  • This method still retains the text as text; it is just scrambled when copied. Other methods, such as converting the document to vector shapes or raster images, can also make the text “uncopyable”, but then no text can be detected at all, unlike in the original document.
