Obfuscate PDF Text: Scramble Copied Text with Crafted CMap

2017-08-29

—

Notice

This does not prevent your PDF from being copied; OCR is pretty advanced nowadays.
Back up your document before you do anything.
The author does not encourage plagiarism in any form.

Background: What is CMap

In a PDF file with embedded fonts, the glyphs are stored in a table, usually following the encoding of the source font. To reduce file size, some files embed only a subset of the font. For example, to store a font used by the phrase THE CAT SAT ON THE MAT, the glyphs T, H, E, C, A, S, O, N and M would be enough.
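
As a quick illustration of subsetting, here is a small Python sketch that lists the distinct glyphs the phrase needs. It skips the space to match the list above; a real subset may well carry a space glyph too, so treat that as an assumption of the sketch.

```python
phrase = "THE CAT SAT ON THE MAT"

# Collect each distinct character in order of first appearance,
# skipping spaces to match the list in the text.
needed = list(dict.fromkeys(ch for ch in phrase if ch != " "))

print(needed)       # ['T', 'H', 'E', 'C', 'A', 'S', 'O', 'N', 'M']
print(len(needed))  # 9
```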

But glyphs and code points do not always have a one-to-one relationship, especially when ligatures come into play. (Yes, I know ﬂ has its own code point (U+FB02), but you would never want to type it whenever you search for words like “fly” or “flight”.) Also, sometimes glyphs are not stored at the position of their code point, but are simply numbered 0, 1, 2, … Thus, to recognise every glyph precisely, we need a map that links each glyph to the code point(s) it represents.

Here is an example of a CMap:

/CIDInit /ProcSet findresource begin 12 dict begin begincmap /CIDSystemInfo <<
/Registry (F6+0) /Ordering (T1UV) /Supplement 0 >> def
/CMapName /F6+0 def
/CMapType 2 def
1 begincodespacerange <02> <b7> endcodespacerange
19 beginbfchar
<07> <03C0>
<09> <0061>
<0a> <006D>
<0b> <0070>
<1e> <02DA>
<20> <0020>
<22> <0022>
<3d> <003D>
<3f> <003F>
<59> <0059>
<5b> <005B>
<5d> <005D>
<5f> <005F>
<7d> <007D>
<84> <2014>
<85> <2013>
<90> <2019>
<b0> <00B0>
<b7> <00B7>
endbfchar
8 beginbfrange
<24> <25> <0024>
<27> <29> <0027>
<2b> <2e> <002B>
<30> <3b> <0030>
<41> <50> <0041>
<52> <57> <0052>
<61> <7b> <0061>
<8d> <8e> <201C>
endbfrange
6 beginbfrange
<02> <02> [<0066006C>]
<03> <03> [<00540068>]
<04> <04> [<00660069>]
<05> <05> [<00660074>]
<06> <06> [<00660066>]
<08> <08> [<006600660069>]
endbfrange
endcmap CMapName currentdict /CMap defineresource pop end end

There are two main kinds of section in a CMap, bfchar and bfrange. Each one opens with the number of entries it contains, followed by beginbfchar or beginbfrange, and closes with endbfchar or endbfrange.
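
To make the section layout concrete, here is a minimal Python sketch that pulls the bfchar and bfrange sections out of a CMap and checks the declared count against the number of entries. The fragment is a shortened version of the example above, and the helper name `sections` is made up for this sketch. It assumes one entry per line, which holds for the example but is not guaranteed for every PDF.

```python
import re

# A shortened CMap fragment in the same shape as the example above.
CMAP = """\
2 beginbfchar
<07> <03C0>
<20> <0020>
endbfchar
1 beginbfrange
<41> <50> <0041>
endbfrange
"""

# Matches "<count> beginbfchar ... endbfchar" and the bfrange equivalent.
SECTION = re.compile(r"(\d+)\s+begin(bfchar|bfrange)\s+(.*?)\s*end\2", re.S)

def sections(cmap_text):
    """Yield (kind, declared_count, entry_lines) for each section."""
    for m in SECTION.finditer(cmap_text):
        lines = [ln.strip() for ln in m.group(3).splitlines() if ln.strip()]
        yield m.group(2), int(m.group(1)), lines

for kind, declared, lines in sections(CMAP):
    # Prints "bfchar 2 True", then "bfrange 1 True".
    print(kind, declared, len(lines) == declared)
```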

`bfchar`

This is usually used for one-to-one relationships. The first code is the glyph ID, and the second is the Unicode code point. For example, <07> <0035> means glyph 07 is mapped to code point U+0035.
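
A bfchar body can be read into a lookup table with a few lines of Python. This is only a sketch: `parse_bfchar` is a name invented here, and it relies on the destination being hex-encoded UTF-16BE, which is how ToUnicode CMaps store their targets.

```python
import re

# One bfchar entry: <source code> <destination string>, both in hex.
BFCHAR_ENTRY = re.compile(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>")

def parse_bfchar(body):
    """Map each glyph code (as an int) to the text it stands for."""
    mapping = {}
    for src, dst in BFCHAR_ENTRY.findall(body):
        # The destination is hex-encoded UTF-16BE.
        mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return mapping

table = parse_bfchar("<07> <0035>\n<3f> <003F>")
print(table)  # {7: '5', 63: '?'}
```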

`bfrange`

This is used for mapping in batches: a range in the glyph table is mapped, in order, to consecutive Unicode code points. For example, <31> <39> <00F2> means:

Map 31 to 00F2
Map 32 to 00F3
Map 33 to 00F4
…
Map 39 to 00FA

bfrange can also assign a glyph to multiple Unicode characters, which is useful for ligatures. <02> <02> [<0066006C>] maps glyph 02 to the sequence (0066, 006C), i.e. “fl”.
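
Both bfrange forms can be expanded with a short Python sketch. `parse_bfrange` and `utf16` are names invented here. The sketch assumes each array sits on one line and that the starting destination of the first form is a single code point, so adding the offset is enough.

```python
import re

HEX = r"<([0-9A-Fa-f]+)>"
# Form 1: <lo> <hi> <start>      Form 2: <lo> <hi> [<dst> <dst> ...]
RANGE_ENTRY = re.compile(HEX + r"\s*" + HEX + r"\s*(?:" + HEX + r"|\[(.*?)\])")

def utf16(hex_str):
    """Decode a hex-encoded UTF-16BE string."""
    return bytes.fromhex(hex_str).decode("utf-16-be")

def parse_bfrange(body):
    """Map each glyph code in every range to the text it stands for."""
    mapping = {}
    for lo, hi, start, array in RANGE_ENTRY.findall(body):
        lo, hi = int(lo, 16), int(hi, 16)
        if start:
            # Consecutive codes map to consecutive code points.
            base = int(start, 16)
            for offset in range(hi - lo + 1):
                mapping[lo + offset] = chr(base + offset)
        else:
            # One explicit destination string per code in the range.
            for offset, dst in enumerate(re.findall(HEX, array)):
                mapping[lo + offset] = utf16(dst)
    return mapping

table = parse_bfrange("<31> <39> <00F2>\n<02> <02> [<0066006C>]")
print(hex(ord(table[0x39])))  # 0xfa
print(table[0x02])            # fl
```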

Procedure

Here we use a PDF generated with Xe(La)TeX as an example.

Decompress the PDF document using qpdf:
qpdf --qdf --object-streams=disable document.pdf decompressed.pdf
Open the file with a suitable tool. I’d recommend using a binary/hex editor.
Locate the CMap of the target font.
Use bfrange and/or bfchar to scramble a range of characters, e.g. map a chosen set of characters to code point 0 (a bfrange entry counts up from its starting code point, so a range lands on U+0000 and the code points right after it):

8 beginbfrange
<00> <01> <0000>
<09> <0A> <0000>
<23> <26> <0000>
<28> <3B> <0000>
<3F> <5B> <0000>
<5D> <5E> <0000>
<61> <7A> <0000>
<7B> <7C> <0000>
endbfrange
40 beginbfchar
<02> <0000>
<03> <0000>
<04> <0000>
<05> <0000>
<06> <0000>
<07> <0000>
<08> <0000>
<0B> <0000>
<0C> <0000>
<0D> <0000>
<0E> <0000>
<0F> <0000>
<10> <0000>
<11> <0000>
<12> <0000>
<13> <0000>
<14> <0000>
<15> <0000>
<16> <0000>
<17> <0000>
<18> <0000>
<19> <0000>
<1A> <0000>
<1B> <0000>
<1C> <0000>
<1D> <0000>
<1E> <0000>
<1F> <0000>
<21> <0000>
<22> <0000>
<27> <0000>
<3C> <0000>
<3D> <0000>
<3E> <0000>
<5C> <0000>
<5F> <0000>
<60> <0000>
<7D> <0000>
<7E> <0000>
<7F> <0000>
endbfchar
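
Typing these entries by hand gets tedious, so here is a Python sketch that generates them. It is not the exact listing above: it emits only bfchar entries, so every code maps to exactly 0000, and it splits the output into sections of at most 100 entries, which as far as I know is the limit for one begin/end block. The function name, the one-byte code width and the chunk size are assumptions of this sketch.

```python
def scramble_sections(codes, target="0000", chunk=100):
    """Return bfchar sections mapping every code in `codes` to `target`."""
    codes = sorted(set(codes))
    out = []
    for i in range(0, len(codes), chunk):
        part = codes[i:i + chunk]
        out.append(f"{len(part)} beginbfchar")
        # Assumes one-byte codes, as in the codespace range of the example.
        out.extend(f"<{c:02X}> <{target}>" for c in part)
        out.append("endbfchar")
    return "\n".join(out)

# Every one-byte code from 00 to 7F except 20 (space), as in the listing above.
codes = [c for c in range(0x80) if c != 0x20]
text = scramble_sections(codes)
print(text.splitlines()[0])  # 100 beginbfchar
```

Paste the result in place of the existing bfchar/bfrange sections of the target CMap. If the edit changes the length of the stream, the fix-qdf tool that ships with qpdf is meant to repair the lengths and offsets of a hand-edited QDF file.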

Notes

This should be technically possible for all PDFs, but has not been tested thoroughly.
If you are using pdftex or LuaTeX, you may want to try some of the methods here.
This method retains the text as text; it is just scrambled when copied. Other methods, such as converting the document to vector shapes or raster images, can also make the text “uncopyable”, but then no text can be detected in the document at all.
