Notice
- This does not prevent your PDF from being copied. OCR nowadays is pretty advanced.
- Back up your document before you do anything.
- The author does not encourage plagiarism in any form.
Background: What is CMap
In a PDF file with embedded fonts, the glyphs are stored in a table, usually following the encoding of the source font. To reduce the file size, some files embed only a subset of the font. For example, to store a font that is used only by the phrase THE CAT SAT ON THE MAT, the glyphs T, H, E, C, A, S, O, N, M would be enough.
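As a quick illustration of subsetting, the sketch below collects the distinct characters a phrase needs. The function name glyphs_needed is made up for this example, and it ignores spaces just as the list above does; a real subsetter works on glyphs rather than characters and may well keep a space glyph too.

```python
def glyphs_needed(text: str) -> list[str]:
    """Return the distinct non-space characters of text, in first-seen order."""
    seen: dict[str, None] = {}  # dicts keep insertion order
    for ch in text:
        if not ch.isspace():
            seen.setdefault(ch, None)
    return list(seen)


print(glyphs_needed("THE CAT SAT ON THE MAT"))
# ['T', 'H', 'E', 'C', 'A', 'S', 'O', 'N', 'M']
```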
But glyphs and code points do not always have a one-to-one relationship, especially when ligatures come into play. (Yeah, I know the fl ligature has its own code point (U+FB02), but you would surely never want to type it whenever you search for words like “fly” or “flight”.) Also, sometimes glyphs are not stored at the position of their code point, but are simply numbered 0, 1, 2, … Thus, to recognise every glyph precisely, we need a map that links each glyph to the code point(s) it represents.
Here is an example of CMap:
/CIDInit /ProcSet findresource begin 12 dict begin begincmap /CIDSystemInfo <<
/Registry (F6+0) /Ordering (T1UV) /Supplement 0 >> def
/CMapName /F6+0 def
/CMapType 2 def
1 begincodespacerange <02> <b7> endcodespacerange
19 beginbfchar
<07> <03C0>
<09> <0061>
<0a> <006D>
<0b> <0070>
<1e> <02DA>
<20> <0020>
<22> <0022>
<3d> <003D>
<3f> <003F>
<59> <0059>
<5b> <005B>
<5d> <005D>
<5f> <005F>
<7d> <007D>
<84> <2014>
<85> <2013>
<90> <2019>
<b0> <00B0>
<b7> <00B7>
endbfchar
8 beginbfrange
<24> <25> <0024>
<27> <29> <0027>
<2b> <2e> <002B>
<30> <3b> <0030>
<41> <50> <0041>
<52> <57> <0052>
<61> <7b> <0061>
<8d> <8e> <201C>
endbfrange
6 beginbfrange
<02> <02> [<0066006C>]
<03> <03> [<00540068>]
<04> <04> [<00660069>]
<05> <05> [<00660074>]
<06> <06> [<00660066>]
<08> <08> [<006600660069>]
endbfrange
endcmap CMapName currentdict /CMap defineresource pop end end
There are 2 main kinds of section in a CMap: bfchar and bfrange. Each section starts with a number giving the count of entries it contains, followed by beginbfchar or beginbfrange, and is closed by the matching endbfchar or endbfrange.
bfchar
This is usually used for one-to-one relationships. The first code is the glyph ID, and the second is the Unicode code point. For example, <07> <0035> means glyph 07 is mapped to code point 0035.
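To make the bfchar format concrete, here is a minimal Python sketch that reads such entries into a dictionary. The names BFCHAR_ENTRY and parse_bfchar are made up for this example, and it assumes every destination is a single 4-digit code point, which holds for the bfchar section shown earlier but not for CMaps in general.

```python
import re

# One bfchar entry: <glyph id> <code point>, both written in hex.
BFCHAR_ENTRY = re.compile(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]{4})>")


def parse_bfchar(block: str) -> dict[int, str]:
    """Map each glyph ID to the single character its code point stands for."""
    return {
        int(src, 16): chr(int(dst, 16))
        for src, dst in BFCHAR_ENTRY.findall(block)
    }


sample = """
<07> <03C0>
<09> <0061>
<84> <2014>
"""
table = parse_bfchar(sample)
print(table[0x09])  # a
```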
bfrange
This is used for batch mapping in order: mapping a range in the glyph table to consecutive Unicode code points. For example, <31> <39> <00F2> means:
- Map 31 to 00F2
- Map 32 to 00F3
- Map 33 to 00F4
- …
- Map 39 to 00FA
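The same expansion can be written out in a few lines of Python. This is only a sketch of the counting rule; expand_bfrange is a hypothetical helper name, and it handles only the simple form where the third value is a single code point.

```python
def expand_bfrange(first: str, last: str, start: str) -> dict[int, int]:
    """Expand one bfrange entry into an explicit glyph ID -> code point table."""
    lo, hi, base = int(first, 16), int(last, 16), int(start, 16)
    # Each glyph ID in [lo, hi] gets the start code point plus its offset from lo.
    return {code: base + (code - lo) for code in range(lo, hi + 1)}


mapping = expand_bfrange("31", "39", "00F2")
for code, code_point in mapping.items():
    print(f"Map {code:02X} to {code_point:04X}")
```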
bfrange can also assign a glyph to multiple Unicode characters, which is useful for handling ligatures. <02> <02> [<0066006C>] means that glyph 02 is mapped to the sequence (0066, 006C), i.e. the two letters “f” and “l”.
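The destination strings in such a map are UTF-16BE text, so a ligature entry can be decoded directly. A small sketch, using three of the ligature entries from the example CMap above; decode_destination is a name invented here.

```python
def decode_destination(hex_string: str) -> str:
    """Decode a CMap destination hex string as UTF-16BE text."""
    return bytes.fromhex(hex_string).decode("utf-16-be")


# Ligature entries taken from the example CMap: glyph ID -> destination string.
ligatures = {0x02: "0066006C", 0x04: "00660069", 0x08: "006600660069"}
decoded = {code: decode_destination(dst) for code, dst in ligatures.items()}

for code, text in decoded.items():
    print(f"glyph {code:02X} copies as {text!r}")
```

This is why searching for “fly” still works in such a PDF even though the page only contains a single fl glyph.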
Procedure
This walkthrough uses a PDF generated with Xe(La)TeX.
- Decompress the PDF document using qpdf:
qpdf --qdf --object-streams=disable document.pdf decompressed.pdf
- Open the file with a suitable tool. I’d recommend using a binary/hex editor.
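- Locate the CMap of the target font.
- Use bfrange and/or bfchar to scramble a range of characters, e.g. point a chosen set of characters at code point 0 (a bfrange entry counts upward from its start value, as described above, so a range lands on 0000, 0001, … rather than on 0000 alone):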
8 beginbfrange
<00> <01> <0000>
<09> <0A> <0000>
<23> <26> <0000>
<28> <3B> <0000>
<3F> <5B> <0000>
<5D> <5E> <0000>
<61> <7A> <0000>
<7B> <7C> <0000>
endbfrange
40 beginbfchar
<02> <0000>
<03> <0000>
<04> <0000>
<05> <0000>
<06> <0000>
<07> <0000>
<08> <0000>
<0B> <0000>
<0C> <0000>
<0D> <0000>
<0E> <0000>
<0F> <0000>
<10> <0000>
<11> <0000>
<12> <0000>
<13> <0000>
<14> <0000>
<15> <0000>
<16> <0000>
<17> <0000>
<18> <0000>
<19> <0000>
<1A> <0000>
<1B> <0000>
<1C> <0000>
<1D> <0000>
<1E> <0000>
<1F> <0000>
<21> <0000>
<22> <0000>
<27> <0000>
<3C> <0000>
<3D> <0000>
<3E> <0000>
<5C> <0000>
<5F> <0000>
<60> <0000>
<7D> <0000>
<7E> <0000>
<7F> <0000>
endbfchar
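Typing dozens of entries by hand is error-prone, so a block like this can be generated instead. The sketch below is not part of the original procedure: scramble_bfchar is a hypothetical helper, it assumes one-byte glyph codes as in the example, and the cap of 100 entries per block reflects a limit of the CMap format as far as I know, so treat it as an assumption. Using bfchar throughout means every listed code maps to exactly the target code point.

```python
from typing import Iterable


def scramble_bfchar(codes: Iterable[int], target: int = 0x0000,
                    max_entries: int = 100) -> str:
    """Build bfchar section(s) mapping every given glyph code to one code point."""
    unique = sorted(set(codes))
    blocks = []
    # Split into chunks so no single section exceeds max_entries (assumed limit).
    for i in range(0, len(unique), max_entries):
        chunk = unique[i:i + max_entries]
        lines = [f"{len(chunk)} beginbfchar"]
        lines += [f"<{code:02X}> <{target:04X}>" for code in chunk]
        lines.append("endbfchar")
        blocks.append("\n".join(lines))
    return "\n".join(blocks)


# Example: scramble the glyph codes 0x61 to 0x7A.
print(scramble_bfchar(range(0x61, 0x7B)))
```

The output can be pasted over the existing bfchar and bfrange sections of the target font's CMap.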
Notes
- This should be technically possible for all PDFs, but it has not been tested thoroughly.
- If you are using pdftex or LuaTeX, you may want to try some of the methods here.
- This method still retains text as text; it is just scrambled when copied. Other methods, such as converting the document to vector shapes or raster images, can also make the text “uncopyable”, but then no text can be detected directly at all, unlike in the original document.