- 1. 0x000: Load all the SVG pages
- 2. 0x010: Convert to PDF
Documents from StuDocu without watermark
I was trying to download some documents from StuDocu, namely past papers and slides. Even though you can easily get free premium for months, there's always an annoying watermark blocking your view. I really did try to write a script to remove the watermark programmatically, but PDF is Evil. So I worked out a workaround: generate the PDF from the webpage instead.
Since StuDocu has a pretty easy way to get premium, plus some anti-crawler measures, we can simply get a premium account, open up the document, and scroll down slowly to load all the pages. From the dev tools, we can easily see that the pages are SVG, with base64-encoded @font-face fonts embedded.
Note that if you scroll too fast, you'll miss some pages in between. It seems that pages are loaded based on the current vertical offset of your viewport.
If the document is simple, uses a uniform, common paper size, and has a manageable number of pages, you can just use Chrome to print it to PDF.
First, remove all the useless elements from the page with a short JavaScript snippet: drop it in your Omnibox, or run it in the DevTools console.
Then print to PDF: choose the correct orientation and paper size, and don't forget to turn on Background graphics and set Margins to None.
Simple and elegant, right?
But things get complicated when your browser can't cope with lengthy, heavy, or odd-sized documents. If you are a perfectionist like me and you don't mind taking a detour, this is for you.
Here is some of the stuff I used for the detour:
- Python 3
- Beautiful Soup 4
Both should be easy to install on most *nix environments.
First, save the webpage once all the pages have loaded. Save it into a new folder, then process it with a Python script that pulls each SVG page out into its own file.
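The extraction step could look roughly like the sketch below. It is not the original script: it assumes the saved HTML contains the pages as inline `<svg>` elements, and it swaps in a small regex scanner from the standard library instead of Beautiful Soup so it runs without extra installs. The function names and the `page-001.svg` naming scheme are made up for illustration.

```python
import re
from pathlib import Path

# Matches an opening <svg ...> tag or a closing </svg> tag.
SVG_TAG = re.compile(r"<(/?)svg\b[^>]*>", re.IGNORECASE)
SVG_NS = 'xmlns="http://www.w3.org/2000/svg"'


def extract_svgs(html):
    """Return every top-level inline <svg>...</svg> block, in document order."""
    pages, depth, start = [], 0, 0
    for tag in SVG_TAG.finditer(html):
        is_closing = tag.group(1) == "/"
        if not is_closing:
            if tag.group(0).endswith("/>"):
                continue  # self-closing <svg/>: does not open a block
            if depth == 0:
                start = tag.start()
            depth += 1
        elif depth > 0:
            depth -= 1
            if depth == 0:
                pages.append(html[start:tag.end()])
    return pages


def save_pages(html_path, out_dir):
    """Write each extracted SVG to out_dir/page-001.svg, page-002.svg, ..."""
    html = Path(html_path).read_text(encoding="utf-8")
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for number, svg in enumerate(extract_svgs(html), start=1):
        opening_tag = SVG_TAG.match(svg).group(0)
        if "xmlns=" not in opening_tag:
            # Inline SVG in HTML may omit the namespace; a standalone file needs it.
            svg = svg[:4] + " " + SVG_NS + svg[4:]
        target = out / f"page-{number:03d}.svg"
        target.write_text(svg, encoding="utf-8")
        written.append(target)
    return written
```

Calling `save_pages("document.html", "svg")` from the folder you saved into would then leave one SVG file per page in `svg/`. If your browser saved the pages as separate files or `<img>` references instead of inline markup, this scanner will find nothing and you'd need to adapt it.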
Then get PhantomJS to convert each SVG to PDF. The reason I chose PhantomJS is that only WebKit seems to recognize embedded `@font-face` fonts; tools like `cairosvg` don't handle them so well. PhantomJS is also more customizable than `wkhtmltopdf` in terms of page size.
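To illustrate the page-size point, here is a hedged sketch of driving PhantomJS from Python. It assumes the `rasterize.js` example script that ships with PhantomJS sits in the working directory and takes a `width*height` paper size as its third argument, and that each SVG carries a `viewBox` whose units can be treated as pixels. The helper names are hypothetical, and this is not necessarily how the original conversion was done.

```python
import re
import subprocess
from pathlib import Path


def svg_page_size(svg_text):
    """Return (width, height) from the first viewBox attribute, or None."""
    match = re.search(r'viewBox\s*=\s*["\']([^"\']+)["\']', svg_text)
    if match is None:
        return None
    parts = re.split(r"[\s,]+", match.group(1).strip())
    if len(parts) != 4:
        return None
    _min_x, _min_y, width, height = (float(p) for p in parts)
    return width, height


def rasterize_command(svg_path, pdf_path, width, height, script="rasterize.js"):
    """Build the PhantomJS command line for one page (nothing is executed)."""
    # Assumption: viewBox units are treated as CSS pixels for the paper size.
    paper = f"{width:g}px*{height:g}px"
    # PhantomJS opens URLs, so pass the local file as a file:// URL.
    url = Path(svg_path).resolve().as_uri()
    return ["phantomjs", script, url, str(pdf_path), paper]


def convert(svg_path, pdf_path, script="rasterize.js"):
    """Convert one SVG to a PDF whose paper size matches the SVG's viewBox."""
    size = svg_page_size(Path(svg_path).read_text(encoding="utf-8"))
    if size is None:
        raise ValueError(f"no usable viewBox in {svg_path}")
    command = rasterize_command(svg_path, pdf_path, *size, script=script)
    subprocess.run(command, check=True)
```

Looping `convert()` over the extracted SVG files gives one PDF per page, each sized to its own page rather than forced onto A4 or Letter.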
Almost all of the generated PDFs come with trailing blank pages, so we need to strip those out as well.
Finally, join them together into a single PDF.
This should work pretty well; I've tried it on several documents, and all of them gave promising results.
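Stripping and joining can be done in one pass. The sketch below uses `qpdf`, which is my own substitution and not necessarily the tool originally used here; it assumes each per-page PDF has its real content on page 1 and only blank pages after it. The function names are illustrative.

```python
import re
import subprocess
from pathlib import Path


def natural_key(path):
    """Sort key so that page-10.pdf comes after page-2.pdf."""
    tokens = re.split(r"(\d+)", Path(path).name)
    return [int(t) if t.isdigit() else t for t in tokens]


def merge_command(pdf_paths, output):
    """Build a qpdf command that keeps only page 1 of each input and joins them."""
    command = ["qpdf", "--empty", "--pages"]
    for pdf in sorted(pdf_paths, key=natural_key):
        command += [str(pdf), "1"]  # "1" selects the first page only
    command += ["--", str(output)]
    return command


def merge(pdf_paths, output):
    """Run qpdf to produce the final document."""
    subprocess.run(merge_command(pdf_paths, output), check=True)
```

For example, `merge(Path("pdf").glob("page-*.pdf"), "document.pdf")` would produce a single file with one page per input, in page order.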
Also, if anyone has worked out how to remove the watermark from the downloaded PDF programmatically, please share with us!