I was trying to download some documents from StuDocu, namely past papers and slides. Even though you can easily get free premium for months, the downloaded PDF always has an annoying watermark blocking your view. I really did try to write a script to remove the watermark programmatically, but PDF is Evil. So I kinda worked out a workaround: generate the PDF from the webpage instead.
0x000: Load all the SVG pages
Since StuDocu has a pretty easy way to get premium, plus some anti-crawler measures, we can simply get a premium account, open up the document, and scroll down slowly to load all the pages. From the dev tools, we can easily see that the pages are SVG, with base64-encoded @font-face fonts embedded.
Note that if you scroll too fast, you'll miss some pages in between. It seems that pages are loaded based on the current vertical offset of your viewport.
0x010: Convert to PDF
0x011: Short, simple documents
If it's a simple document with a uniform, common paper size and a manageable number of pages, you can just use Chrome to print it to PDF.
First, remove all the useless elements from the page with a small JavaScript snippet. Drop it in your Omnibox as a javascript: URL, or run it in the DevTools console.
Then print as PDF: choose the correct orientation and paper size, and don't forget to turn on Background graphics and set Margins to None.
Simple and elegant, right?
0x012: All other documents
But things get complicated when your browser can't cope with lengthy, heavy, or odd-sized documents. If you are a perfectionist like me, and you're okay with taking a detour, this is for you.
0x0120: Dependencies
I used quite a few tools for this detour.
- Python 3
- Beautiful Soup 4
- PhantomJS
- pdftk
They should all be easy to install on most *nix environments.
0x0121: How to do it
First, save the webpage once all the pages are loaded. Save it into a new folder, then process it with a Python script that splits out the individual SVG pages.
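A minimal sketch of what such an extraction step could look like. It assumes the saved HTML contains each page as an inline, non-nested `<svg>` element, which is a guess about the page structure rather than something confirmed here. It also swaps Beautiful Soup for a plain regular expression so that it runs on the standard library alone and leaves the SVG markup untouched. The function names and the `page-0001.svg` naming scheme are made up for illustration.

```python
import re
from pathlib import Path

# Matches one inline <svg>...</svg> element (assumes SVGs are not nested).
SVG_RE = re.compile(r"<svg\b.*?</svg>", re.DOTALL | re.IGNORECASE)
SVG_NS = 'xmlns="http://www.w3.org/2000/svg"'


def extract_svg_pages(html):
    """Return the markup of every inline <svg> element, in document order."""
    pages = []
    for markup in SVG_RE.findall(html):
        opening_tag = markup[: markup.index(">") + 1]
        # Inline SVG in HTML may omit the namespace, but a standalone
        # .svg file needs it, so add it when it is missing.
        if "xmlns=" not in opening_tag:
            markup = markup[:4] + " " + SVG_NS + markup[4:]
        pages.append(markup)
    return pages


def write_pages(pages, out_dir):
    """Write each page to out_dir as page-0001.svg, page-0002.svg, ..."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for number, markup in enumerate(pages, start=1):
        path = out_dir / f"page-{number:04d}.svg"
        path.write_text(markup, encoding="utf-8")
        paths.append(path)
    return paths
```

Reading the saved HTML file as UTF-8 text, passing it to `extract_svg_pages`, and handing the result to `write_pages` together with an output folder should leave one numbered SVG file per page. The zero-padded numbers keep the files in page order when sorted by name.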
Then get PhantomJS to convert each SVG to PDF. The reason I chose PhantomJS is that only WebKit can recognize the embedded @font-face fonts; tools like rsvg, inkscape, and cairosvg don't handle them so well. PhantomJS is also more customizable than wkhtmltopdf in terms of page size.
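Since page size is the deciding factor here, it helps to know each page's dimensions before rendering. Below is a rough sketch for reading them, assuming every page is a standalone, well-formed SVG file whose root element carries either a pixel `width` and `height` or a `viewBox`. The helper names are hypothetical, and real StuDocu pages may need different handling.

```python
import re
import xml.etree.ElementTree as ET

# A plain number, optionally followed by "px" (e.g. "595" or "842px").
_PIXELS = re.compile(r"^\s*([0-9]*\.?[0-9]+)\s*(px)?\s*$")


def _parse_pixels(value):
    """Return the length as a float, or None if it is missing or not in pixels."""
    if value is None:
        return None
    match = _PIXELS.match(value)
    return float(match.group(1)) if match else None


def svg_page_size(svg_text):
    """Return (width, height) of an SVG page in pixels / user units."""
    root = ET.fromstring(svg_text)
    width = _parse_pixels(root.get("width"))
    height = _parse_pixels(root.get("height"))
    if width is None or height is None:
        # Fall back to the last two viewBox numbers: min-x min-y width height.
        view_box = root.get("viewBox")
        if view_box is None:
            raise ValueError("SVG has neither pixel width/height nor a viewBox")
        parts = re.split(r"[\s,]+", view_box.strip())
        if len(parts) != 4:
            raise ValueError("unexpected viewBox: " + view_box)
        width, height = float(parts[2]), float(parts[3])
    return width, height
```

The resulting numbers could then be handed to whatever PhantomJS script does the rendering, for example as the width and height of its `paperSize` setting, so that odd-sized pages keep their proportions.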
Almost all of the generated PDFs come with trailing blank pages, so we need to strip those out as well.
Then join them together
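Both pdftk steps, stripping the blank pages and joining the results, can be driven from Python. The sketch below assumes the real content is always on the first page of each generated PDF and that the files are numbered by page; the function and file names are made up for illustration. The only pdftk operation it uses is `cat`.

```python
import re
import subprocess


def natural_key(name):
    """Sort key that orders 'page-2.pdf' before 'page-10.pdf'."""
    return [int(part) if part.isdigit() else part
            for part in re.split(r"(\d+)", name)]


def first_page_command(src, dst):
    """pdftk command that keeps only page 1 of src, dropping trailing pages."""
    return ["pdftk", src, "cat", "1", "output", dst]


def join_command(sources, dst):
    """pdftk command that concatenates the sources in natural page order."""
    return ["pdftk", *sorted(sources, key=natural_key), "cat", "output", dst]


def run(command):
    """Run one command and raise CalledProcessError if pdftk fails."""
    subprocess.run(command, check=True)
```

Calling `run(first_page_command("page-1.pdf", "clean-1.pdf"))` for each file, followed by one `run(join_command(cleaned_files, "document.pdf"))`, would produce the final document. The natural sort matters when file names are not zero-padded, since a plain sort would put page 10 before page 2.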
This should work pretty well; I've tried it on several documents and all of them gave promising results.
Also, if anyone has worked out how to remove the watermark from the downloaded PDF programmatically, please share with us!