Documents from StuDocu without watermark

I was trying to download some documents from StuDocu, namely past papers and slides. Even though you can easily get free premium for months, there’s always an annoying watermark blocking your view. I really did try to write a script to remove the watermark programmatically, but PDF is Evil. So I worked out a workaround: generate the PDF from the web page instead.

0x000: Load all the SVG pages

Since StuDocu has a pretty easy way to get premium, plus some anti-crawler measures, we can simply get a premium account, open the document, and scroll down slowly to load all the pages. From the dev tools, we can easily see that the pages are SVGs with base64-encoded @font-face fonts embedded.
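To check the embedded fonts for yourself, here is a minimal Python sketch. It assumes the fonts show up as `url(data:...;base64,...)` values inside the SVG's styles; the function name `embedded_fonts` and the sample markup are made up for illustration and are not taken from a real StuDocu page.

```python
import base64
import re

# Matches url(data:<mime>[;params];base64,<payload>) as used in @font-face rules.
DATA_URI = re.compile(
    r"""url\(\s*['"]?data:([^;,]+)(?:;[^;,]+)*;base64,([A-Za-z0-9+/=]+)"""
)


def embedded_fonts(svg_text):
    """Return (mime_type, raw_bytes) for each base64 data URI found in svg_text."""
    return [
        (mime, base64.b64decode(payload))
        for mime, payload in DATA_URI.findall(svg_text)
    ]


# Hypothetical sample: a tiny SVG with one fake "font" embedded.
sample = (
    '<svg xmlns="http://www.w3.org/2000/svg"><style>'
    "@font-face{font-family:ff1;src:url('data:application/font-woff;base64,"
    + base64.b64encode(b"not a real font").decode("ascii")
    + "')}</style></svg>"
)

for mime, data in embedded_fonts(sample):
    print(mime, len(data), "bytes")
```

If this finds nothing on a saved page, the fonts are probably linked rather than embedded, and the rest of this post may not apply as written.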

Note that if you scroll too fast, you’ll miss some pages in between. It seems that pages are loaded based on the vertical offset of your current viewport.

0x010: Convert to PDF

0x011: Short, simple documents

If it’s a simple document with a uniform, common paper size and a manageable number of pages, you can just use Chrome to print it to PDF.

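First, remove all the useless elements from the page. Drop this into your Omnibox, or run it in the DevTools console. (Chrome may strip the leading javascript: when you paste into the address bar, so you might need to retype it.)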

javascript:(function(){var a = "", x = document.getElementsByTagName("svg"); for(var i = 0; i < x.length; i++){a += x[i].outerHTML;} document.getElementsByTagName("body")[0].innerHTML = a;var a = document.getElementsByTagName("svg");for (var i = 0; i < a.length; i++){a[i].style.width="99.8%";a[i].style.height="auto";a[i].style.position="inherit";a[i].style.display="block";a[i].style.boxShadow="0 3px 3px rgba(0,0,0,0.3)";a[i].style.padding="0";}})()

Then print to PDF: choose the correct orientation and paper size, turn on Background graphics, and set Margins to None.

Simple and elegant, right?

0x012: All other documents

But things get complicated when your browser can’t handle a lengthy, heavy, or odd-sized document. If you’re a perfectionist like me and don’t mind taking a detour, this is for you.

0x0120: Dependencies

I used quite a few things for the detour.

  • Python 3
    • Beautiful Soup 4
  • PhantomJS
  • pdftk

They should all be easy to install on most *nix environments.

0x0121: How to do it

First, save the web page with all the pages loaded into a new folder, then run the following Python script on it.

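import sys

import bs4

if len(sys.argv) < 2:
    print("Usage: %s file" % sys.argv[0])
    sys.exit(1)

# Parse the saved page and collect every <svg> element (one per page).
with open(sys.argv[1], encoding="utf-8") as f:
    s = bs4.BeautifulSoup(f, "html.parser")
x = s.find_all('svg')

# Zero-pad the page numbers so the files sort in page order.
digits = len(str(len(x)))
p = "Page%0{}d.svg".format(digits)
for i in range(len(x)):
    with open(p % i, 'w', encoding="utf-8") as f:
        f.write(str(x[i]))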
$ python3 html2svg.py myhtmldoc.html
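The zero-padding in the file names matters later, when shell globs decide the page order. Here is a small sketch of the same naming logic on its own; `page_names` is a hypothetical helper, not part of the script above.

```python
def page_names(count):
    """Zero-padded file names for pages 0 .. count-1, as in html2svg.py."""
    digits = len(str(count))
    pattern = "Page%0{}d.svg".format(digits)
    return [pattern % i for i in range(count)]


names = page_names(12)
print(names[0], names[-1])    # Page00.svg Page11.svg
print(sorted(names) == names)  # True: alphabetical order is page order

# Without padding, page 10 would sort before page 2.
print(sorted(["Page%d.svg" % i for i in (2, 10)]))  # ['Page10.svg', 'Page2.svg']
```

Note that pages are numbered from 0, so a 12-page document ends at Page11.svg.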

Then get PhantomJS to convert each SVG to PDF. I chose PhantomJS because only WebKit seems to recognize the embedded @font-face fonts; tools like rsvg, Inkscape, and CairoSVG don’t do it so well. PhantomJS is also more customizable than wkhtmltopdf in terms of page size.
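The page-size part comes down to reading the width and height out of each SVG's viewBox and using them as the paper size. As a rough Python sketch of that step (the PhantomJS script does the same thing in JavaScript): `paper_size` is a hypothetical name, and the split pattern here is slightly more tolerant than the one in the script, which is my own choice.

```python
import re


def paper_size(viewbox):
    """Turn a viewBox string like '0 0 595.28 841.89' into paper dimensions."""
    # viewBox values may be separated by whitespace, commas, or both.
    parts = [p for p in re.split(r"[\s,]+", viewbox.strip()) if p]
    if len(parts) != 4:
        raise ValueError("expected 4 viewBox values, got %r" % viewbox)
    _min_x, _min_y, width, height = parts
    return {"width": width + "px", "height": height + "px", "margin": "0"}


print(paper_size("0 0 595.28 841.89"))
```

Sizing the paper to the viewBox means odd-sized pages keep their own dimensions instead of being squeezed onto A4 or Letter.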

"use strict";
var page = require('webpage').create(),
    system = require('system'),
    address, output;

if (system.args.length < 3 || system.args.length > 5) {
    console.log('Usage: svg2pdf.js input.svg output.pdf');
    phantom.exit(1);
} else {
    address = system.args[1];
    output = system.args[2];
    page.open(address, function (status) {
        if (status !== 'success') {
            console.log('Unable to load the address!');
            phantom.exit(1);
        } else {
            // Wait briefly so the page can finish rendering.
            window.setTimeout(function () {
                // Read the page dimensions from the SVG's viewBox.
                var x = page.evaluate(function () {
                    var n = document.querySelector('svg');
                    // The attribute name may have been lowercased during extraction.
                    var v = n.getAttribute("viewBox") || n.getAttribute("viewbox");
                    return v.split(/\s+|,/);
                });
                // Size the paper to match the page exactly, with no margins.
                page.paperSize = {
                    width: x[2] + "px",
                    height: x[3] + "px",
                    margin: "0"
                };
                page.render(output);
                phantom.exit();
            }, 200);
        }
    });
}
$ for i in *.svg; do phantomjs svg2pdf.js "$i" "${i%.svg}.pdf"; done
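If you would rather drive the conversion from Python than from a shell loop, something like the following should do the same job. It is only a sketch: it assumes `phantomjs` is on your PATH and that `svg2pdf.js` sits in the same folder as the SVGs, and the helper names are made up.

```python
import subprocess
from pathlib import Path


def phantomjs_commands(svg_names, script="svg2pdf.js"):
    """Build one 'phantomjs svg2pdf.js in.svg out.pdf' command per SVG file."""
    return [
        ["phantomjs", script, name, str(Path(name).with_suffix(".pdf"))]
        for name in sorted(svg_names)
    ]


def convert_all(folder="."):
    """Run the conversion for every SVG in folder, stopping on the first failure."""
    names = [p.name for p in Path(folder).glob("*.svg")]
    for cmd in phantomjs_commands(names):
        subprocess.run(cmd, cwd=folder, check=True)


if __name__ == "__main__":
    for cmd in phantomjs_commands(["Page1.svg", "Page0.svg"]):
        print(" ".join(cmd))
```

Call `convert_all()` from the folder holding the SVGs to actually run it; the `__main__` block only prints the commands it would run.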

Almost all of the generated PDFs come with blank pages after the first one, so we need to strip those out as well.

$ for i in *.pdf; do pdftk "$i" cat 1 output "${i%.pdf}.1.pdf"; done

Then join them together:

$ pdftk *.1.pdf cat output output.pdf
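The two pdftk steps can also be put together in Python. This sketch only builds the command lines, using the same pdftk operations as above. Skipping files that already end in `.1.pdf` and the merged output itself is my own addition so that a re-run doesn't pick up its earlier results, and `pdftk_commands` is a hypothetical name.

```python
from pathlib import Path


def pdftk_commands(pdf_names, merged="output.pdf"):
    """Commands that keep page 1 of each PDF, then join the results in order."""
    pages = sorted(
        n for n in pdf_names if not n.endswith(".1.pdf") and n != merged
    )
    firsts = [Path(n).stem + ".1.pdf" for n in pages]
    # Step 1: drop the trailing blank pages by keeping only page 1.
    cmds = [["pdftk", n, "cat", "1", "output", f] for n, f in zip(pages, firsts)]
    # Step 2: concatenate the single-page PDFs into one document.
    cmds.append(["pdftk"] + firsts + ["cat", "output", merged])
    return cmds


if __name__ == "__main__":
    for cmd in pdftk_commands(["Page0.pdf", "Page1.pdf"]):
        print(" ".join(cmd))
```

Each command can be passed to `subprocess.run(cmd, check=True)` once you are happy with what it prints.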

This should work pretty well. I’ve tried it on several documents, and all of them gave promising results.

Also, if anyone has worked out how to remove the watermark from the downloaded PDF programmatically, please share with us!

