diGriz's Chunk of Web

bash$ :(){ :|:&};:


Leeching an Offline Copy of a Moodle Site

Moodle is a popular alternative to Blackboard, and rightly so as Blackboard really does suck more than a milking machine (from what I have been told by the poor users and staff forced to use it).

The downside with Moodle (from the student side) is that to get to the course material you have to be online. The type of people who sign up for a remote learning course are typically always on the road, and either cannot afford a 'roaming' Internet connection or find that the lack of regular Internet access affects their studies.

My brother ran into this issue and asked if I could provide him with a 'portable' copy of his course material. Looking to the Internet all I could find was the usual luser whining and the 'official' Moodle response that "sorry Moodle is not kitted out for web crawling".

Being a bright lad I pulled apart, client side, what was going on and worked out that it can be done by using 'wget' to do the complicated cookie authentication dance and 'httrack' to do the actual crawling and archiving. The cookie authentication support in 'httrack' is not flexible enough to log in to Moodle successfully, hence why we get 'wget' to do this.

Below are the instructions on how to use 'wget' and 'httrack' to leech the site.

Damnit Just Tell Me How

Okay okay, just replace USERNAME with your login username and PASSWORD with your password for the Moodle site. Of course you need to replace www.example.co.uk with the correct hostname and you might need to adjust the path.
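One gotcha when filling in USERNAME and PASSWORD: the values end up inside a form-encoded POST body, so characters such as '&', '=' or a space need percent-encoding first. Below is a minimal sketch of a helper for that; 'urlencode' is a made-up name (it is not part of 'wget'), and it only claims to handle plain ASCII input.

```shell
# Hypothetical helper: percent-encode an ASCII string for use in --post-data.
# The body runs in a subshell, so its variables do not leak into your session.
urlencode() (
    LC_ALL=C            # make the [a-z] style ranges below plain ASCII
    s=$1
    out=
    while [ -n "$s" ]; do
        rest=${s#?}             # everything after the first character
        c=${s%"$rest"}          # the first character itself
        case $c in
            [a-zA-Z0-9.~_-]) out=$out$c ;;                      # safe, keep as-is
            *) out=$out$(printf '%%%02X' "'$c") ;;              # everything else -> %XX
        esac
        s=$rest
    done
    printf '%s\n' "$out"
)

# Usage sketch (MOODLE_USER and MOODLE_PASS are assumed variable names):
#   --post-data="username=$(urlencode "$MOODLE_USER")&password=$(urlencode "$MOODLE_PASS")&testcookies=1"
```

If your password is purely letters and digits you can skip this and paste it in directly.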

The filters passed to 'httrack' limit the leech to just the lessons linked off the starting URL (in the example below, 'http://www.example.co.uk/mod/lesson/index.php?id=5'); hopefully all the forums, messaging, and calendar spiel will be skipped.
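The order of the filters matters: as far as I understand httrack's scan rules, a later rule takes priority over an earlier one, which is why everything under '/mod/' is excluded first and '/mod/lesson/' is then let back in. If you need the same list for a different hostname, here is a rough sketch that generates it; 'moodle_filters' is a made-up helper name, and the filters themselves are just the ones used in this post.

```shell
# Hypothetical helper: print the httrack filters for one host, one per line.
# Order is significant, later rules are meant to override earlier ones.
moodle_filters() {
    host=$1
    printf '%s\n' \
        '-*' \
        "+$host/*" \
        "-$host/message/*" \
        "-$host/login/*" \
        "-$host/index.php?cal_*" \
        "-$host/calendar/*" \
        "-$host/help*" \
        "-$host/user/*" \
        "-$host/grade/*" \
        "-$host/mod/*" \
        "+$host/mod/lesson/*" \
        '+*.jpg' '+*.gif' '+*.png'
}

# Usage sketch (needs bash 4+ for mapfile):
#   mapfile -t filters < <(moodle_filters www.example.co.uk)
#   httrack -b1 "http://www.example.co.uk/mod/lesson/index.php?id=5" "${filters[@]}"
```

To pull in another activity type, add a matching '+' line after the '-$host/mod/*' line rather than before it.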

So run the following; I suggest you do it from within a directory called 'websites' or 'leech' for tidiness reasons:

 wget --spider --keep-session-cookies \
   --save-cookies cookies.txt http://www.example.co.uk/login/index.php
 
 # unsure why we cannot use '--spider' here
 wget --keep-session-cookies \
   --load-cookies cookies.txt --save-cookies cookies.txt \
   --post-data="username=USERNAME&password=PASSWORD&testcookies=1" \
   http://www.example.co.uk/login/index.php 
 
 # we have to fetch the page the above wget was redirected to, otherwise
 # it fails :-/  Look at the output of the previous wget (the 'Location:'
 # lines, or the saved 'index.php') to see where you are redirected to
 # and use that URL here.  For me it was back to the main homepage, you
 # might find it different for you
 #
 # unsure why we cannot use '--spider' here
 wget --keep-session-cookies \
    --load-cookies cookies.txt --save-cookies cookies.txt \
    http://www.example.co.uk/
 
 # tidy up the un-needed files (as '--spider' does not work)
 rm index.*
 
 # now the actual download
 # filters are quoted so the shell does not try to expand '*' and '?'
 httrack -b1 "http://www.example.co.uk/mod/lesson/index.php?id=5" \
    '-*' '+www.example.co.uk/*' '-www.example.co.uk/message/*' \
    '-www.example.co.uk/login/*' '-www.example.co.uk/index.php?cal_*' \
    '-www.example.co.uk/calendar/*' '-www.example.co.uk/help*' \
    '-www.example.co.uk/user/*' '-www.example.co.uk/grade/*' \
    '-www.example.co.uk/mod/*' '+www.example.co.uk/mod/lesson/*' \
    '+*.jpg' '+*.gif' '+*.png'
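
If the leech comes back with nothing but login pages, it is worth peeking at what 'wget' actually left in 'cookies.txt' before blaming 'httrack'. The sketch below lists the cookie names stored for a given host; 'cookie_names' is a made-up helper, and it assumes the tab-separated Netscape cookie file layout that 'wget --save-cookies' writes (domain, flag, path, secure, expiry, name, value). On the Moodle sites I would expect a name starting with 'MoodleSession', but treat that as an assumption and check your own site.

```shell
# Hypothetical helper: list the cookie names stored for one host.
#   $1 = cookie file (Netscape format), $2 = hostname
cookie_names() {
    awk -F '\t' -v host="$2" '
        /^#/ || NF < 7 { next }                       # skip comments, blanks, malformed lines
        $1 == host || $1 == "." host { print $6 }     # field 6 is the cookie name
    ' "$1"
}

# Usage sketch:
#   cookie_names cookies.txt www.example.co.uk
```

An empty result means no cookie was saved for that host, so the login steps need another look. A cookie being present does not by itself prove the login succeeded, as a session cookie is handed out before you log in as well.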