JavaScript Screen Scraper / Website Downloader

Recently I needed to download the HTML source code from 9,640 pages behind a login. At first I used PHP and cURL, which normally works just fine. With cURL, I first requested the login page, passing my login info. Then, in the same script, I requested the pages I wanted to download and save to my local machine. That didn’t work, so I inspected the request headers and set cURL’s header options to match them exactly. That didn’t work either, and all I got back was a 301 Moved Permanently response. It seemed like I needed a way to make the remote server think I was no different from a regular user browsing in a web browser. So, I tried JavaScript. I logged into the website, fired up Firebug, and pasted the following code into the console tab’s command line:

[cc lang="js"]
// busy-wait for the given number of milliseconds (this blocks the browser)
function wait(msecs) {
    var start = new Date().getTime();
    var cur = start;
    while (cur - start < msecs) {
        cur = new Date().getTime();
    }
}

// make an array containing all URLs to GET
var urls = [
    "http://somedomain.com/id=1",
    "http://somedomain.com/id=2",
    "http://somedomain.com/id=3",
    "http://somedomain.com/id=4"
];

// loop over each URL
jQuery.each(urls, function(index, theurl) {
    jQuery.get(theurl, function(data) {
        // pull the numeric id out of the URL to use as the file name
        var myRegexp = /id=([0-9]+)/;
        var match = myRegexp.exec(theurl);

        // post response to local PHP script
        jQuery.post("http://localhost/savedata.php", { data: data, filename: match[1] + ".html" });

        // pick a random number between 30 and 90 seconds (to simulate a human user :)
        var time = Math.floor(Math.random() * (90 - 30 + 1) + 30);
        wait(time * 1000);
    });
});
[/cc]

This script loops through a list of URLs, GETs each one, and then POSTs the response (the HTML source code) to a local PHP script, which writes it to a file. JavaScript running in a web page can’t write directly to the local file system for security reasons, which is why the saving is handed off to the local script.
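The local `savedata.php` script isn't shown here. As a rough stand-in written in JavaScript rather than PHP, the receiving side could look something like the Node.js sketch below. Everything in it is an assumption for illustration: the names `parseSavePost` and `startServer`, the output directory, and the port are all hypothetical. It relies only on the fact that `jQuery.post` sends the `data` and `filename` fields as a form-encoded body by default.

```javascript
const http = require("http");
const fs = require("fs");
const path = require("path");

// Parse a form-encoded POST body into a safe file name and the page contents.
// (Hypothetical helper; the field names match the ones the scraper posts.)
function parseSavePost(body) {
  const params = new URLSearchParams(body);
  // basename() drops any directory components, so a crafted file name
  // cannot point outside the output directory.
  const filename = path.basename(params.get("filename") || "");
  const data = params.get("data") || "";
  return { filename, data };
}

// Hypothetical receiver: writes each posted page into outDir.
function startServer(outDir, port) {
  return http.createServer((req, res) => {
    let body = "";
    req.setEncoding("utf8");
    req.on("data", (chunk) => { body += chunk; });
    req.on("end", () => {
      const { filename, data } = parseSavePost(body);
      if (req.method !== "POST" || !filename) {
        res.writeHead(400);
        res.end();
        return;
      }
      fs.writeFile(path.join(outDir, filename), data, (err) => {
        res.writeHead(err ? 500 : 200);
        res.end();
      });
    });
  }).listen(port);
}
```

With something like `startServer("./pages", 8080)` running, the `jQuery.post` call in the scraper would need to point at that port instead of at `savedata.php`. The directory has to exist beforehand; this sketch doesn't create it.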

I ran this script right before I went to sleep, and when I woke up I had a bunch of data stored locally, ready for some quick post-processing.

Note: when I first ran the script in Firefox, I got an “unresponsive script” warning. You can change the timeout to prevent this by following the instructions at

http://support.mozilla.org/en-US/kb/Warning%20Unresponsive%20script

I changed my timeout to 999999 and that worked.
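The warning shows up because `wait()` spins in a loop on the browser's main thread, so the page can't do anything else while it waits. I didn't do it this way, but one alternative that should avoid the warning is to fetch the URLs one at a time and schedule the next request with `setTimeout` instead of blocking. Here's a minimal sketch of that idea. The function names (`randomDelaySeconds`, `filenameFor`, `scrapeSequentially`) are made up for illustration, and it assumes a jQuery version whose `jQuery.get` returns an object with `.done()` and `.always()` (1.6 or later).

```javascript
// Random whole number of seconds between min and max, inclusive.
function randomDelaySeconds(min, max) {
  return Math.floor(Math.random() * (max - min + 1)) + min;
}

// Build the output file name from the id in the URL; null if there is no id.
function filenameFor(url) {
  var match = /id=([0-9]+)/.exec(url);
  return match ? match[1] + ".html" : null;
}

// Fetch the URLs one after another, pausing between requests without
// blocking the page. `$` is jQuery; `saveUrl` is the local save script.
function scrapeSequentially($, urls, saveUrl, minSecs, maxSecs) {
  var index = 0;

  function next() {
    if (index >= urls.length) {
      return; // all done
    }
    var url = urls[index++];
    $.get(url)
      .done(function (data) {
        var filename = filenameFor(url);
        if (filename) {
          $.post(saveUrl, { data: data, filename: filename });
        }
      })
      .always(function () {
        // Schedule the next request whether this one succeeded or failed.
        setTimeout(next, randomDelaySeconds(minSecs, maxSecs) * 1000);
      });
  }

  next();
}
```

In the console it would be called as `scrapeSequentially(jQuery, urls, "http://localhost/savedata.php", 30, 90);`. Unlike the original loop, this sends only one request at a time, so the pauses actually space out the requests to the remote server.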