Some Options for Scraping Web Pages

Node.js + jQuery

This tutorial shows how to use Node.js with the Express web framework, the Jade template engine, and the JSDOM parser. You may prefer the Handlebars template engine. The tutorial also uses the Node “request” and “url” modules.

Node.js + Cheerio

If performance is an issue, you can replace JSDOM and jQuery with Cheerio, an implementation of core jQuery designed for the server. Cheerio is reported to be 16x faster than JSDOM and can handle complex websites.

Node.js + Nightmare JS

Nightmare JS is similar to PhantomJS but it is based on Electron.

The following example shows how to extract the H1 text and all LI (list) HTML tags and write them to a file.

If you’d like to scrape a bunch of web pages, you can run Nightmare in a loop using the Async library to process URLs in batches and to ensure correct processing.
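The batching idea can be sketched as follows. Note that this version uses plain promises instead of the Async library, and scrapePage is a hypothetical stand-in for whatever Nightmare call you use per URL; the URLs and batch size are placeholders too.

```javascript
// Process URLs in fixed-size batches: all URLs in a batch run concurrently,
// and the next batch starts only after the current one has finished.
async function scrapeInBatches(urls, batchSize, scrapePage) {
  const results = [];
  for (let i = 0; i < urls.length; i += batchSize) {
    const batch = urls.slice(i, i + batchSize);
    const batchResults = await Promise.all(batch.map((url) => scrapePage(url)));
    results.push(...batchResults);
  }
  return results;
}

// Stub scraper so the sketch runs without a browser;
// replace it with real scraping code.
const fakeScrape = async (url) => ({ url, title: `Title of ${url}` });

const urls = [
  'https://example.com/a',
  'https://example.com/b',
  'https://example.com/c',
];

const done = scrapeInBatches(urls, 2, fakeScrape);
done.then((pages) => console.log(pages.length, 'pages scraped'));
```

Results come back in the same order as the input URLs, and a failure in any page rejects the whole run, so you may want a try/catch inside the per-page function if partial results are acceptable.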

Saving to File

In all three cases, if you need to save your scraped data to a file, you can use Node’s built-in fs module, as shown in the Nightmare example above.

Saving to a File in a Non-existent Path

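If the directory you want to write the file to doesn’t exist, fs.createWriteStream will fail with an ENOENT error, because it does not create missing directories. The path must exist. In that case, you can create the path dynamically before opening the stream, as follows.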
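One way to do this, sketched here with only built-in modules, is to call fs.mkdirSync with the recursive option before opening the stream. The helper name createWriteStreamWithDirs and the output path are made up for illustration, and the recursive option assumes Node.js 10.12 or later.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Create any missing parent directories, then open the write stream.
// With `recursive: true`, mkdirSync does nothing if the path already exists.
function createWriteStreamWithDirs(filePath) {
  fs.mkdirSync(path.dirname(filePath), { recursive: true });
  return fs.createWriteStream(filePath);
}

// Hypothetical nested output path under the system temp directory.
const outFile = path.join(os.tmpdir(), 'scraper-demo', 'nested', 'dir', 'output.txt');

const stream = createWriteStreamWithDirs(outFile);
stream.write('scraped data\n');
stream.end();
```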

Browser Scope

Note that variables defined outside the “evaluate()” function are NOT accessible inside it, because the code in evaluate() runs in the browser scope, not in Node. Whatever code you can run in the browser console can run in the evaluate() function.
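Nightmare's evaluate() accepts extra arguments after the function, which is the usual way to get a Node value into the page. The sketch below imitates the scope boundary with Node's built-in vm module so that it runs without a browser. fakeEvaluate is a hypothetical stand-in, not Nightmare's actual implementation.

```javascript
const vm = require('vm');

// Imitates evaluate() conceptually: the function is turned into source text
// and run in a separate context, so closures over Node variables are lost.
// Only the explicitly passed (JSON-serializable) arguments travel along.
function fakeEvaluate(fn, ...args) {
  const source = `(${fn.toString()}).apply(null, ${JSON.stringify(args)})`;
  return vm.runInNewContext(source, {});
}

const selector = 'h1';

// Fails: `selector` lives in Node, not in the isolated context.
let errorName = null;
try {
  fakeEvaluate(() => selector.toUpperCase());
} catch (err) {
  errorName = err.name;
}

// Works: the value is handed over as an argument.
const passed = fakeEvaluate((sel) => sel.toUpperCase(), selector);

console.log(errorName, passed);
```

With Nightmare itself the same pattern would read evaluate((sel) => document.querySelector(sel).innerText, selector).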

Note: Headless Chrome and Puppeteer are also options.