Is it possible to scrape an HTML page with JavaScript from inside of a web browser?
- From "The 5 Top JavaScript Web Scraping Libraries in 2020": Axios is a promise-based HTTP client for the browser and Node.js. Why this library in particular? There are plenty of libraries that can be used instead of the well-known `request` package: got, superagent, node-fetch. But Axios is a suitable solution not only for Node.js but for client-side usage too.
- Web scraping at full power: with browser-automation tools we can log into sites, click, scroll, execute JavaScript, and more. Add Cloud Functions and a scheduler for deployment, because many scrapers are useless unless you deploy them.
- A very common flow that web applications use to load their data is to have JavaScript make asynchronous requests (AJAX) to an API server (typically REST or GraphQL) and receive their data back in JSON format, which then gets rendered to the screen.
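For instance, that pattern boils down to something like the following sketch (the /api/products endpoint and the product-list element are hypothetical):

```javascript
// Fetch JSON from a (hypothetical) REST endpoint and render it on the page.
fetch('https://example.com/api/products')
  .then((response) => response.json())
  .then((products) => {
    const list = document.getElementById('product-list');
    products.forEach((p) => {
      const li = document.createElement('li');
      li.textContent = p.name;
      list.appendChild(li);
    });
  })
  .catch((err) => console.error('Request failed:', err));
```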
To be perfectly honest I wasn’t sure so I decided to try it out.
Web scraping refers to the process of getting data from websites (and their databases). It may also be called data scraping, data collection, data extraction, data harvesting, data mining, etc. People can collect data manually, but the automated approach makes computers do the hard work while you are asleep or away.
Full disclaimer here, I didn’t actually succeed. However, it was a great learning experience for me and I think you guys could benefit from seeing what I did and where I went wrong. Who knows, maybe you can take what I’ve done and figure it out for yourself!
You can jump to any of these methods if you like:
- CORS
- No Referer header request
- WordPress pages' load
Let's say you're at mysite.com (in your browser) and want to run a script to load some data from example.com. The simplest way to request content via JavaScript would be through an XMLHttpRequest (XHR):
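A minimal sketch of such a request (example.com stands in for the target site):

```javascript
// Request a foreign page with XMLHttpRequest from mysite.com.
const xhr = new XMLHttpRequest();
xhr.open('GET', 'https://example.com/');           // cross-origin URL
xhr.onload = () => console.log(xhr.responseText);  // the page's HTML (if allowed)
xhr.onerror = () => console.error('Request blocked or failed');
xhr.send();
```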
However, this is a cross-origin request (as opposed to a same-origin request), and that turns out to be an unbreakable wall for requesting page content from pure JavaScript.
Same Origin Policy
The same-origin policy requires that, for web requests made from the client side (mysite.com), both the domain name and the protocol (http, https) match the page's origin. So security restrictions do not allow you to request another domain's website (example.com) or any of its resources.
Can you bypass it? No.
You might think you can just bypass this: in order to request foreign-domain resources, imitate the same domain origin. How? By forging the Referer header in the XMLHttpRequest, as in the following:
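The naive attempt would look roughly like this (a sketch; example.com is a placeholder):

```javascript
const xhr = new XMLHttpRequest();
xhr.open('GET', 'https://example.com/');
// Try to pretend the request originates from example.com itself.
// Referer is a forbidden header name, so the browser ignores or rejects this call:
xhr.setRequestHeader('Referer', 'https://example.com/');
xhr.send();
```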
This is explicitly disallowed by the XHR specification: in JavaScript you cannot set the Referer (or any other forbidden) header on an XHR.
Resources allowed to be requested from a foreign domain
Now, there are resources that are allowed to be shared cross-origin. This is used for distributed resource hosting, to reduce the load on a single server: e.g., CSS stylesheets, images, and scripts are often served from foreign-domain servers. Here are some examples of resources which may be embedded cross-origin:
- JavaScript with `<script src='...'></script>`. Error messages for syntax errors are only available for same-origin scripts.
- CSS with `<link href='...'>`.
- Images with `<img>`. Supported image formats include PNG, JPEG, GIF, BMP, SVG.
- Media files with `<video>` and `<audio>`.
- Plug-ins with `<object>`, `<embed>` and `<applet>`.
- Fonts with `@font-face`. Some browsers allow cross-origin fonts, others require same-origin fonts.
- Anything with `<frame>` and `<iframe>`. A site can use the `X-Frame-Options` header to prevent this form of cross-origin interaction.
One example is jQuery, which is often served from the ajax.googleapis.com domain.
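A typical embed looks like this (the version number is just an illustrative example):

```html
<!-- jQuery loaded cross-origin from Google's CDN -->
<script src="https://ajax.googleapis.com/ajax/libs/jquery/3.6.0/jquery.min.js"></script>
```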
But, to developers' joy, the Cross-Origin Resource Sharing policy was introduced (it became a W3C Recommendation in January 2014). It provides a controlled way around the Same-Origin Policy restriction.
Cross-Origin Resource Sharing (CORS)
The main concept is that a target server may allow some other origins (or all of them) to request its resources. A server configured to allow cross-origin requests is useful for cross-domain API access to its resources.
If a server allows CORS for any origin, it responds with the header `Access-Control-Allow-Origin: *`.

If a resource owner restricts sharing to a certain domain only, the server responds with a header naming that origin:
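For example (using our hypothetical mysite.com as the allowed origin), the response headers would contain:

```
Access-Control-Allow-Origin: https://mysite.com
```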
You can make a preflight request to find out whether a server allows foreign-domain access.
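For instance, a request that carries a non-simple header makes the browser issue the preflight automatically (a sketch, with a hypothetical endpoint and header):

```javascript
// The custom header makes this a "non-simple" request, so the browser first
// sends an OPTIONS preflight; the GET only goes out if the server answers the
// preflight with suitable Access-Control-Allow-* headers.
fetch('https://example.com/api/data', {
  headers: { 'X-Custom-Header': 'value' },
})
  .then((res) => res.text())
  .then((body) => console.log(body))
  .catch((err) => console.error('CORS preflight or request failed:', err));
```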
Read more about CORS here.
To set up access control on an Apache server (enabling CORS), see here.
CORS tester.
Wrap up
In practice, site owners allow CORS only for API access, since it's unlikely they will make their private web data cross-origin accessible. So attempts to scrape other sites' content with in-browser JavaScript have a very limited scope.
No Referer form submission
We've mentioned before that loading foreign content into an <iframe> works despite the same-origin policy.
Let's try a form submission with no Referer header. Most sites accept a request whose Referer header is empty (omitted); they do this because they don't want to lose even the small share of traffic that arrives without one. So we make a simple procedure that, for a chosen domain, issues the request through a virtual form submission:
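The original snippet isn't reproduced here, but the idea looks roughly like this (a sketch, not the author's exact code):

```javascript
// Rough sketch: submit a virtual form into a freshly created <iframe>,
// so the foreign page loads inside it. To actually suppress the Referer
// header, the page would also need something like
// <meta name="referrer" content="no-referrer"> (or an equivalent trick).
function loadIntoIframe(url) {
  const iframe = document.createElement('iframe');
  iframe.name = 'scrape_frame';
  document.body.appendChild(iframe);

  const form = document.createElement('form');
  form.method = 'GET';
  form.action = url;            // the foreign page we want to load
  form.target = 'scrape_frame'; // submit into the iframe, not the current tab
  document.body.appendChild(form);
  form.submit();
}

loadIntoIframe('https://example.com/');
```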
This code, when called client-side, adds a new <iframe> to the web page and loads the needed resource into it. The whole code is here.
See the following web sniffer screenshot, showing the Origin header being null and no Referer header present.
So you can basically load a foreign page into your browser page with JavaScript. But the Same-Origin Policy, enforced in all major browsers, still forbids access to the fetched HTML: cross-origin content cannot be read by JavaScript, and no major browser will allow it, as a safeguard against XSS attacks. Surprisingly, you can view the cross-site response's HTML in the browser's developer tools (F12, an example of using them) and manually copy/paste it:
The loaded site will work seamlessly in an iframe, yet you can't get access to its HTML. You can capture the page as a screenshot image, but that's not sufficient for full-scale web scraping.
How WordPress loads foreign page shots into its admin panel
The WordPress CMS can load foreign resources with a server-side call (if you have access to wp-admin, just visit <site>/wp-admin/edit-comments.php). When the user hovers over a comment's website link, the CMS's JavaScript automatically sends a request to the WordPress home server:
The CMS thus makes an HTTP request to its own server, embedding the link to the foreign resource. The WordPress server then requests the resource at the provided link and returns its content:
The only catch is that the content returned by WordPress is an image (content-type: image/jpeg). You could program a server to return HTML code instead, but that is server-side data extraction.
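For completeness, a minimal sketch of that indirect, server-side approach (hypothetical; it uses Express and Node 18+'s built-in fetch, and is not WordPress's actual code):

```javascript
const express = require('express');
const app = express();

// The browser asks *our* server; the server fetches the foreign page
// and returns its HTML. This is server-side extraction, not in-browser scraping.
app.get('/proxy', async (req, res) => {
  try {
    const response = await fetch(req.query.url); // e.g. /proxy?url=https://example.com/
    const html = await response.text();
    res.type('text/html').send(html);
  } catch (err) {
    res.status(502).send('Failed to fetch: ' + err.message);
  }
});

app.listen(3000, () => console.log('Proxy listening on port 3000'));
```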
Conclusion
Client-side scraping with JavaScript (from your browser) is not practical today. (1) Browser capabilities are far more limited than those of web servers (speed, memory, etc.). (2) The Same-Origin Policy safeguards sites from cross-origin requests to avert the threat of XSS attacks, and CORS is applicable only in a limited scope. (3) I also tried to emulate a cross-domain HTTP request with a virtual form submission that loads the result into an iframe, but this failed, since browser restrictions forbid scripts from reading the raw response HTML, again because of the XSS threat. (4) The last option is indirect requesting through a domain server (mysite.com, which actually does the extraction), e.g. WordPress loading previews of foreign pages.
Feel free to add more to this topic (using comments).
In this article, I'd like to list some of the most popular JavaScript open-source projects that can be useful for web scraping. The list consists of both libraries and standalone niche scrapers that target a particular site (Amazon, iTunes, Instagram, Google Play, etc.).
Awesome Open Source JavaScript Projects for Web Scraping
HTTP interaction
- Axios: Promise-based HTTP client for the browser and Node.js. (See the usage sketch after this list.)
  Features: XMLHttpRequests from the browser, HTTP requests from Node.js, Promise API, intercepting of request and response, transforming of request and response, automatic transforming for JSON data.
- Got: Human-friendly and powerful HTTP request library for Node.js.
  Features: HTTP/2 support, Promise API, Stream API, Pagination API, cookies (out of the box), progress events.
- Superagent: Small progressive client-side HTTP request library, and Node.js module with the same API, supporting many high-level HTTP client features.
  Features: HTTP/2 support, Promise API, Stream API, request cancelation, follows redirects, retries on failure, progress events.
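For example, a minimal Axios sketch for downloading a page's HTML in Node.js (the URL is just a placeholder):

```javascript
const axios = require('axios');

axios.get('https://example.com/')
  .then((response) => {
    console.log(response.status);      // e.g. 200
    console.log(response.data.length); // length of the returned HTML string
  })
  .catch((err) => console.error('Request failed:', err.message));
```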
DOM manipulation and HTML parsing
- Cheerio: Fast, flexible & lean implementation of core jQuery designed specifically for the server. (See the sketch after this list.)
  Cheerio parses markup and provides an API for traversing/manipulating the resulting data structure. It does not interpret the result as a web browser does. Specifically, it does not produce a visual rendering, apply CSS, load external resources, or execute JavaScript.
- jsdom: `jsdom` is a pure-JavaScript implementation of many web standards, notably the WHATWG DOM and HTML Standards, for use with Node.js. In general, the goal of the project is to emulate enough of a subset of a web browser to be useful for testing and scraping real-world web applications.
- htmlparser2: A forgiving HTML/XML/RSS parser. The parser can handle streams and provides a callback interface. This module started as a fork of the `htmlparser` module. The main difference is that `htmlparser2` is intended to be used only with Node.js (it runs on other platforms using `browserify`). `htmlparser2` was rewritten multiple times and, while it maintains an API that's compatible with htmlparser in most cases, the projects don't share any code anymore.
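For example, a minimal Cheerio sketch (the HTML is inlined for illustration):

```javascript
const cheerio = require('cheerio');

const html = '<ul><li><a href="/a">First</a></li><li><a href="/b">Second</a></li></ul>';
const $ = cheerio.load(html);

// Traverse the parsed document with jQuery-like selectors.
$('li a').each((i, el) => {
  console.log($(el).text(), $(el).attr('href'));
});
```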
JavaScript execution and rendering
- Puppeteer: Puppeteer is a Node.js library that provides a high-level API to control Chrome or Chromium over the DevTools Protocol. Puppeteer runs headless by default, but can be configured to run full (non-headless) Chrome or Chromium. (See the sketch after this list.)
- Awesome resources for Puppeteer: https://github.com/transitive-bullshit/awesome-puppeteer
- Selenium: Selenium is an umbrella project encapsulating a variety of tools and libraries enabling web browser automation. Selenium specifically provides an infrastructure for the W3C WebDriver specification — a platform and language-neutral coding interface compatible with all major web browsers.
- Playwright: Playwright is a Node library to automate Chromium, Firefox, and WebKit with a single API. Playwright is built to enable cross-browser web automation that is ever-green, capable, reliable, and fast.
- PhantomJS: PhantomJS (phantomjs.org) is a headless WebKit scriptable with JavaScript. Fast and native implementation of web standards: DOM, CSS, JavaScript, Canvas, and SVG. No emulation!
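For example, a minimal Puppeteer sketch that renders a page and reads the resulting HTML (the URL is a placeholder):

```javascript
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch(); // headless by default
  const page = await browser.newPage();
  await page.goto('https://example.com/', { waitUntil: 'networkidle2' });
  const html = await page.content();        // HTML after JavaScript has run
  console.log(html.length);
  await browser.close();
})();
```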
Resource scrapers
- amazon-scraper: Useful tool to scrape product information from Amazon.
- app-store-scraper: Node.js module to scrape application data from the iTunes/Mac App Store.
- instagram-scraper: Since Instagram has removed the option to load public data through its API, this actor should help replace this functionality.
- google-play-scraper: Node.js module to scrape application data from the Google Play store.
- scrapedin: Scraper for LinkedIn full profile data. Unlike other scrapers, it's working in 2020 with their new website.
- tiktok-scraper: Scrape and download useful information from TikTok.
And these are only the most interesting ones. Feel free to browse GitHub to find the one that fits you best!
Conclusion
JavaScript is not as popular a programming language for web scrapers as Python, but the community is growing and this list will definitely get bigger over time.
Also, our scraping API is language-agnostic, so you can check it out even if you're not very familiar with JS or Python.