Scraping in JavaScript using Cheerio in Loklak
FOSSASIA recently started a new project, loklak_scraper_js. Its objective is to develop a single web-scraping library that can be used easily on most platforms, because maintaining the same scraping logic in different programming languages and projects is a headache and a waste of time. An obvious solution was to write the scrapers in JavaScript: JS is lightweight and fast, and its functions and classes can easily be used from many other programming languages, e.g. through Nashorn in Java. Cheerio is a library that is used to parse HTML. Let’s look at the YouTube scraper.

Parsing HTML

Steps involved in web scraping:

1. The HTML source of the webpage is obtained.
2. The HTML source is parsed.
3. The parsed HTML is traversed to extract the required data.

For the 2nd and 3rd steps we use cheerio.

Obtaining the HTML source of a webpage is a piece of cake and is done by the function getHtml, which uses the sync-request library to send the “GET” request.

The HTML is parsed by passing the obtained source of the webpage to cheerio's load method, as in the getSearchMatchVideos function:

    var $ = cheerio.load(htmlSourceOfWebpage);

Since the API of cheerio is similar to that of jQuery, by convention the variable that references the cheerio object holding the parsed HTML is named “$”.

Sometimes the requirement may be to extract data from one particular HTML tag (a tag that contains a large number of nested child tags) rather than from the whole parsed HTML. In that case the load method can be used again, as in the getVideoDetails function, to obtain only the head tag:

    var head = cheerio.load($("head").html());

The “html” method returns the HTML content of the selected tag, i.e. the <head> tag. If a parameter is passed to the html method, the content of the selected tag (here <head>) is replaced by the HTML given in that parameter.

Extracting data from parsed HTML

Some of the content that we see on a webpage is dynamic; it is not part of the static HTML.
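To make the difference between rendered and static HTML concrete, here is a minimal sketch, not taken from loklak_scraper_js, that counts how many elements each selector matches in a static HTML string. The helper countMatches, the function demoStaticCheck, the markup, and both class names are hypothetical; the demo assumes the cheerio package is installed.

```javascript
// Sketch only. The markup and class names are made up; demoStaticCheck()
// assumes the cheerio package is installed (npm install cheerio).

// For each candidate selector, record how many elements it matches.
// `$` is a cheerio object as returned by cheerio.load.
function countMatches($, selectors) {
  var counts = {};
  selectors.forEach(function (selector) {
    counts[selector] = $(selector).length;
  });
  return counts;
}

// Hypothetical demo: the string stands in for what getHtml would return.
function demoStaticCheck() {
  var cheerio = require("cheerio");
  var staticHtml =
    '<ul><li><a class="static-link" href="/watch?v=abc">Video</a></li></ul>';
  var $ = cheerio.load(staticHtml);
  // "rendered-link" stands for a class seen only in the browser's inspector.
  return countMatches($, ["a.rendered-link", "a.static-link"]);
}
```

A selector that reports zero matches here is one that exists only in the rendered page, so the scraper would need a selector based on the static markup instead.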
When a “GET” request is sent, only the static HTML of the webpage is obtained. Using Inspect Element in the browser, it can be seen that the class attribute has a different value on the live webpage than in the static HTML we obtain from the “GET” request using the getHtml function. For example, inspecting the link of one of the suggested videos shows different values of the class attribute:

In website (for better view):

In static HTML, obtained from “GET” request using getHtml function (for better view):

So, it is recommended to first check whether the attributes have the same values, and then proceed accordingly.

Now, let’s dive into the actual scraping. Most of the required data is available in meta tags inside the head tag. The extractMetaAttribute function extracts the value of the content attribute of a meta tag, selected by another attribute and its value.

    function extractMetaAttribute(cheerioObject, metaAttribute, metaAttributeValue) {
      var selector = 'meta[' + metaAttribute + '="' + metaAttributeValue + '"]';
      return cheerioObject(selector).attr("content");
    }

“cheerioObject” here will be the “head”…
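As a rough illustration of how the pieces fit together, the sketch below parses a page, re-parses only its head content, and reads meta values with extractMetaAttribute. The function demoMetaExtraction, the sample HTML, and its meta values are made up for this example and are not from loklak_scraper_js; the demo assumes the cheerio package is installed.

```javascript
// Sketch only. demoMetaExtraction() assumes the cheerio package is
// installed (npm install cheerio); the sample HTML is made up.

// Build an attribute selector for a <meta> tag and return the value of
// its "content" attribute (undefined if no such tag exists).
function extractMetaAttribute(cheerioObject, metaAttribute, metaAttributeValue) {
  var selector = 'meta[' + metaAttribute + '="' + metaAttributeValue + '"]';
  return cheerioObject(selector).attr("content");
}

// Hypothetical demo: parse a page, re-parse only its <head>, read metas.
function demoMetaExtraction() {
  var cheerio = require("cheerio");
  var page =
    '<html><head>' +
    '<meta property="og:title" content="Sample video">' +
    '<meta name="description" content="A made-up description">' +
    '</head><body><p>Hello</p></body></html>';

  var $ = cheerio.load(page);                 // parse the whole page
  var head = cheerio.load($("head").html());  // parse only the <head> content

  return {
    title: extractMetaAttribute(head, "property", "og:title"),
    description: extractMetaAttribute(head, "name", "description"),
    missing: extractMetaAttribute(head, "name", "keywords") // no such tag
  };
}
```

Calling demoMetaExtraction() should return the two content values, with the missing entry left undefined because attr returns undefined when the selection is empty.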
