FOSSASIA recently started a new project, loklak_scraper_js. Its objective is to develop a single web-scraping library that can be used easily on most platforms, because maintaining the same scraping logic in different programming languages and projects is a headache and a waste of time. An obvious solution was to write the scrapers in JavaScript: JS is lightweight and fast, and its functions and classes can be used from many other programming languages, e.g. through Nashorn in Java.
Cheerio is a library for parsing HTML. Let’s look at how the YouTube scraper uses it.
Parsing HTML
Steps involved in web-scraping:
- The HTML source of the webpage is obtained.
- The HTML source is parsed.
- The parsed HTML is traversed to extract the required data.
For the second and third steps we use cheerio.
Obtaining the HTML source of a webpage is straightforward and is done by the getHtml function, which uses the sync-request library to send the “GET” request.
The HTML is parsed by passing the obtained source of the webpage to the load method, as in the getSearchMatchVideos function.
var $ = cheerio.load(htmlSourceOfWebpage);
Since the API of cheerio is similar to that of jQuery, by convention the variable that references the cheerio object holding the parsed HTML is named “$”.
Sometimes the requirement is to extract data from one particular HTML tag (one that contains a large number of nested child tags) rather than from the whole parsed HTML. In that case the load method can be used again, as in the getVideoDetails function, to obtain only the head tag.
var head = cheerio.load($("head").html());
The html method returns the HTML content of the selected tag, here the <head> tag. If a parameter is passed to the html method, the content of the selected tag (here <head>) is replaced by the HTML passed as the parameter.
Extracting data from parsed HTML
Some of the content that we see on a webpage is dynamic rather than static HTML. A “GET” request returns only the static HTML of the webpage. Using Inspect Element, you can see that the class attribute of an element in the rendered webpage can have a different value from the one in the static HTML obtained with the getHtml function. For example, when inspecting the link of one of the suggested videos, compare the values of the class attribute:
- In the website: [screenshot of the inspected link]
- In the static HTML obtained from the “GET” request using the getHtml function: [screenshot of the same link]
So it is recommended to first check whether the attributes have the same values in both, and then proceed accordingly.
Now, let’s dive into the actual scraping stuff.
Most of the required data is available in meta tags inside the head tag. The extractMetaAttribute function extracts the value of the content attribute of a meta tag, which is selected by another attribute and its value.
function extractMetaAttribute(cheerioObject, metaAttribute, metaAttributeValue) {
  var selector = 'meta[' + metaAttribute + '="' + metaAttributeValue + '"]';
  // Query through the cheerio object that was passed in.
  return cheerioObject(selector).attr("content");
}
“cheerioObject” here will be the “head” object created above.
For example, our final JSONObject contains an og_url key-value pair; to get it we need the following HTML element.
<meta property="og:url" content="https://www.youtube.com/watch?v=KVGRN7Z7T1A">
This can be obtained by:
- Writing a selector for the property attribute of meta. The selector would be 'meta[property="og:url"]'.
- The selector is passed to cheerioObject.
- Then attr method is used to obtain the value of content attribute.
- Finally, we set the obtained value of content attribute as the value of JSONObject’s key.
Similarly, og:site_name and other values can be extracted, and in the final JSONObject they become the values of keys such as og_site_name, just as og:url becomes og_url. Since a lot of data needs to be extracted this way, the extractMetaAttribute function generalizes it: in the above example, metaAttribute is “property” and metaAttributeValue is “og:url”.
If one parameter is passed to the attr method, it acts as a getter and returns the value of that attribute. If two parameters are passed, it acts as a setter: the first parameter is the name of the attribute and the second is its new value.
Now, what if the provided selector matches more than one HTML element and we need to extract data from, or perform some operation on, all of them? The answer is the each method of the cheerio object: it iterates over the matched elements and executes the function passed to it on each of them. The passed function has two parameters, the index of the matched element and the matched element itself. To break out of the loop early, return false from the function.
One use case of the each method in the YouTube scraper is extracting the related “tags” of the video.
The selector for this is 'meta[property="og:video:tag"]', and as these tags are inside the head tag, we can use the head object created earlier. Applying the each method, it becomes:
head('meta[property="og:video:tag"]').each(function(i, element) {
  // the logic goes here
});
Here for the first iteration the value of “i” will be “0” and “element” will be
<meta property="og:video:tag" content="Iggy">
and so on. We need the value of the content attribute, so we can use the attr method as above. Finally, all the values are pushed to an array. Hence the final code snippet with the logic:
var ary = [];
head('meta[property="og:video:tag"]').each(function(i, element) {
  ary.push(head(element).attr("content"));
});
The same functionality is implemented in the extractMetaProperties function.
function extractMetaProperties(cheerioObj, metaProperty) {
  var properties = [];
  var selector = 'meta[property="' + metaProperty + '"]';
  cheerioObj(selector).each(function(i, element) {
    properties.push(cheerioObj(element).attr("content"));
  });
  return properties;
}