scraping – blog.fossasia.org

Scraping in JavaScript using Cheerio in Loklak

Post author: Siddhant Kumar Patel
Post published: July 4, 2017
Post category: FOSSASIA GSoC loklak Tutorial

FOSSASIA recently started a new project, loklak_scraper_js. Its objective is to develop a single web-scraping library that can be used easily on most platforms, since maintaining the same scraping logic in different programming languages and projects is a headache and a waste of time. An obvious solution was to write the scrapers in JavaScript: JS is lightweight and fast, and its functions and classes can easily be used from many other programming languages, e.g. through Nashorn in Java.

Cheerio is a library that is used to parse HTML. Let’s look at the YouTube scraper.

Parsing HTML

Steps involved in web-scraping:

The HTML source of the webpage is obtained.
The HTML source is parsed.
The parsed HTML is traversed to extract the required data.

For the 2nd and 3rd steps we use cheerio.

Obtaining the HTML source of a webpage is a piece of cake and is done by the getHtml function, which uses the sync-request library to send the “GET” request.

The HTML is parsed by passing the obtained source of the webpage to cheerio’s load method, as in the getSearchMatchVideos function.

var $ = cheerio.load(htmlSourceOfWebpage);

Since the cheerio API is similar to that of jQuery, the variable that references the cheerio object holding the parsed HTML is conventionally named “$”.

Sometimes the requirement is to extract data from one particular HTML tag (one that contains a large number of nested child tags) rather than from the whole parsed HTML. In that case the load method can be used again, as in the getVideoDetails function, which obtains only the head tag.

var head = cheerio.load($("head").html());

The “html” method returns the HTML content of the selected tag, i.e. the <head> tag. If a parameter is passed to the html method, the content of the selected tag (here <head>) is replaced by the HTML given in that parameter.

Extracting data from parsed HTML

Some of the content that we see on a webpage is dynamic rather than static HTML. When a “GET” request is sent, only the static HTML of the webpage is obtained. With Inspect element, it can be seen that the class attribute has a different value in the rendered webpage than in the static HTML obtained from the “GET” request using the getHtml function. For example, inspect the link of one of the suggested videos and compare the values of the class attribute:

In the website: [image not available]

In the static HTML obtained from the “GET” request using the getHtml function: [image not available]

So it is recommended to first check whether the attributes have the same values in both, and then proceed accordingly.

Now, let’s dive into the actual scraping stuff.

Most of the required data is available in meta tags inside the head tag. The extractMetaAttribute function extracts the value of the content attribute of the meta tag that matches another provided attribute and its value.

function extractMetaAttribute(cheerioObject, metaAttribute, metaAttributeValue) {
	var selector = 'meta[' + metaAttribute + '="' + metaAttributeValue + '"]';
	// query the passed-in cheerio object and read the content attribute
	return cheerioObject(selector).attr("content");
}

“cheerioObject” here will be the “head” object created above.

For example, our final JSONObject contains an og_url key-value pair. To get it, we need to obtain the following HTML element.

<meta property="og:url" content="https://www.youtube.com/watch?v=KVGRN7Z7T1A">

This can be obtained by:

Writing a selector for the property attribute of meta. The selector would be 'meta[property="og:url"]'.
The selector is passed to cheerioObject.
Then the attr method is used to obtain the value of the content attribute.
Finally, we set the obtained value of the content attribute as the value of the JSONObject’s key.

Similarly, og:site_name, og:url and other values can be extracted; in the final JSONObject they become the values of the keys og_site_name, og_url and so on. Since a lot of data needs to be extracted this way, the extractMetaAttribute function generalizes it: in the above example, metaAttribute is “property” and metaAttributeValue is “og:url”.

If one parameter is passed to the attr method, it acts as a getter and returns the value of that attribute. If two parameters are passed, it acts as a setter: the first parameter is the name of the attribute and the second is the value to set.

Now, what if the provided selector matches more than one HTML element and we need to extract data from, or perform some operation on, all of them? The answer is the each method of the cheerio object: it iterates over the matched elements and executes the function passed to it as a parameter on each of them. The passed function has two parameters, the index of the matched element and the matched element itself. To break out of the loop early, the function returns false.

One use case of the each method in the YouTube scraper is extracting the related “tags” of the video.

The selector for this would be 'meta[property="og:video:tag"]', and since these tags are inside the head tag, we can use the head object created earlier. Applying the each method, it becomes:

head('meta[property="og:video:tag"]').each(function(i, element) {
    // the logic goes here
});

Here for the first iteration the value of “i” will be “0” and “element” will be

<meta property="og:video:tag" content="Iggy">

and so on. We need the value of the content attribute, so we can use the attr method as above. Finally, all the values are pushed to an array. Hence, the final code snippet with the logic is:

var ary = [];
head('meta[property="og:video:tag"]').each(function(i, element) {
    ary.push(head(element).attr("content"));
});

The same functionality is implemented in the extractMetaProperties method.

function extractMetaProperties(cheerioObj, metaProperty) {
	var properties = [];
	var selector = 'meta[property="' + metaProperty + '"]';
	cheerioObj(selector).each(function(i, element) {
		properties.push(cheerioObj(element).attr("content"));
	});
	return properties;
}

Improving Harvesting Decision for Kaizen Harvester in loklak server

Post author: Pratyush
Post published: July 3, 2017
Post category: GSoC loklak

About Kaizen Harvester

Kaizen is an alternative approach to harvesting in loklak. It focuses on collecting queries and information in order to generate more queries from the collected timelines. It maintains a queue of queries that is populated by extracting the following information from timelines:

Hashtags in Tweets
User mentions in Tweets
Tweets from areas near to each Tweet in timeline.
Tweets older than oldest Tweet in timeline.
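
How the first two items might be pulled out of a Tweet’s text can be sketched with regular expressions. This is only an illustration: the class and method names are hypothetical, the patterns are simplified assumptions, and loklak’s actual extraction code may differ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QueryExtractionSketch {
    // Simplified assumption: a hashtag or mention is '#' or '@' followed by word characters.
    private static final Pattern HASHTAG = Pattern.compile("#\\w+");
    private static final Pattern MENTION = Pattern.compile("@\\w+");

    /** Returns every match of the pattern in the text, in order of appearance. */
    static List<String> findAll(Pattern pattern, String text) {
        List<String> matches = new ArrayList<>();
        Matcher matcher = pattern.matcher(text);
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    static List<String> hashtags(String tweet) {
        return findAll(HASHTAG, tweet);
    }

    static List<String> mentions(String tweet) {
        return findAll(MENTION, tweet);
    }

    public static void main(String[] args) {
        String tweet = "Trying out #loklak at #FOSSASIA with @example_user";
        System.out.println(hashtags(tweet)); // [#loklak, #FOSSASIA]
        System.out.println(mentions(tweet)); // [@example_user]
    }
}
```

Each extracted hashtag or mention would then become a candidate query for the queue.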

Further, it can also use the Twitter API to get trending keywords from Twitter, and it can get search suggestions from other loklak peers.

It was introduced by @yukiisbored in pull request loklak/loklak_server#960.

The Problem: Unbiased Harvesting Decision

The Kaizen harvester either searches for queries from the queue, or tries to grab trending queries (using the Twitter API or from the backend). In the previous version of KaizenHarvester, the “harvesting vs. info-grabbing” decision was based on the value from a random boolean generator:

@Override
public int harvest() {
   if (!queries.isEmpty() && random.nextBoolean())
       return harvestMessages();

   grabSuggestions();

   return 0;
}


In a sane configuration, the Kaizen harvester uses a fixed-size queue and drops queries that are submitted once the queue is full. Since the decision doesn’t take into account how full the queue is, it would often call the grabSuggestions() method.

But since the queue would be full, the grabbed suggestions would simply be lost. This wastes the time and resources spent fetching the suggestions (from the backend or the API). To overcome this, a better decision rule was needed.
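
The loss can be illustrated with a small sketch. It assumes a bounded queue that rejects additions when full, and uses the JDK’s ArrayBlockingQueue as a stand-in; Kaizen’s own queue class is not shown in this post, and the class and method names below are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;

class BoundedQueryQueueSketch {
    /** Offers each suggestion to the queue and returns how many were dropped because it was full. */
    static int addSuggestions(ArrayBlockingQueue<String> queries, List<String> suggestions) {
        int dropped = 0;
        for (String suggestion : suggestions) {
            // offer() returns false instead of blocking when the queue is full
            if (!queries.offer(suggestion)) {
                dropped++;
            }
        }
        return dropped;
    }

    public static void main(String[] args) {
        // A queue limited to three queries that is already full
        ArrayBlockingQueue<String> queries = new ArrayBlockingQueue<>(3);
        queries.offer("query-1");
        queries.offer("query-2");
        queries.offer("query-3");

        // Both fetched suggestions are rejected, so the work of fetching them was wasted
        int dropped = addSuggestions(queries, Arrays.asList("trend-1", "trend-2"));
        System.out.println("dropped " + dropped + " of 2 suggestions"); // dropped 2 of 2 suggestions
    }
}
```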

The Solution: Making Decision Biased

To solve the problem of the dumb harvesting decision, the harvester is now triggered based on the following steps:

Calculate the ratio to which the queue is filled (q.size() / q.maxSize()).
Generate a random floating point number between 0 and 1.
If the number is less than the ratio, harvest. Otherwise, grab harvesting suggestions.

Why would this work?

Initially, when the queue is mostly empty, the ratio is a small number. So a random number generated between 0 and 1 is very likely to be greater than the ratio, and Kaizen goes for grabbing search suggestions.

If the ratio is large (i.e. the queue is almost full), the random number is very likely to be less than it, so Kaizen is more likely to search for results than to grab suggestions.
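
The argument can be stated a little more precisely. Assuming the random number U is drawn uniformly from [0, 1) and writing r for the fill ratio (both symbols are introduced here only for illustration), the chance of harvesting equals the ratio itself, whereas the earlier random-boolean rule harvested with probability 1/2 for any non-empty queue:

```latex
P(\text{harvest}) = P(U < r) = r, \qquad r = \frac{\texttt{q.size()}}{\texttt{q.maxSize()}} \in [0, 1]
```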

Graph?

The following graph shows how the harvester’s decision changes with the queue ratio. For each queue ratio, it performs 10k iterations and plots the number of times the harvesting decision was taken.
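
The same experiment can be approximated in a few lines of Java. This is a sketch rather than the code behind the graph: the class and method names are hypothetical, the seed is arbitrary, and it prints counts instead of plotting them.

```java
import java.util.Random;

class HarvestDecisionSimulation {
    /** Counts how often "harvest" is chosen in the given number of trials for one fill ratio. */
    static int countHarvests(double ratio, int iterations, Random random) {
        int harvests = 0;
        for (int i = 0; i < iterations; i++) {
            // Same rule as in the steps above: harvest when the random draw is below the ratio
            if (random.nextFloat() < ratio) {
                harvests++;
            }
        }
        return harvests;
    }

    public static void main(String[] args) {
        Random random = new Random(42); // fixed seed so that runs are repeatable
        for (int step = 0; step <= 10; step++) {
            double ratio = step / 10.0;
            int harvests = countHarvests(ratio, 10_000, random);
            System.out.printf("ratio %.1f -> harvested %d of 10000 times%n", ratio, harvests);
        }
    }
}
```

The count should grow roughly linearly with the ratio, from 0 for an empty queue to 10000 for a full one.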

Change in code

The harvest() method was changed in loklak/loklak_server#1158 to make a smarter harvesting vs. info-grabbing decision, as follows:

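@Override
public int harvest() {
   // random draw in [0, 1) that the fill ratio is compared against
   float targetProb = random.nextFloat();
   // fall back to an even chance when QUERIES_LIMIT is not positive
   float prob = 0.5F;
   if (QUERIES_LIMIT > 0) {
       // fraction of the queue that is currently filled
       prob = queries.size() / (float)QUERIES_LIMIT;
   }
   if (!queries.isEmpty() && targetProb < prob) {
       return harvestMessages();
   }

   grabSuggestions();

   return 0;
}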

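
One detail in harvest() is easy to miss: the (float) cast on QUERIES_LIMIT. Assuming QUERIES_LIMIT is an int (its declaration is not shown in this post), dividing without the cast would be integer division and the ratio would collapse to 0 for any queue that is not full. A minimal sketch with hypothetical helper names:

```java
class FillRatioSketch {
    /** Floating-point division, as in harvest(): keeps the fractional part. */
    static float fillRatio(int size, int limit) {
        return size / (float) limit;
    }

    /** Integer division first, then widening to float: the fraction is lost. */
    static float truncatedRatio(int size, int limit) {
        return size / limit;
    }

    public static void main(String[] args) {
        System.out.println(fillRatio(3, 4));      // 0.75
        System.out.println(truncatedRatio(3, 4)); // 0.0
    }
}
```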

Conclusion

This change improved the Kaizen harvester and made it sensitive to how fast its queue is filling. When the queue is full, no more requests are made to the backend for suggestions whose queries could not be added to the queue.

Resources

Current state of Kaizen Harvester – https://github.com/loklak/loklak_server/blob/development/src/org/loklak/harvester/strategy/KaizenHarvester.java.
Kaizen Harvester usage guide – https://github.com/loklak/loklak_server/blob/development/docs/kaizen.md.
Code used to generate the graph – https://gist.github.com/singhpratyush/8292b6fc815e5a18311848f635724f99.