Scraping in JavaScript using Cheerio in Loklak

FOSSASIA recently started a new project, loklak_scraper_js. Its objective is to develop a single web-scraping library that can be used easily on most platforms, because maintaining the same scraping logic in several programming languages and projects is a headache and a waste of time. An obvious solution was to write the scrapers in JavaScript: JS is lightweight and fast, and its functions and classes can be used from many other programming languages, e.g. through Nashorn in Java.

Cheerio is a library used to parse HTML. Let’s look at the YouTube scraper.

Parsing HTML

Steps involved in web scraping:

1. The HTML source of the webpage is obtained.
2. The HTML source is parsed.
3. The parsed HTML is traversed to extract the required data.

We use cheerio for the 2nd and 3rd steps.

Obtaining the HTML source of a webpage is a piece of cake and is done by the getHtml function, which uses the sync-request library to send the “GET” request.

The HTML is parsed by passing the obtained source to cheerio's load method, as in the getSearchMatchVideos function:

    var $ = cheerio.load(htmlSourceOfWebpage);

Since the cheerio API is similar to that of jQuery, by convention the variable that references the cheerio object holding the parsed HTML is named “$”.

Sometimes the requirement is to extract data from one particular HTML tag (one that contains a large number of nested child tags) rather than from the whole parsed HTML. In that case the load method can be used again, as in the getVideoDetails function, to obtain only the head tag:

    var head = cheerio.load($("head").html());

The “html” method returns the HTML content of the selected tag, i.e. the <head> tag. If a parameter is passed to the html method, the content of the selected tag (here <head>) is replaced by the HTML given in that parameter.

Extracting data from parsed HTML

Some of the content that we see on a webpage is dynamic; it is not part of the static HTML.
When a “GET” request is sent, the static HTML of the webpage is obtained. Using Inspect Element, you can see that the class attribute can have a different value in the rendered webpage than in the static HTML obtained from the “GET” request with the getHtml function. For example, inspect the link of one of the suggested videos and compare the value of its class attribute in the website with its value in the static HTML.

So it is recommended to first check whether the attributes have the same values, and then proceed accordingly.

Now, let’s dive into the actual scraping. Most of the required data is available in meta tags inside the head tag. The extractMetaAttribute function extracts the value of the content attribute of a meta tag, selected by another provided attribute and its value:

    function extractMetaAttribute(cheerioObject, metaAttribute, metaAttributeValue) {
        var selector = 'meta[' + metaAttribute + '="' + metaAttributeValue + '"]';
        return cheerioObject(selector).attr("content");
    }

“cheerioObject” here will be the “head”…
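To make the selector built by extractMetaAttribute easier to see, here is a minimal sketch that moves the string construction into a separate helper. Note the assumptions: buildMetaSelector is a hypothetical name that is not part of loklak_scraper_js, fakeHead is a hand-written stand-in for a cheerio-loaded head so that the snippet runs without installing cheerio, and the meta values are made up for illustration. With the real library, you would pass the function returned by cheerio.load instead of fakeHead.

```javascript
// Hypothetical helper (not part of loklak_scraper_js): builds the attribute
// selector string that extractMetaAttribute hands to cheerio.
function buildMetaSelector(metaAttribute, metaAttributeValue) {
  return 'meta[' + metaAttribute + '="' + metaAttributeValue + '"]';
}

// Same logic as the function in the post, using the helper above.
function extractMetaAttribute(cheerioObject, metaAttribute, metaAttributeValue) {
  var selector = buildMetaSelector(metaAttribute, metaAttributeValue);
  return cheerioObject(selector).attr("content");
}

// Stand-in for a cheerio-loaded <head>. It only knows the exact selectors
// listed here, and the values are invented for this example.
function fakeHead(selector) {
  var contentBySelector = {
    'meta[property="og:title"]': "Example video title",
    'meta[itemprop="duration"]': "PT3M20S"
  };
  return {
    attr: function (name) {
      return name === "content" ? contentBySelector[selector] : undefined;
    }
  };
}

console.log(buildMetaSelector("property", "og:title")); // meta[property="og:title"]
console.log(extractMetaAttribute(fakeHead, "itemprop", "duration")); // PT3M20S
```

When no meta tag matches, the stand-in returns undefined, so a caller should be ready for a missing value.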


Improving Harvesting Decision for Kaizen Harvester in loklak server

About Kaizen Harvester

Kaizen is an alternative approach to harvesting in loklak. It focuses on collecting queries and information from the timelines it harvests in order to generate more queries. It maintains a queue of queries that is populated by extracting the following information from timelines -

- Hashtags in Tweets
- User mentions in Tweets
- Tweets from areas near each Tweet in the timeline
- Tweets older than the oldest Tweet in the timeline

Further, it can also use the Twitter API to get trending keywords from Twitter and get search suggestions from other loklak peers. It was introduced by @yukiisbored in pull request loklak/loklak_server#960.

The Problem: Unbiased Harvesting Decision

The Kaizen harvester either searches for queries from the queue, or tries to grab trending queries (using the Twitter API or from the backend). In the previous version of KaizenHarvester, the decision of “harvesting vs. info-grabbing” was based on the value from a random boolean generator -

    @Override
    public int harvest() {
        if (!queries.isEmpty() && random.nextBoolean())
            return harvestMessages();

        grabSuggestions();

        return 0;
    }

[SOURCE]

In sane situations, the Kaizen harvester is configured to use a fixed-size queue and drops any queries submitted once the queue is full. Since the decision doesn’t take into account how full the queue is, it would often call the grabSuggestions() method even when the queue was full, and the grabbed suggestions would simply be lost. This wasted the time and resources spent fetching the suggestions (from the backend or the API). To overcome this, something better had to be done in this part.

The Solution: Making Decision Biased

To solve the problem of the dumb harvesting decision, the harvester was triggered based on the following steps -

1. Calculate the ratio of the queue that is filled (q.size() / q.maxSize()).
2. Generate a random floating point number between 0 and 1.
3. If the number is less than the ratio, harvest.
4. Otherwise, get harvesting suggestions.

Why would this work?

Initially, when the queue is mostly empty, the ratio is a small number, so a random number generated between 0 and 1 is very likely to be greater than the ratio, and Kaizen goes for grabbing search suggestions. If the ratio is large (i.e. the queue is almost full), the random number is very likely to be less than it, so Kaizen is more likely to search for results than to grab suggestions.

Graph?

The following graph shows how the harvester decision would change. It performs 10k iterations for a given queue ratio and plots the number of times the harvesting decision was taken.

Change in code

The harvest() method was changed in loklak/loklak_server#1158 to take a smart decision of harvesting vs. info-grabbing in the following manner -

    @Override
    public int harvest() {
        float targetProb = random.nextFloat();
        float prob = 0.5F;
        if (QUERIES_LIMIT > 0) {
            prob = queries.size() / (float)QUERIES_LIMIT;
        }

        if (!queries.isEmpty() && targetProb < prob) {
            return harvestMessages();
        }

        grabSuggestions();

        return 0;
    }

[SOURCE]

Conclusion

This change brought enhancement in the Kaizen harvester and made it…
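The experiment described under “Graph?” (10k decisions per queue ratio) can be approximated with a short simulation. This is a JavaScript sketch of the decision rule, not the Java code from loklak_server: shouldHarvest and countHarvests are hypothetical names, the random source is passed in as a parameter so it can be swapped out, and the queue limit of 100 is an arbitrary choice for illustration.

```javascript
// Decide whether to harvest, following the rule from the post:
// harvest when a uniform random number falls below the queue fill ratio.
function shouldHarvest(queueSize, queueLimit, nextRandom) {
  var prob = 0.5; // fallback when no limit is configured, as in the Java code
  if (queueLimit > 0) {
    prob = queueSize / queueLimit;
  }
  return queueSize > 0 && nextRandom() < prob;
}

// Count how many of `iterations` decisions choose harvesting.
function countHarvests(queueSize, queueLimit, iterations, nextRandom) {
  var harvests = 0;
  for (var i = 0; i < iterations; i++) {
    if (shouldHarvest(queueSize, queueLimit, nextRandom)) {
      harvests++;
    }
  }
  return harvests;
}

// Print the harvest count for fill ratios 0.0, 0.1, ..., 1.0.
var LIMIT = 100; // assumed queue limit, for illustration only
for (var filled = 0; filled <= LIMIT; filled += 10) {
  console.log(filled / LIMIT, countHarvests(filled, LIMIT, 10000, Math.random));
}
```

With an empty queue the count is always 0, and with a full queue it is always 10000, because Math.random() returns values in [0, 1). In between, the count should land near the fill ratio times 10000, which is the roughly linear trend the biased decision is meant to produce.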
