Loklak Scraper JS, as the name suggests, is a set of scrapers for social media websites written in Node.js. A common requirement while scraping is that a parent webpage provides links to related child webpages, and the data to be scraped is present in both the parent webpage and the child webpages. For example, suppose we want to scrape Quora user profiles matching the search query “Siddhant”. The page of matching profiles, https://www.quora.com/search?q=Siddhant&type=profile, is the parent webpage, and the child webpages are the links to each matched profile.
A simplistic approach is to first obtain the HTML of the parent webpage, then fetch the HTML of the child webpages one after another and parse them to get the desired data. The problem with this approach is that it is slow, because each request has to wait for the previous one to finish.
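The one-at-a-time approach can be sketched as follows. This is a minimal illustration, not code from Loklak Scraper JS: `fetchStub` and `extractLinks` are hypothetical stand-ins that fake a request and the link extraction, so the snippet runs without network access.

```javascript
// Hypothetical stand-in for an HTTP request: resolves with fake HTML after a short delay.
function fetchStub(url) {
  return new Promise(resolve => setTimeout(() => resolve(`<html>${url}</html>`), 10));
}

// Hypothetical link extraction: pretends the parent page lists three child profiles.
function extractLinks(parentHtml) {
  return ['/profile/a', '/profile/b', '/profile/c'];
}

// Fetch the parent, then fetch each child only after the previous one has finished.
async function scrapeSequentially(parentUrl) {
  const parentHtml = await fetchStub(parentUrl);
  const childPages = [];
  for (const link of extractLinks(parentHtml)) {
    childPages.push(await fetchStub(link)); // waits here before starting the next request
  }
  return { parentHtml, childPages };
}
```

Because every request waits for the one before it, the total time here is roughly the sum of all the individual delays, and it grows with the number of child pages.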
A different approach is to use request-promise-native to implement the logic asynchronously. But this approach has limitations. The HTML of the child webpages can only be fetched after the HTML of the parent webpage has been obtained, and the number of child webpages is dynamic. So there is a request dependency between parent and child: only once we have the data from the parent webpage can we extract data from the child webpages. The code would look like this:
    request(parent_url)
      .then(data => {
        ...
        request(child_url)
          .then(data => {
            // again nesting of child urls
          })
          .catch(error => {
          });
      })
      .catch(error => {
      });
First, this approach leads to callback hell. Horrible, isn’t it? Second, we don’t know how many nested callbacks to write, because the number of child webpages is dynamic.
The saviour: RxJS
The solution to our problem is Reactive Extensions for JavaScript (RxJS). Using RxJS we can obtain the required data asynchronously and without callback hell!
First, the request-promise object for the parent webpage is obtained, and an observable is generated from it using Rx.Observable.fromPromise. The flatMap operator is used to parse the HTML of the parent webpage and obtain the links to the child webpages. These links are turned into request-promise objects, and the array map method then transforms each of them into an observable. The HTML emitted by the resulting observables is parsed and accumulated using the zip operator. Finally, the accumulated data is delivered to the subscriber. This is implemented in the getScrapedData method of the Quora JS scraper:
    getScrapedData(query, callback) {
      // observable from parent webpage
      Rx.Observable.fromPromise(this.getSearchQueryPromise(query))
        .flatMap((t, i) => { // t is html of parent webpage
          // request-promise object of child webpages
          let profileLinkPromises = this.getProfileLinkPromises(t);
          // request-promise object to observable transformation
          let obs = profileLinkPromises.map(elem => Rx.Observable.fromPromise(elem));

          // each Quora profile is parsed
          return Rx.Observable.zip( // accumulation of data from child webpages
            ...obs,
            (...profileLinkObservables) => {
              let scrapedProfiles = [];
              for (let i = 0; i < profileLinkObservables.length; i++) {
                let $ = cheerio.load(profileLinkObservables[i]);
                scrapedProfiles.push(this.scrape($));
              }
              return scrapedProfiles; // accumulated data returned
            }
          )
        })
        .subscribe( // desired data is subscribed
          scrapedData => callback({profiles: scrapedData}),
          error => callback(error)
        );
    }
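One way to read the zip step: each child observable wraps a single promise, so zip emits exactly once, after every child has resolved, with the results in the same order as the links. That is the same shape as Promise.all over a list whose length is only known at run time. The sketch below illustrates just that shape, without RxJS and without network access. It is not code from Loklak Scraper JS; `fetchStub` and `extractLinks` are hypothetical stand-ins.

```javascript
// Hypothetical stand-in for an HTTP request: resolves with fake HTML after delayMs.
function fetchStub(url, delayMs) {
  return new Promise(resolve => setTimeout(() => resolve(`<html>${url}</html>`), delayMs));
}

// Hypothetical link extraction; the number of links is only known at run time.
function extractLinks(parentHtml) {
  return ['/profile/a', '/profile/b', '/profile/c'];
}

// Parent first, then all children concurrently; resolves once every child has arrived.
function scrapeConcurrently(parentUrl) {
  return fetchStub(parentUrl, 10).then(parentHtml => {
    const links = extractLinks(parentHtml);
    // Later links get shorter delays, so the responses arrive in reverse order.
    const childPromises = links.map((link, i) => fetchStub(link, 10 * (links.length - i)));
    // Like zip over single-value observables: one combined result, in link order.
    return Promise.all(childPromises);
  });
}
```

Even though the last child responds first here, the combined array follows the order of the links, which is why the projection function in getScrapedData can walk its arguments by index.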
Resources:
- RxJS tutorial: http://reactivex.io/rxjs/manual/tutorial.html