Simplifying Scrapers using BaseScraper

Loklak Server's main function is to scrape data from websites and other sources and output it in formats such as JSON, XML and RSS. The project contains many scrapers that do this, but they are implemented with different designs and libraries, which makes them inconsistent with each other and difficult to change.

Because the scrapers vary in design, it is difficult to modify them, and the same issue (if it appears) has to be fixed separately in each one. This signals a fault in the design. Inheritance can solve the problem. I therefore created the abstract class BaseScraper, so that each scraper concentrates on fetching data from HTML while supporting tasks, such as creating a connection from a URL, are defined in BaseScraper.

The concept is easy to implement, but a complete implementation requires going through the full list of tasks a scraper performs.

The following are the tasks, with descriptions of how each is implemented using BaseScraper:

  1. Endpoint that triggers the scraper

Every search scraper inherits the class AbstractAPIHandler, which is used to fetch the GET parameters from the endpoint. These parameters determine what data is scraped. The arguments of the serviceImpl method are used to generate the output, which is returned as a JSONObject.

For this task, the method serviceImpl has been defined in BaseScraper and method getData is implemented to return the output. This method is the driver method of the scraper.

public JSONObject serviceImpl(Query call, HttpServletResponse response, Authorization rights, JSONObjectWithDefault permissions) throws APIException {
    this.setExtra(call);
    return this.getData().toJSON(false, "metadata", "posts");
}
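The division of work between the driver method and the concrete scraper can be sketched roughly as follows. This is a simplified illustration, not the loklak code: the class names SketchBaseScraper and EchoScraper are hypothetical, and plain Maps stand in for loklak's Query and JSONObject types so that the sketch runs on the JDK alone.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified stand-in for BaseScraper (assumption: Maps replace
// loklak's Query and JSONObject types).
abstract class SketchBaseScraper {
    protected Map<String, String> extra = new HashMap<>();
    protected String query = "";

    // Driver method: store the request parameters, then delegate to the scraper.
    public Map<String, Object> serviceImpl(Map<String, String> call) {
        this.setExtra(call);
        return this.getData();
    }

    // Keeps a copy of all GET parameters and extracts the "query" parameter.
    protected void setExtra(Map<String, String> call) {
        this.extra = new HashMap<>(call);
        this.query = call.getOrDefault("query", "");
    }

    // Each concrete scraper decides how the data is actually produced.
    protected abstract Map<String, Object> getData();
}

// Hypothetical scraper that only echoes the query instead of fetching a page.
class EchoScraper extends SketchBaseScraper {
    @Override
    protected Map<String, Object> getData() {
        Map<String, Object> out = new HashMap<>();
        out.put("scraper", "echo");
        out.put("query", this.query);
        return out;
    }
}
```

With this layout, a new scraper only has to supply getData; the parameter handling is inherited.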

 

  2. Constructor

The constructor of a scraper defines the base URL of the website to be scraped, the name of the scraper, and the data structure that holds all GET parameters passed to the scraper. A Map is used to fetch the GET parameters from the Query object.

Since every scraper has its own base URL, scraper name and GET parameters, the constructor is implemented in the respective scrapers. QuoraProfileScraper is an example that defines these variables.

  3. Get all input variables

To get all input variables, BaseScraper defines setters and getters that fetch them as a Map from the Query object. There is also an abstract method setParam(). It is defined in the respective scrapers to pick out the parameters useful to the scraper and set them on the scraper's class variables.

// Setter for get parameters from call object
protected void setExtra(Query call) {
    this.extra = call.getMap();
    this.query = call.get("query", "");
    this.setParam();
}

// Getter for a GET parameter by its key
public String getExtraValue(String key) {
    String value = "";
    if(this.extra.get(key) != null) {
        value = this.extra.get(key).trim();
    }
    return value;
}

// Definition in QuoraProfileScraper
protected void setParam() {
    if(!"".equals(this.getExtraValue("type"))) {
        this.typeList = Arrays.asList(this.getExtraValue("type").trim().split("\\s*,\\s*"));
    } else {
        this.typeList = new ArrayList<String>();
        this.typeList.add("all");
        this.setExtraValue("type", String.join(",", this.typeList));
    }
}
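The parsing done in setParam can be tried in isolation. The following is a minimal sketch assuming the same comma-separated "type" parameter and the same "all" fallback; the class TypeParamSketch and the method parseTypes are hypothetical names that are not part of loklak.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper illustrating how a comma-separated parameter is split.
class TypeParamSketch {
    // Splits the raw "type" value on commas (ignoring surrounding spaces);
    // falls back to the single entry "all" when the value is missing or empty.
    static List<String> parseTypes(String raw) {
        String value = (raw == null) ? "" : raw.trim();
        if (value.isEmpty()) {
            List<String> fallback = new ArrayList<>();
            fallback.add("all");
            return fallback;
        }
        return Arrays.asList(value.split("\\s*,\\s*"));
    }

    public static void main(String[] args) {
        System.out.println(parseTypes("question , users"));
        System.out.println(parseTypes(""));
    }
}
```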

 

  4. URL creation for web scraper

URL creation should be implemented in a separate method, as in TwitterScraper. The following is a rough implementation adapted from one of my pull requests:

protected String prepareSearchUrl(String type) {
    URIBuilder url = null;
    String midUrl = "search/";

    try {
        switch(type) {
            case "question":
                url = new URIBuilder(this.baseUrl + midUrl);
                url.addParameter("q", this.query);
                url.addParameter("type", "question");
        .
        .
    }
    .
    .
    return url.toString();
}
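For readers without the Apache HttpClient dependency at hand, the same idea can be sketched with the JDK alone. Note that this swaps URIBuilder for java.net.URLEncoder, handles only the "question" type, and uses hypothetical names (SearchUrlSketch) and a placeholder base URL; it is an illustration under these assumptions rather than the method used in the project.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical JDK-only variant of a search URL builder.
class SearchUrlSketch {
    // Encodes a query value for use in a URL query string.
    static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is guaranteed to exist on every Java platform.
            throw new IllegalStateException("UTF-8 is always available", e);
        }
    }

    // Builds the search URL for the given type; only "question" is sketched here.
    static String prepareSearchUrl(String baseUrl, String type, String query) {
        switch (type) {
            case "question":
                return baseUrl + "search/?q=" + encode(query) + "&type=question";
            default:
                throw new IllegalArgumentException("Unsupported type: " + type);
        }
    }
}
```

Keeping URL creation in one method means the encoding rules are applied in a single place for every search type.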

 

  5. Get BufferedReader object from InputStream

The getDataFromConnection method obtains a BufferedReader object from ClientConnection. The scrape method then uses this object to read the web page line by line and fetch data.

ClientConnection connection = new ClientConnection(url);
BufferedReader br = getHtml(connection);
.
.
.
public BufferedReader getHtml(ClientConnection connection) {

    if (connection.inputStream == null) {
        return null;
    }

    BufferedReader br = new BufferedReader(new InputStreamReader(connection.inputStream, StandardCharsets.UTF_8));
    return br;
}
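Reading a page line by line from such a reader can be sketched as follows. The example is an assumption-laden illustration: a plain InputStream stands in for ClientConnection's inputStream, and the class ReaderSketch and the method linesContaining are hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: a plain InputStream replaces ClientConnection here.
class ReaderSketch {
    // Wraps the stream in a UTF-8 BufferedReader, or returns null without a stream.
    static BufferedReader getHtml(InputStream in) {
        if (in == null) {
            return null;
        }
        return new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
    }

    // Reads the page line by line and keeps the trimmed lines containing the marker.
    static List<String> linesContaining(InputStream in, String marker) {
        List<String> hits = new ArrayList<>();
        BufferedReader br = getHtml(in);
        if (br == null) {
            return hits;
        }
        try (BufferedReader reader = br) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.contains(marker)) {
                    hits.add(line.trim());
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return hits;
    }
}
```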

 

  6. Scraping of data from HTML

The method that scrapes data is declared abstract in BaseScraper and defined in each concrete scraper. The current BaseScraper and a scraper that extends it are good examples of this split.

  7. Output of data

The output of the scrape method is collected in Post data objects that are implemented for the respective scraper. These Post objects are added to a Timeline iterator, which outputs the data as a JSONArray. The result is then returned enclosed in a Post wrapper object.

This data could be output directly as Post objects, but adding them to the iterator allows the Post objects to be sorted and to be indexed in ElasticSearch.
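Why collecting posts in a timeline helps with ordering can be shown with a small sketch. The classes TimedPost and TimelineSketch are hypothetical stand-ins that keep only an id and a timestamp; the real Timeline and Post classes carry more than this.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for a Post: only the identity and the timestamp.
class TimedPost {
    final String postId;
    final long timestamp;

    TimedPost(String postId, long timestamp) {
        this.postId = postId;
        this.timestamp = timestamp;
    }
}

// Hypothetical stand-in for the Timeline: collects posts and returns them in order.
class TimelineSketch {
    private final List<TimedPost> posts = new ArrayList<>();

    void add(TimedPost post) {
        posts.add(post);
    }

    // Returns the post ids, newest post first.
    List<String> idsNewestFirst() {
        List<TimedPost> sorted = new ArrayList<>(posts);
        sorted.sort(Comparator.comparingLong((TimedPost p) -> p.timestamp).reversed());
        List<String> ids = new ArrayList<>();
        for (TimedPost post : sorted) {
            ids.add(post.postId);
        }
        return ids;
    }
}
```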

 


Unifying Data from Different Scrapers of loklak server using Post

Loklak Server is software that scrapes data from different websites through different endpoints. Creating a single endpoint is difficult, because it needs a sound design for using multiple scrapers together, and that requires several changes. One of the changes I introduced was the Post class, which acts as both a wrapper and an interface for the data objects of search scrapers (though its implementation in the scrapers is still in progress).

Post is a subclass of JSONObject, which helps in working with JSON data in Java. In other words, Post is a JSONObject with an identity (we call it postId) and a timestamp of when the data was scraped. It is used to capture data fetched by the web scrapers. The benefit of having JSONObject as the superclass is that it provides methods to capture and access data efficiently.
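The idea of "a JSON-like object with an identity and a timestamp" can be sketched as below. This is only an illustration: HashMap replaces JSONObject so that the code runs on the JDK alone, and the class name PostSketch and the key names "post_id" and "timestamp" are assumptions, not the names used in loklak.

```java
import java.util.HashMap;

// Hypothetical stand-in for Post (assumption: HashMap replaces JSONObject).
class PostSketch extends HashMap<String, Object> {
    private static final long serialVersionUID = 1L;

    // Every post gets its identity and scrape time at construction.
    PostSketch(String postId, long timestamp) {
        put("post_id", postId);
        put("timestamp", timestamp);
    }

    String getPostId() {
        return (String) get("post_id");
    }

    long getTimestamp() {
        return (Long) get("timestamp");
    }
}
```

Scraped fields are then added with the inherited put method, for example `post.put("question", "What is loklak?")`, and read back with get.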

Why Post?

At present there is a class MessageEntry, which is the superclass of TwitterTweet (the data object of TwitterScraper). It has numerous methods that data objects can use to clean and analyse data. Its disadvantage is that it is specialized for social websites like Twitter and becomes redundant for other types of websites such as Quora, GitHub, etc.

The Post object, in contrast, is small but powerful and flexible, because it can handle data the way a JSONObject does. It contains getter and setter methods and the identity members that give each Post object a unique identity. It doesn't have any methods for analysing or cleaning data, but the methods of the MessageEntry class can be used for that purpose.

Uses of Post Object

When I started working on the Post object, it could be used as a marker interface for data objects. These are the advantages I found in it:

1) The data object of any scraper can be accessed through a variable of type Post. This is the primary reason it is an interface.

2) In addition to accessing the data objects, one can use it directly to fetch, modify or use data without knowing which scraper it belongs to. This feature is useful in the Timeline iterator.

This example shows how the Post interface is used to append two lists of Posts (possibly carrying different types of data) into one.

public void mergePost(PostTimeline list) {
    for (Post post: list) {
        this.add(post);
    }
}
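The merge can be tried end to end with a JDK-only sketch. Here each post is a plain Map and MergeTimelineSketch is a hypothetical stand-in for PostTimeline, so the details are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for PostTimeline: a list of posts, each a plain Map here.
class MergeTimelineSketch extends ArrayList<Map<String, Object>> {
    private static final long serialVersionUID = 1L;

    // Appends every post of the other timeline, whatever scraper produced it.
    void mergePost(MergeTimelineSketch list) {
        for (Map<String, Object> post : list) {
            this.add(post);
        }
    }

    // Helper that builds a one-field post for the example.
    static Map<String, Object> post(String key, Object value) {
        Map<String, Object> post = new HashMap<>();
        post.put(key, value);
        return post;
    }
}
```

Because the merge only relies on the common post type, a timeline holding question posts can absorb one holding tweet posts without any scraper-specific code.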

 

Post as a wrapper object

While working on the Post object, I converted it into a class so that it can also be used as a wrapper. But why a wrapper? A wrapper can wrap a list of Post objects into one object. It doesn't have any identity or timestamp. It is just a utility to bundle a pack of data objects with homogeneous attributes.

This is an example implementation of the Post object as a wrapper. typeArray is a wrapper used to store two arrays of data objects. These arrays are timeline objects that are saved as JSONArray objects in the Post wrapper.

    Post typeArray = new Post(true);
    switch(type) {
        case "users":
            typeArray.put("users", scrapeProfile(br, url).toArray());
            break;
        case "question":
            typeArray.put("question", scrapeQues(br, url).toArray());
            break;
        default:
            break;
    }
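How a wrapper ends up holding one array per requested type can be sketched as follows. Everything here is hypothetical: a LinkedHashMap stands in for the Post wrapper, lists of strings stand in for the timelines, and the scrape method returns fixed values instead of parsing HTML.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the Post wrapper and the per-type scrape methods.
class WrapperSketch {
    // Fixed example results; a real scraper would parse them from HTML.
    static List<String> scrape(String type) {
        switch (type) {
            case "users":
                return Arrays.asList("user_1", "user_2");
            case "question":
                return Arrays.asList("question_1");
            default:
                return null;
        }
    }

    // Builds one wrapper holding a separate array of results for every known type.
    static Map<String, List<String>> wrap(List<String> typeList) {
        Map<String, List<String>> typeArray = new LinkedHashMap<>();
        for (String type : typeList) {
            List<String> results = scrape(type);
            if (results != null) {
                typeArray.put(type, results);
            }
        }
        return typeArray;
    }
}
```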

 


 

Create Scraper in Javascript for Loklak Scraper JS

Loklak Scraper JS is the latest repository in the Loklak project. It is one of the more interesting projects because of the expected benefits of Javascript in web scraping. It runs on the Node Javascript engine and is used in the Loklak Wok project as a bundled package. It has the potential to be used in other repositories and enhance them.

Scraping in Python is easy (at least for Pythonistas): one just needs to import the Requests library and the BeautifulSoup library (lxml as a better option), write a few lines with Requests to get the webpage and a few lines of bs4 to walk through the HTML and scrape data. This adds up to fewer than a hundred lines of code. Javascript code, by comparison, isn't as readable as Python (at least to me). But it has an advantage: it can easily deal with the Javascript in the pages we are scraping. This is one of the motives for which the Loklak Scraper JS repository was created and for which we contributed to it.

I recently coded a Javascript scraper in the loklak_scraper_js repository. While coding, I found its libraries similar to the ones I use in Python. This blog is therefore for Pythonistas: it shows how they can start scraping in Javascript by the time they finish reading, and also contribute to Loklak Scraper JS.

First, replace the Python interpreter and the Requests and BeautifulSoup libraries with the Node JS interpreter and the Request and Cheerio JS libraries.

1) Node JS Interpreter: The Node JS interpreter is used to interpret Javascript files. It differs from Python in that it deals with a whole project, whereas Python deals with a module. The Node version compatible with most of the libraries is 6.0.0, whereas the latest version available (when I checked) is 8.0.0.

TIP: use `--save` with npm when installing a library, so that the library is recorded in package.json.

2) Request Library: This is used to load the webpage to be processed. It is similar to the one in Python.

The Request-promise library, a wrapper around Request built on the Bluebird library, improves readability and makes code cleaner by replacing nested callbacks with promise chains.

 

3) Cheerio Library: A Pythonista (a rookie one) can call it the twin of the BeautifulSoup library. But it is faster and it is Javascript. Its selector implementation is nearly identical to jQuery's.

Let us code a basic Javascript scraper. I will take the TimeAndDate scraper from loklak_scraper_js as an example. It takes a place as input and outputs its local time.

Step#1: Fetching HTML from the webpage with the help of the Request library

We pass the url to the request function to fetch the webpage, which is saved to the `html` variable. The scrapeTimeAndDate() function then scrapes data from html.

var request = require("request");

var url = "http://www.timeanddate.com/worldclock/results.html?query=London";
var html;

request(url, function(error, response, body) {
  // stop if the page could not be fetched
  if (error) {
    console.log("Error: " + error);
    process.exit(-1);
  }
  html = body;
  scrapeTimeAndDate();
});

 

Step#2: Scraping important data from HTML using Cheerio JS

The list of dates and times of locations is embedded in a table tag, so we will iterate through its <td> tags and extract the text.

  a) Load the HTML into Cheerio, as we do in BeautifulSoup

In Python

soup = BeautifulSoup(html,'html5lib')

 

In Cheerio JS

$ = cheerio.load(html);

 

  b) This line finds the tr tags inside the table tag.

var htmlTime = $("table").find('tr');

 

  c) Iterate through the td tags' data using the each() function. This function acts like a loop (in Python), iterating through the list of elements from which data will be extracted.

htmlTime.each(function (index, element) {
  // in Python, we would use a loop: `for element in elements:`
  tag = $(element).find("td");    // in Python, `tag = element.find_all('td')`
  if (tag.text() != "") {
    .
    .
    // EXTRACT DATA
    .
    .
  } else {
    // go to next td tag
    tag = tag.next();
  }
});

 

  d) To extract data

Cheerio JS loads the HTML and uses the DOM model to traverse it. The DOM model treats HTML as a tree. So, go to the tag and scrape the data you want.

// extract location (text) enclosed in tag
location = tag.text();
// go to next tag
tag = tag.next();
// extract time (text) enclosed in tag
time = tag.text();
// save in an object, like a dictionary in Python
loc_list["location"] = location;
loc_list["time"] = time;

 

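Some other useful functions:

1) $(selector, [context], [root])

Returns the elements matching selector (a tag, class or id), searched within context, which is itself searched within root.

2) $("table").attr(name, value)

Sets the attribute name to value on the matched elements. With only the name, $("table").attr(name) returns the attribute's value for the first matched element.

3) obj.html()

Returns the HTML enclosed in the tag.

For more, see the Cheerio documentation.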

Step#3: Execute scraper using command

node <scrapername>.js

 

I hope this blog explains how to scrape in Javascript by finding similarities with Python.
