Unifying Data from Different Scrapers of loklak server using Post

Loklak Server project is a software that scrapes data from different websites through different endpoints. It is difficult to create a single endpoint. For a single endpoint, there is a need of a decent design for using multiple scrapers. For such a task, multiple changes are needed. That is why one of the changes I introduced was Post class that acts as both wrapper and an interface for data objects of search scrapers (though implementation in scrapers is in progress).

Post is a subclass of JSONObject that helps in working with JSON data in Java. In other words, Post is a JSONObject with an identity (we call it postId) and and a timestamp of the data scraped. It is used to capture data fetched by the web-scrapers. Benefit of JSONObject as superclass is that it provides methods to capture and access data efficiently.

Why Post?

At present there is a Class MessageEntry which is the superclass of TwitterTweet (data object of TwitterScraper). It has numerous methods that can be used by data objects to clean and analyse data. But it has a disadvantage, it is a specialized for social websites like Twitter, but will become redundant for different types websites like Quora, Github, etc.

Whereas Post object is a small but powerful and flexible object with its ability to deal with data like JSONObject. It contains getter and setter methods, identity members used to provide each Post object a unique identity. It doesn’t have any methods for analysis and cleaning of data, but MessageEntry class’ methods can be used for this purpose.

Uses of Post Object

When I started working on Post Object, it could be used as marker interface for data objects. Following are the advantages I came up with it:

1) Accessing the data object of any scraper using its variable. And yes, this is the primary reason it is an interface.

2) But in addition to accessing the data objects, one can also directly use it to fetch, modify or use data without knowing the scraper it belongs. This feature is useful in Timeline iterator.

This is an example how Post interface is used to append two lists of Posts (maybe carrying different type of data) into one.

public void mergePost(PostTimeline list) {
    for (Post post: list) {
        this.add(post);
    }
}

 

Post as a wrapper object

While working on Post object, I converted it into a class to also use it as a wrapper. But why a wrapper? Wrapper can be used to wrap a list of Post objects into one object. It doesn’t have any identity or timestamp. It is just a utility to dump a pack of data objects with homogeneous attributes.

This is an example implementation of Post object as wrapper. typeArray is a wrapper which is used to store 2 arrays of data objects in it. These data object arrays are timeline objects that are saved as JSONArray objects in the Post wrapper.

    Post typeArray = new Post(true);
    switch(type) {
        case "users":
            typeArray.put("users", scrapeProfile(br, url).toArray());
            break;
        case "question":
            typeArray.put("question", scrapeQues(br, url).toArray());
            break;
        default:
            break;
    }

 

Resources:

 

Continue ReadingUnifying Data from Different Scrapers of loklak server using Post