Timeline – blog.fossasia.org

Simplifying Scrapers using BaseScraper

Post author:Vibhor Verma
Post published:July 18, 2017
Post category:FOSSASIA GSoC loklak
Post comments:0 Comments

Loklak Server‘s main function is to scrape data from websites and other sources and output in different formats like JSON, xml and rss. There are many scrapers in the project that scrape data and output them, but are implemented with different design and libraries which makes them different from each other and a difficult to fix changes.

Due to variation in scrapers’ design, it is difficult to modify them and fix the same issue (any issue, if it appears) in each of them. This issue signals fault in design. To solve this problem, Inheritance can be brought into application. Thus, I created BaseScraper abstract class so that scrapers are more concentrated on fetching data from HTML and all supportive tasks like creating connection with the help of url are defined in BaseScraper.

The concept is pretty easy to implement, but for a perfect implementation, there is a need to go through the complete list of tasks a scraper does.

These are the following tasks with descriptions and how they are implemented using BaseScraper:

Endpoint that triggers the scraper

Every search scraper inherits class AbstractAPIHandler. This is used to fetch get parameters from the endpoint according to which data is scraped from the scraper. The arguments from serviceImpl method is used to generate output and is returned to it as JSONObject.

For this task, the method serviceImpl has been defined in BaseScraper and method getData is implemented to return the output. This method is the driver method of the scraper.

public JSONObject serviceImpl(Query call, HttpServletResponse response, Authorization rights, JSONObjectWithDefault permissions) throws APIException {
    this.setExtra(call);
    return this.getData().toJSON(false, "metadata", "posts");
}

Constructor

The constructor of Scraper defines the base URL of the website to be scraped, name of the scraper and data structure to fetch all get parameters input to the scraper. For get parameters, the Map data structure is used to fetch them from Query object.

Since every scraper has it’s own different base URL, scraper name and get parameters used, so it is implemented in respective Scrapers. QuoraProfileScraper is an example which has these variables defined.

Get all input variables

To get all input variables, there are setters and getters defined for fetching them as Map from Query object in BaseScraper. There is also an abstract method getParam(). It is defined in respective scrapers to fetch the useful parameters for scraper and set them to the scraper’s class variables.

// Setter for get parameters from call object
protected void setExtra(Query call) {
    this.extra = call.getMap();
    this.query = call.get("query", "");
    this.setParam();
}

// Getter for get parameter wrt to its key
public String getExtraValue(String key) {
    String value = "";
    if(this.extra.get(key) != null) {
        value = this.extra.get(key).trim();
    }
    return value;
}

// Defination in QuoraProfileScraper
protected void setParam() {
    if(!"".equals(this.getExtraValue("type"))) {
        this.typeList = Arrays.asList(this.getExtraValue("type").trim().split("\\s*,\\s*"));
    } else {
        this.typeList = new ArrayList<String>();
        this.typeList.add("all");
        this.setExtraValue("type", String.join(",", this.typeList));
    }
}

URL creation for web scraper

The URL creation shall be implemented in a separate method as in TwitterScraper. The following is the rough implementation adapted from one of my pull request:

protected String prepareSearchUrl(String type) {
    URIBuilder url = null;
    String midUrl = "search/";

    try {
        switch(type) {
            case "question":
                url = new URIBuilder(this.baseUrl + midUrl);
                url.addParameter("q", this.query);
                url.addParameter("type", "question");
        .
        .
    }
    .
    .
    return url.toString();
}

Get BufferedReader object from InputStream

getDataFromConnection method fetches the BufferedReader object from ClientConnection. This object reads the web page line by line by the scrape method to fetch data. See here.

ClientConnection connection = new ClientConnection(url);
BufferedReader br = getHtml(connection);
.
.
.
public BufferedReader getHtml(ClientConnection connection) {

    if (connection.inputStream == null) {
        return null;
    }

    BufferedReader br = new BufferedReader(new InputStreamReader(connection.inputStream, StandardCharsets.UTF_8));
    return br;
}

Scraping of data from HTML

The Scraper method for scraping data is declared abstract in BaseScraper and defined in the scraper. This can be a perfect example of implementation for BaseScraper (See code the here) and scraper (here).

Output of data

The output of scrape method is fetched in Post data objects that are implemented for the respective scraper. These Post objects are added to Timeline iterator and which outputs data as JSONArray. Later the objects are output in enclosed Post object wrapper.

This data can be directly output as Post object, but adding it to iterator makes the Post Objects capable to be sorted in an order and be indexed to ElasticSearch.

Resources

Loklak Server: https://github.com/loklak/loklak_server
ElasticSearch: https://www.elastic.co/webinars/getting-started-elasticsearch?elektra=home&storm=sub1

Iterating the Loklak Server data

Post author:Vibhor Verma
Post published:July 18, 2017
Post category:FOSSASIA GSoC loklak
Post comments:0 Comments

Loklak Server is amazing for what it does, but it is more impressive how it does the tasks. Iterators are used for and how to use them, but this project has a customized iterator that iterates Twitter data objects. This iterator is Timeline.java .

Timeline implements an interface iterable (isn’t it iterator?). This interface helps in using Timeline as an iterator and add methods to modify, use or create the data objects. At present, it only iterates Twitter data objects. I am working on it to modify it to iterate data objects from all web scrapers.

The following is a simple example of how an iterator is used.

// Initializing arraylist
List<String> stringsList = Arrays.asList("foo", "bar", "baz");

// Using iterator to display contents of stringsList
System.out.print("Contents of stringsList: ");

Iterator iter = al.iterator();
while(iter.hasNext()) {
    System.out.print(iter.next() + " ");
}

This iterator can only iterate data the way array does. (Then why do we need it?) It does the task of iterating objects perfectly, but we can add more functionality to the iterator.

Timeline iterator iterates the MessageEntry objects i.e. superclass of TwitterTweet objects. According to Javadocs, “Timeline is a structure which holds tweet for the purpose of presentation, There is no tweet retrieval method here, just an iterator which returns the tweets in reverse appearing order.”

Following are some of the tasks it does:

As an iterator:

This basic use of Timeline is to iterate the MessageEntry objects. It not only iterates the data objects, but also fetches them (See here).

// Declare Timeline object according to order the data object has been created
Timeline tline = new Timeline(Timeline.parseOrder("created_at"));

// Adding data objects to the timeline
tline.add(me1);
tline.add(me2);
.
.
.
// Outputing all data objects as array of JSON objects
for (MessageEntry me: tline) {
    JSONArray postArray = new JSONArray();
    for (MessageEntry post : this) {
        postArray.put(post.toJSON());
    }
}

The order of iterating the data objects

Timeline can arrange and iterate the data objects according to the date of creation of the twitter post, number of retweets or number of favourite counts. For this there is an Enum declaration of Order in the Timeline class which is initialized during creation of Timeline object. [link]

    Timeline tline = new Timeline(Timeline.parseOrder("created_at"));

Pagination of data objects

There is an object cursor, some methods, including getter and setters to support pagination of the data objects. It is only internally implemented, but can also be used to return a section of the result.

writeToIndex method

This method can be used to write all data fetched by Timeline iterator to ElasticSearch for indexing and to dump that can be used for testing. Thus, indexing of data can concurrently be done while it is iterated. It is implemented here.

Other methods

It also has methods to output all data as JSON and customized method to add data to Timeline keeping user object and Data separate, etc. There are a bit more things in this iterable class which shall be explored instead.

Resources:

Loklak Server: https://github.com/loklak/loklak_server
Iterable: https://docs.oracle.com/javase/8/docs/api/java/lang/Iterable.html
Use of Iterable: https://stackoverflow.com/questions/1059127/what-is-the-iterable-interface-used-for
ElasticSearch: https://www.elastic.co/webinars/getting-started-elasticsearch?elektra=home&storm=sub1
Ways to iterate through loop: https://crunchify.com/how-to-iterate-through-java-list-4-way-to-iterate-through-loop/

Endpoint that triggers the scraper

Constructor

Get all input variables

URL creation for web scraper

Get BufferedReader object from InputStream

Scraping of data from HTML

Output of data

Resources

As an iterator:

The order of iterating the data objects

Pagination of data objects

writeToIndex method

Other methods

Resources: