Multithreading implementation in Loklak Server

Loklak Server is a near-realtime system. It performs a large number of tasks and are very costly in terms of resources. Its basic function is to scrape all data from websites and output it at the endpoint. In addition to scraping data, there is also a need to perform other tasks like refining and cleaning of data. That is why, multiple threads are instantiated. They perform other tasks like:

  1. Refining of data and extract more data

The data fetched needs to be cleaned and refined before outputting it. Some of the examples are:

a) Removal of html tags from tweet text:

After extracting text from html data and feeding to TwitterTweet object, it concurrently runs threads to remove all html from text.

b) Unshortening of url links:

The url links embedded in the tweet text may track the users with the help of shortened urls. To prevent this issue, a thread is instantiated to unshorten the url links concurrently while cleaning of tweet text.

  1. Indexing all JSON output data to ElasticSearch

While extracting JSON data as output, there is a method here in Timeline.java that indexes data to ElasticSearch.

Managing multithreading

To manage multithreading, Loklak Server applies following objects:

1. ExecutorService

To deal with large numbers of threads ExecutorService object is used to handle threads as it helps JVM to prevent any resource overflow. Thread’s lifecycle can be controlled and its creation cost can be optimized. This is the best example of ExecutorService application is here:

.
.
public class TwitterScraper {
    // Creation of at max 40 threads. This sets max number of threads to 40 at a time
    public static final ExecutorService executor = Executors.newFixedThreadPool(40);
    .
    .
    .
    .
    // Feeding of TwitterTweet object with data
    TwitterTweet tweet = new TwitterTweet(
        user.getScreenName(),
        Long.parseLong(tweettimems.value),
        props.get("tweettimename").value,
        props.get("tweetstatusurl").value,
        props.get("tweettext").value,
        Long.parseLong(tweetretweetcount.value),
        Long.parseLong(tweetfavouritecount.value),
        imgs, vids, place_name, place_id,
        user, writeToIndex,  writeToBackend
    );
    // Starting thread to refine TwitterTweet data
    if (tweet.willBeTimeConsuming()) {
       executor.execute(tweet);
    }    .
    .
    .

 

2. basic Thread class

Thread class can also be used instead of ExecutorService in cases where there is no resource crunch. But it is always suggested to use ExecutorService due to its benefits. Thread implementation can be used as an anonymous class like here.

3. Runnable interface

Runnable interface can be used to create an anonymous class or classes which does more task than just a task concurrently. In Loklak Server, TwitterScraper concurrently indexes the data to ElasticSearch, unshortens link and cleans data. Have a look at implementation here.

Resources:

Continue ReadingMultithreading implementation in Loklak Server

Unifying Data from Different Scrapers of loklak server using Post

Loklak Server project is a software that scrapes data from different websites through different endpoints. It is difficult to create a single endpoint. For a single endpoint, there is a need of a decent design for using multiple scrapers. For such a task, multiple changes are needed. That is why one of the changes I introduced was Post class that acts as both wrapper and an interface for data objects of search scrapers (though implementation in scrapers is in progress).

Post is a subclass of JSONObject that helps in working with JSON data in Java. In other words, Post is a JSONObject with an identity (we call it postId) and and a timestamp of the data scraped. It is used to capture data fetched by the web-scrapers. Benefit of JSONObject as superclass is that it provides methods to capture and access data efficiently.

Why Post?

At present there is a Class MessageEntry which is the superclass of TwitterTweet (data object of TwitterScraper). It has numerous methods that can be used by data objects to clean and analyse data. But it has a disadvantage, it is a specialized for social websites like Twitter, but will become redundant for different types websites like Quora, Github, etc.

Whereas Post object is a small but powerful and flexible object with its ability to deal with data like JSONObject. It contains getter and setter methods, identity members used to provide each Post object a unique identity. It doesn’t have any methods for analysis and cleaning of data, but MessageEntry class’ methods can be used for this purpose.

Uses of Post Object

When I started working on Post Object, it could be used as marker interface for data objects. Following are the advantages I came up with it:

1) Accessing the data object of any scraper using its variable. And yes, this is the primary reason it is an interface.

2) But in addition to accessing the data objects, one can also directly use it to fetch, modify or use data without knowing the scraper it belongs. This feature is useful in Timeline iterator.

This is an example how Post interface is used to append two lists of Posts (maybe carrying different type of data) into one.

public void mergePost(PostTimeline list) {
    for (Post post: list) {
        this.add(post);
    }
}

 

Post as a wrapper object

While working on Post object, I converted it into a class to also use it as a wrapper. But why a wrapper? Wrapper can be used to wrap a list of Post objects into one object. It doesn’t have any identity or timestamp. It is just a utility to dump a pack of data objects with homogeneous attributes.

This is an example implementation of Post object as wrapper. typeArray is a wrapper which is used to store 2 arrays of data objects in it. These data object arrays are timeline objects that are saved as JSONArray objects in the Post wrapper.

    Post typeArray = new Post(true);
    switch(type) {
        case "users":
            typeArray.put("users", scrapeProfile(br, url).toArray());
            break;
        case "question":
            typeArray.put("question", scrapeQues(br, url).toArray());
            break;
        default:
            break;
    }

 

Resources:

 

Continue ReadingUnifying Data from Different Scrapers of loklak server using Post

Create Scraper in Javascript for Loklak Scraper JS

Loklak Scraper JS is the latest repository in Loklak project. It is one of the interesting projects because of expected benefits of Javascript in web scraping. It has a Node Javascript engine and is used in Loklak Wok project as bundled package. It has potential to be used in different repositories and enhance them.

Scraping in Python is easy (at least for Pythonistas) as one needs to just import Request library and BeautifulSoup library (lxml as better option), write some lines of code using Request library to get webpage and some lines of bs4 to walk through html and scrape data. This sums up to about less than a hundred lines of coding, where as Javascript coding isn’t easily readable (at least to me) as compared to Python. But it has an advantage, it can easily deal with Javascript in the pages we are scraping. This is one of the motive, Loklak Scraper JS repository was created and we contributed and worked on it.

I recently coded a Javascript scraper in loklak_scraper_js repository. While coding, I found it’s libraries similar to the libraries, I use to code in Python. Therefore, this blog is for Pythonistas how they can start scraping in Javascript as they finish reading and also contribute to Loklak Scraper JS.

First, replace Python interpreter, Request and Beautifulsoup library with Node JS interpreter, Request and Cheerio JS library.

1) Node JS Interpreter: Node JS Interpreter is used to interpret Javascript files. This is different from Python as it deals with the project instead of a module in case of Python. The most compatible Node for most of the libraries is 6.0.0 , where as latest version available(as I checked) is 8.0.0

TIP: use `–save` with npm like here while installing a library.

2) Request Library :- This is used to load webpage to be processed. Similar to one in Python.

Request-promise library, a wrapper around Request with implementation of Bluebird library, improves readability and makes code cleaner (how?).

 

3) Cheerio Library:- A Pythonista (a rookie one) can call it twin of BeautifulSoup Library. But this is faster and is Javascript. It’s selector implementation is nearly identical to jQuery’s.

Let us code a basic Javascript scraper. I will take TimeAndDate scraper from loklak_scraper_js as example here. It inputs place and outputs its local time.

Step#1: fetching HTML from webpage with the help of Request library.

We input url to Request function to fetch the webpage and is saved to `html` variable. This scrapeTimeAndDate() function scrapes data from html

url = "http://www.timeanddate.com/worldclock/results.html?query=London";

request(url, function(error, response, body) {

 if(error) {

    console.log("Error: " + error);

    process.exit(-1);

 }

 html = body;

 scrapeTimeAndDate()

});

 

Step#2: To scrape important data from html using Cheerio JS

list of date and time of locations is embedded in table tag, So we will iterate through <td> and extract text.

  1. a) Load html to Cheerio as we do in beautifulsoup

In Python

soup = BeautifulSoup(html,'html5lib')

 

In Cheerio JS

$ = cheerio.load(html);

 

  1. b) This line finds first tr tag in table tag.

var htmlTime = $("table").find('tr');

 

  1. c) Iterate through td tags data by using each() function. This function acts as loop (in Python) iterating through list of elements in which data will be extracted.

htmlTime.each(function (index, element) {      

  // in python, we will use loop, `for element from elements:`

  tag = $(element).find("td");    // in python, `tag = soup.find_all('td')`

  if( tag.text() != "") {

    .

    .

    //EXTRACT DATA

    .

    .

  } else {

    //go to next td tag

    tag = tag.next();

  }

}

 

  1. d) To extract data

Cheerio JS loads html and uses DOM model traverse through. DOM model considers html is tree. So, go to the tag, and scrape data you want.

//extract location(text) enclosed in tag

location = tag.text();

//go to next tag

tag = tag.next();

//extract time(text) enclosed in tag

time = tag.text();

//save in dictionary like in python

loc_list["location"] = location;

loc_list["time"] = time;

 

Some other useful functions:-

1) $(selector, [context], [root])

returns object of selector(any tag) with class or id inside root

2) $(“table”).attr(name, value)

To get tag object having attribute having `value`

3) obj.html()

To get html enclosed in tags

For more just drop in here

Step#3: Execute scraper using command

node <scrapername>.js

 

Hoping that this blog is able to  how to scrape in Javascript by finding similarities with Python.

Resources:

Continue ReadingCreate Scraper in Javascript for Loklak Scraper JS

Best Practices when writing Tests for loklak Server

Why do we write unit-tests? We write them to ensure that developers’ implementation doesn’t change the behaviour of parts of the project. If there is a change in the behaviour, unit-tests throw errors. This keep developers in ease during integration of the software and ensure lower chances of unexpected bugs.

After setting up the tests in Loklak Server, we were able to check whether there is any error or not in the test. Test failures didn’t mention the error and the exact test case at which they failed. It was YoutubeScraperTest that brought some of the best practices in the project. We modified the tests according to it.

The following are some of the best practices in 5 points that we shall follow while writing unit tests:

  1. Assert the assertions

There are many assert methods which we can use like assertNull, assertEquals etc. But we should use one which describes the error well (being more descriptive) so that developer’s effort is reduced while debugging.

Using these assertions related preferences help in getting to the exact errors on test fails, thus helping in easier debugging of the code.

Some examples can be:-

  • Using assertThat() over assertTrue

assertThat() give more descriptive errors over assertTrue(). Like:-

When assertTrue() is used:

java.lang.AssertionError: Expected: is <true> but: was <false> at org.loklak.harvester.TwitterScraperTest.testSimpleSearch(TwitterScraperTest.java:142) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at org.hamcr.......... 

 

When assertThat() is used:

java.lang.AssertionError:
Expected: is <true>
     but: was <false>
at org.loklak.harvester.TwitterScraperTest.testSimpleSearch(TwitterScraperTest.java:142)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at org.hamcr...........

 

NOTE:- In many cases, assertThat() is preferred over other assert method (read this), but in some cases other methods are used to give better descriptive output (like in next examples)

  • Using assertEquals() over assertThat()

For assertThat()

java.lang.AssertionError:

Expected: is "ar photo #test #car https://pic.twitter.com/vd1itvy8Mx"

but: was "car photo #test #car https://pic.twitter.com/vd1itvy8Mx"

at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)

at org.junit.Assert.assertThat(Ass........

 

For assertEquals()

org.junit.ComparisonFailure: expected:<[c]ar photo #test #car ...> but was:<[]ar photo #test #car ...>

at org.junit.Assert.assertEquals(Assert.java:115)

at org.junit.Assert.assertEquals(Assert.java:144)

at org.loklak.harvester.Twitter.........

 

We can clearly see that second example gives better error description than the first one.(An SO link)

  1. One Test per Behaviour

Each test shall be independent of other with none having mutual dependencies. It shall test only a specific behaviour of the module that is tested.

Have a look of this snippet. This test checks the method that creates the twitter url by comparing the output url method with the expected output url.

@Test

public void testPrepareSearchURL() {

    String url;

    String[] query = {

        "fossasia", "from:loklak_test",

        "spacex since:2017-04-03 until:2017-04-05"

    };

    String[] filter = {"video", "image", "video,image", "abc,video"};

    String[] out_url = {

        "https://twitter.com/search?f=tweets&vertical=default&q=fossasia&src=typd",

        "https://twitter.com/search?f=tweets&vertical=default&q=fossasia&src=typd",

    };

    // checking simple urls

    for (int i = 0; i < query.length; i++) {

        url = TwitterScraper.prepareSearchURL(query[i], "");


        //compare urls with urls created

        assertThat(out_url[i], is(url));

    }

}

 

This unit-test tests whether the method-under-test is able to create twitter link according to query or not.

  1. Selecting test cases for the test

We shall remember that testing is a very costly task in terms of processing. It takes time to execute. That is why, we need to keep the test cases precise and limited. In loklak server, most of the tests are based on connection to the respective websites and this step is very costly. That is why, in implementation, we must use least number of test cases so that all possible corner cases are covered.

  1. Test names

Descriptive test names that are short but give hint about their task which are very helpful. A comment describing what it does is a plus point. The following example is from YoutubeScraperTest. I added this point to my ‘best practices queue’ after reviewing the code (when this module was in review process).

/**

* When try parse video from input stream should check that video parsed.

* @throws IOException if some problem with open stream for reading data.

*/

@Test

public void whenTryParseVideoFromInputStreamShouldCheckThatJSONObjectGood() throws IOException {

    //Some tests related to method

}

 

AND the last one, accessing methods

This point shall be kept in mind. In loklak server, there are some tests that use Reflection API to access private and protected methods. This is the best example for reflection API.

In general, such changes to access specifiers are not allowed, that is why we shall resolve this issue with the help of:-

  •  Setters and Getters (if available, use it or else create them)
  •  Else use Reflection

If the getter methods are not available, using Reflection API will be the last resort to access the private and protected members of the class. Hereunder is a simple example of how a private method can be accessed using Reflection:

void getPrivateMethod() throws Exception {

    A ret = new A();

    Class<?> clazz = ret.getClass();

    Method method = clazz.getDeclaredMethod("changeValue", Integer.TYPE);

    method.setAccessible(true);

    System.out.println(method.invoke(ret, 2)); 
    //set null if method is static

}

 

I should end here. Try applying these practices, go through the links and get sync with these ‘Best Practices’ 🙂

Resources:

Continue ReadingBest Practices when writing Tests for loklak Server

Writing Simple Unit-Tests with JUnit

In the Loklak Server project, we use a number of automation tools like the build testing tool ‘TravisCI’, automated code reviewing tool ‘Codacy’, and ‘Gemnasium’. We are also using JUnit, a java-based unit-testing framework for writing automated Unit-Tests for java projects. It can be used to test methods to check their behaviour whenever there is any change in implementation. These unit-tests are handy and are coded specifically for the project. In the Loklak Server project it is used to test the web-scrapers. Generally JUnit is used to check if there is no change in behaviour of the methods, but in this project, it also helps in keeping check if the website code has been modified, affecting the data that is scraped.

Let’s start with basics, first by setting up, writing a simple Unit-Tests and then Test-Runners. Here we will refer how unit tests have been implemented in Loklak Server to familiarize with the JUnit Framework.

Setting-UP

Setting up JUnit with gradle is easy, You have to do just 2 things:-

1) Add JUnit dependency in build.gradle

Dependencies {

. . .

. . .<other compile groups>. . .

compile group: 'com.twitter', name: 'jsr166e', version: '1.1.0'

compile group: 'com.vividsolutions', name: 'jts', version: '1.13'

compile group: 'junit', name: 'junit', version: '4.12'

compile group: 'org.apache.logging.log4j', name: 'log4j-1.2-api', version: '2.6.2'

compile group: 'org.apache.logging.log4j', name: 'log4j-api', version: '2.6.2'

. . .

. . .

}

 

2) Add source for ‘test’ task from where tests are built (like here).

Save all tests in test directory and keep its internal directory structure identical to src directory structure. Now set the path in build.gradle so that they can be compiled.

sourceSets.test.java.srcDirs = ['test']

 

Writing Unit-Tests

In JUnit FrameWork a Unit-Test is a method that tests a particular behaviour of a section of code. Test methods are identified by annotation @Test.

Unit-Test implements methods of source files to test their behaviour. This can be done by fetching the output and comparing it with expected outputs.

The following test tests if twitter url that is created is valid or not that is to be scraped.

/**

 * This unit-test tests twitter url creation

 */

@Test

public void testPrepareSearchURL() {

String url;

String[] query = {"fossasia", "from:loklak_test",

"spacex since:2017-04-03 until:2017-04-05"};

String[] filter = {"video", "image", "video,image", "abc,video"};

String[] out_url = {

"https://twitter.com/search?f=tweets&vertical=default&q=fossasia&src=typd",

"https://twitter.com/search?f=tweets&vertical=default&q=from%3Aloklak_test&src=typd",

"and other output url strings to be matched…..."

};


// checking simple urls

for (int i = 0; i < query.length; i++) {

url = TwitterScraper.prepareSearchURL(query[i], "");


//compare urls with urls created

assertThat(out_url[i], is(url));

}


// checking urls having filters

for (int i = 0; i < filter.length; i++) {

url = TwitterScraper.prepareSearchURL(query[0], filter[i]);


//compare urls with urls created

assertThat(out_url[i+3], is(url));

}

}

 

Testing the implementation of code is useless as it will either make code more difficult to change or tests useless  . So be cautious while writing tests and keep difference between Implementation and Behaviour in mind.

This is the perfect example for a simple Unit-Test. As we see there are some points, which needs to be observed like:-

1) There is a annotation @Test .

2) Input array of query which is fed to the method TwitterScraper.prepareSearchURL() .

3) Array of urls out_url[], which are the expected urls to output.

4) asserThat() to compare the expected url (in array out_url[]) and the output url (in variable ‘url’).

NOTE: assertEquals() could also be used here, but we prefer to use assert methods to get error message that is readable (We will discuss about this some time later)

And the TestRunner

When we are working on a project, It is not feasible to run tests using gradle as they are first built  (else verified whether tests are build-ready) and then executed. gradle test shall be used only for building and testing the tests. For testing the project, one shall set-up TestRunner. It allows to run specific set of tests, one wants to run.

TestRunners are built once using gradle (with other tests) and can be run whenever you want. Also it is easy to stack up the test classes you want to run in SuiteClasses and @RunWith to run SuiteClasses with the TestRunner.

In loklak server, TestRunner runs the web-scraper tests. They are used by developers to test the changes they have made.

This is a sample TestRunner, code link here .

package org.loklak;


// Library classes imported

import org.junit.runner.RunWith;

import org.junit.runners.Suite;

// Source files to be tested

import org.loklak.harvester.TwitterScraperTest;

import org.loklak.harvester.YoutubeScraperTest;


/*

* TestRunner for harvesters

*/

@RunWith(Suite.class)

@Suite.SuiteClasses({

TwitterScraperTest.class,

YoutubeScraperTest.class

})

public class TestRunner {

}

 

You can also add TestRunners for different sections of the project. Like here it is initialized only to test harvesters.

To run the TestRunner

Add classpath of the jar file of the project and run ‘JUnitCore’ with TestRunner to get output on terminal.

java -classpath .:build/libs/<yourProject>.jar:build/classes/test org.junit.runner.JUnitCore org.loklak.TestRunner

In the project we have set up a shell script to run the tests.

Few points

1) Build the project and tests separately. Build tests only when changed as they take time to be built and executed.

2) Whenever you are done with the coding part, run the tests using TestRunner.

3) Write unit-tests whenever you add a new feature to the project to keep it up-to-date.

Now lets end up here.

So for now, Code it, Test it and Repeat.

Resources:

Continue ReadingWriting Simple Unit-Tests with JUnit

This API or that Library – which one?

Last week, I was playing with a scraper program in Loklak Server project when I came across a library Boilerpipe. There were some issues in the program related to it’s implementation. It worked well. I implemented it, pulled a request but was rejected due to it’s maintenance issues. This wasn’t the first time an API(or a library) has let me down, but this added one more point to my ‘Linear Selection Algorithm’ to select one.

Once Libraries revolutionized the Software Projects and now API‘s are taking abstraction to a greater level. One can find many API’s and libraries on GitHub or on their respective websites, but they may be buggy. This may lead to waste of one’s time and work. I am not blogging to suggest which one to choose between the two, but what to check before getting them into use in development.

So let us select a bunch of these and give score +1 if it satisfies the point, 0 for Don’t care condition and -1 , a BIG NO.

Now initialize the variable score to zero and lets begin.

1. First thing first. is it easy to understand

Does this library code belongs to your knowledge domain? Can you use it without any issue? Also consider your project’s platform compatibility with the library. If you are developing a prototype or a small software(like for an event like Hackathon), you shall choose easy-to-read tutorial as higher priority and score++. But if you are working on a project, you shouldn’t shy going an extra mile and retain the value of score.

2. Does it have any documentation or examples of implementation

It shall have to be well written, well maintained documentation. If it doesn’t, I am ok with examples. Choose well according to your comfort. If none, at least code shall be easy to understand.

3. Does it fulfill all my needs?

Test and try to implement all the methods/ API calls needed for the project. Sometimes it may not have all the methods you need for your application or may be some methods are buggy. Take care of this point, a faulty library can ruin all your hard work.

4. Efficiency and performance (BONUS POINT for this one)

Really important for projects with high capacity/performance issues.

5. See for the Apps where they are implemented

If you are in a hackathon or a dev sprint, Checking for applications working on this API shall work. Just skip the rest of the steps (except the first).

6. Can you find blogs, Stack Overflow questions and tutorials?

If yes, This is a score++

7. An Active Community, a Super GO!

Yaay! An extra plus with the previous point.

8. Don’t tell me it isn’t maintained

This is important as if the library isn’t maintained, you are prone to bugs that may pop up in  future and couldn’t be solved. Also it’s performance can never be improved. If there is no option, It is better to use it’s parts in your code so that you can work on it, if needed.

Now calculate the scores, choose the fittest one and get to work.

So with the deserving library in your hand, my first blog post here ends.

Continue ReadingThis API or that Library – which one?