Unifying Data from Different Scrapers of loklak server using Post

Loklak Server project is a software that scrapes data from different websites through different endpoints. It is difficult to create a single endpoint. For a single endpoint, there is a need of a decent design for using multiple scrapers. For such a task, multiple changes are needed. That is why one of the changes I introduced was Post class that acts as both wrapper and an interface for data objects of search scrapers (though implementation in scrapers is in progress).

Post is a subclass of JSONObject that helps in working with JSON data in Java. In other words, Post is a JSONObject with an identity (we call it postId) and and a timestamp of the data scraped. It is used to capture data fetched by the web-scrapers. Benefit of JSONObject as superclass is that it provides methods to capture and access data efficiently.

Why Post?

At present there is a Class MessageEntry which is the superclass of TwitterTweet (data object of TwitterScraper). It has numerous methods that can be used by data objects to clean and analyse data. But it has a disadvantage, it is a specialized for social websites like Twitter, but will become redundant for different types websites like Quora, Github, etc.

Whereas Post object is a small but powerful and flexible object with its ability to deal with data like JSONObject. It contains getter and setter methods, identity members used to provide each Post object a unique identity. It doesn’t have any methods for analysis and cleaning of data, but MessageEntry class’ methods can be used for this purpose.

Uses of Post Object

When I started working on Post Object, it could be used as marker interface for data objects. Following are the advantages I came up with it:

1) Accessing the data object of any scraper using its variable. And yes, this is the primary reason it is an interface.

2) But in addition to accessing the data objects, one can also directly use it to fetch, modify or use data without knowing the scraper it belongs. This feature is useful in Timeline iterator.

This is an example how Post interface is used to append two lists of Posts (maybe carrying different type of data) into one.

public void mergePost(PostTimeline list) {
    for (Post post: list) {
        this.add(post);
    }
}

 

Post as a wrapper object

While working on Post object, I converted it into a class to also use it as a wrapper. But why a wrapper? Wrapper can be used to wrap a list of Post objects into one object. It doesn’t have any identity or timestamp. It is just a utility to dump a pack of data objects with homogeneous attributes.

This is an example implementation of Post object as wrapper. typeArray is a wrapper which is used to store 2 arrays of data objects in it. These data object arrays are timeline objects that are saved as JSONArray objects in the Post wrapper.

    Post typeArray = new Post(true);
    switch(type) {
        case "users":
            typeArray.put("users", scrapeProfile(br, url).toArray());
            break;
        case "question":
            typeArray.put("question", scrapeQues(br, url).toArray());
            break;
        default:
            break;
    }

 

Resources:

 

CSS Styling Tips Used for loklak Apps

Cascading Style Sheets (CSS) is one of the main factors which is valuable to create beautiful and dynamic websites. So we use CSS for styling our apps in apps.loklak.org.

In this blog post am going to tell you about few rules and tips for using CSS when you style your App:

1.Always try something new – The loklak apps website is very flexible according to the user whomsoever creates an app. The user is always allowed to use any new CSS frameworks to create an app.

2.Strive for Simplicity – As the app grows, we’ll start developing a lot more than we imagine like many CSS rules and elements etc. Some of the rules may also override each other without we noticing it. It’s good practice to always check before adding a new style rule—maybe an existing one could apply.

3.Proper Structured file –

  • Maintain uniform spacing.
  • Always use semantic or “familiar” class/id names.
  • Follow DRY (Don’t Repeat Yourself) Principle.

CSS file of Compare Twitter Profiles App:

#searchBar {
    width:500px;
}

table {
  border-collapse: collapse;
  width: 70%;
}

th, td {
  padding: 8px;
  text-align: center;
  border-bottom: 1px solid#ddd;
}

 

The output screen of the app:


Do’s and Don’ts while using CSS:

  • Pages must continue to work when style sheets are disabled. In this case this means that the apps which are written in apps.loklak.org should run in any and every case. Let’s say for instance, when a user uses a old browsers or bugs or either because of style conflicts.
  • Do not use the !important attribute to override the user’s settings. Using the !important declaration is often considered bad practice because it has side effects that mess with one of CSS’s core mechanisms: specificity. In many cases, using it could indicate poor CSS architecture.
  • If you have multiple style sheets, then make sure to use the same CLASS names for the same concept in all of the style sheets.
    Do not use more than two fonts. Using a lot of fonts simply because you can will result in a messy look.
  • A firm rule for home page design is more is less : the more buttons and options you put on the home page, the less users are capable of quickly finding the information they need.

Resources:

How the Compare Twitter Profiles loklak App works

People usually have a tendency to compare their profiles with others, So this is what exactly this app is used for: To compare Twitter profiles. loklak provides so many API’s which serves different functionalities. One among those API’s which I am using to implement this app is loklak’s User Details API. This API actually help in getting all the details of the user we search giving the user name as the query. In this app am going to implement a comparison between two twitter profiles which is shown in the form of tables on the output screen.

Usage of loklak’s User Profile API in the app:

In this app when the user given in the user names in the search fields as seen below:

The queries entered into the search field are taken and used as query in the User Profile API. The query in the code is taken in the following form:

var userQueryCommand = 'http://api.loklak.org/api/user.json?' +
                       'callback=JSON_CALLBACK&screen_name=' +
                       $scope.query;

var userQueryCommand1 = 'http://api.loklak.org/api/user.json?' +
                        'callback=JSON_CALLBACK&screen_name=' +
                        $scope.query1;

The query return a json output from which we fetch details which we need. A simple query and its json output:

http://api.loklak.org/api/user.json?screen_name=fossasia

Sample json output:

{
  "search_metadata": {"client": "162.158.50.42"},
  "user": {
    "$P": "I",
    "utc_offset": -25200,
    "friends_count": 282,
    "profile_image_url_https": "https://pbs.twimg.com/profile_images/1141238022/fossasia-cubelogo_normal.jpg",
    "listed_count": 185,
    "profile_background_image_url": "http://pbs.twimg.com/profile_background_images/882420659/14d1d447527f8524c6aa0c568fb421d8.jpeg",
    "default_profile_image": false,
    "favourites_count": 1877,
    "description": "#FOSSASIA #OpenTechSummit 2017, March 17-19 in Singapore https://t.co/aKhIo2s1Ck #OpenTech community of developers & creators #Code #Hardware #OpenDesign",
    "created_at": "Sun Jun 20 16:13:15 +0000 2010",
    "is_translator": false,
    "profile_background_image_url_https": "https://pbs.twimg.com/profile_background_images/882420659/14d1d447527f8524c6aa0c568fb421d8.jpeg",
    "protected": false,
    "screen_name": "fossasia",
    "id_str": "157702526",
    "profile_link_color": "DD2E44",
    "is_translation_enabled": false,
    "translator_type": "none",
    "id": 157702526,
    "geo_enabled": true,
    "profile_background_color": "F50000",
    "lang": "en",
    "has_extended_profile": false,
    "profile_sidebar_border_color": "000000",
    "profile_location": null,
    "profile_text_color": "333333",
    "verified": false,
    "profile_image_url": "http://pbs.twimg.com/profile_images/1141238022/fossasia-cubelogo_normal.jpg",
    "time_zone": "Pacific Time (US & Canada)",
    "url": "http://t.co/eLxWZtqTHh",
    "contributors_enabled": false,
    "profile_background_tile": true,
}

 

I am getting data from the json outputs as shown above, I use different fields from the json output like screen_name, favourites_count etc.

Injecting data from loklak API response using Angular:

As the loklak’s user profile API returns a json format file, I am using Angular JS to align the data according to the needs in the app.

I am using JSONP to retrieve the data from the API. JSONP or “JSON with padding” is a JSON extension wherein a prefix is specified as an input argument of the call itself. This how it is written in code:

$http.jsonp(String(userQueryCommand)).success(function (response) {
    $scope.userData = response.user;
 });

Here the response is stored into a $scope is an application object here. Using the $scope.userData variable , we access the data and display it on the screen using Javascript, HTML and CSS.

<div id="contactCard" style="pull-right">
    <div class="panel panel-default">
        <div class="panel-heading clearfix">
            <h3 class="panel-title pull-left">User 1 Profile</h3>
        </div>
        <div class="list-group">
            <div class="list-group-item">
                <img src="{{userData.profile_image_url}}" alt="" style="pull-left">
                <h4 class="list-group-item-heading" >{{userData.name}}</h4>
            </div>

In this app am also adding keyboard action and validations of fields which will not allow users to search for an empty query using this simple line in the input field.

ng-keyup="$event.keyCode == 13 && query1 != '' && query != '' ? Search() : null"

 


Resources:

Introducing Priority Kaizen Harvester for loklak server

In the previous blog post, I discussed the changes made in loklak’s Kaizen harvester so it could be extended and other harvesting strategies could be introduced. Those changes made it possible to introduce a new harvesting strategy as PriorityKaizen harvester which uses a priority queue to store the queries that are to be processed. In this blog post, I will be discussing the process through which this new harvesting strategy was introduced in loklak.

Background, motivation and approach

Before jumping into the changes, we first need to understand that why do we need this new harvesting strategy. Let us start by discussing the issue with the Kaizen harvester.

The produce consumer imbalance in Kaizen harvester

Kaizen uses a simple hash queue to store queries. When the queue is full, new queries are dropped. But numbers of queries produced after searching for one query is much higher than the consumption rate, i.e. the queries are bound to overflow and new queries that arrive would get dropped. (See loklak/loklak_server#1156)

Learnings from attempt to add blocking queue for queries

As a solution to this problem, I first tried to use a blocking queue to store the queries. In this implementation, the producers would get blocked before putting the queries in the queue if it is full and would wait until there is space for more. This way, we would have a good balance between consumers and producers as the consumers would be waiting until producers can free up space for them –

public class BlockingKaizenHarvester extends KaizenHarvester {
   ...
   public BlockingKaizenHarvester() {
       super(new KaizenQueries() {
           ...
           private BlockingQueue<String> queries = new ArrayBlockingQueue<>(maxSize);

           @Override
           public boolean addQuery(String query) {
               if (this.queries.contains(query)) {
                   return false;
               }
               try {
                   this.queries.offer(query, this.blockingTimeout, TimeUnit.SECONDS);
                   return true;
               } catch (InterruptedException e) {
                   DAO.severe("BlockingKaizen Couldn't add query: " + query, e);
                   return false;
               }
           }
           @Override
           public String getQuery() {
               try {
                   return this.queries.take();
               } catch (InterruptedException e) {
                   DAO.severe("BlockingKaizen Couldn't get any query", e);
                   return null;
               }
           }
           ...
       });
   }
}

[SOURCE, loklak/loklak_server#1210]

But there is an issue here. The consumers themselves are producers of even higher rate. When a search is performed, queries are requested to be appended to the KaizenQueries instance for the object (which here, would implement a blocking queue). Now let us consider the case where queue is full and a thread requests a query from the queue and scrapes data. Now when the scraping is finished, many new queries are requested to be inserted to most of them get blocked (because the queue would be full again after one query getting inserted).

Therefore, using a blocking queue in KaizenQueries is not a good thing to do.

Other considerations

After the failure of introducing the Blocking Kaizen harvester, we looked for other alternatives for storing queries. We came across multilevel queues, persistent disk queues and priority queues.

Multilevel queues sounded like a good idea at first where we would have multiple queues for storing queries. But eventually, this would just boil down to how much queue size are we allowing and the queries would eventually get dropped.

Persistent disk queues would allow us to store greater number of queries but the major disadvantage was lookup time. It would terribly slow to check if a query already exists in the disk queue when the queue is large. Also, since the queries would always increase practically, the disk queue would also go out of hand at some point in time.

So by now, we were clear that not dropping queries is not an alternative. So what we had to use the limited size queue smartly so that we do not drop queries that are important.

Solution: Priority Queue

So a good solution to our problem was a priority queue. We could assign a higher score to queries that come from more popular Tweets and they would go higher in the queue and do not drop off until we have even higher priority queried in the queue.

Assigning score to a Tweet

Score for a tweet was decided using the following formula –

α= 5* (retweet count)+(favourite count)

score=α/(α+10*exp(-0.01*α))

This equation generates a score between zero and one from the retweet and favourite count of a Tweet. This normalisation of score would ensure we do not assign an insanely large score to Tweets with a high retweet and favourite count. You can see the behaviour for the second mentioned equation here.

Graph?

Changes required in existing Kaizen harvester

To take a score into account, it became necessary to add an interface to also provide a score as a parameter to the addQuery() method in KaizenQueries. Also, not all queries can have a score associated with it, for example, if we add a query that would search for Tweets older than the oldest in the current timeline, giving it a score wouldn’t be possible as it would not be associated with a single Tweet. To tackle this, a default score of 0.5 was given to these queries –

public abstract class KaizenQueries {

   public boolean addQuery(String query) {
       return this.addQuery(query, 0.5);
   }

   public abstract boolean addQuery(String query, double score);
   ...
}

[SOURCE]

Defining appropriate KaizenQueries object

The KaizenQueries object for a priority queue had to define a wrapper class that would hold the query and its score together so that they could be inserted in a queue as a single object.

ScoreWrapper and comparator

The ScoreWrapper is a simple class that stores score and query object together –

private class ScoreWrapper {

   private double score;
   private String query;

   ScoreWrapper(String m, double score) {
       this.query = m;
       this.score = score;
   }

}

[SOURCE]

In order to define a way to sort the ScoreWrapper objects in the priority queue, we need to define a Comparator for it –

private Comparator<ScoreWrapper> scoreComparator = (scoreWrapper, t1) -> (int) (scoreWrapper.score - t1.score);

[SOURCE]

Putting things together

Now that we have all the ingredients to declare our priority queue, we can also declare the strategy to getQuery and putQuery in the corresponding KaizenQueries object –

public class PriorityKaizenHarvester extends KaizenHarvester {

   private static class PriorityKaizenQueries extends KaizenQueries {
       ...
       private Queue<ScoreWrapper> queue;
       private int maxSize;

       public PriorityKaizenQueries(int size) {
           this.maxSize = size;
           queue = new PriorityQueue<>(size, scoreComparator);
       }

       @Override
       public boolean addQuery(String query, double score) {
           ScoreWrapper sw = new ScoreWrapper(query, score);
           if (this.queue.contains(sw)) {
               return false;
           }
           try {
               this.queue.add(sw);
               return true;
           } catch (IllegalStateException e) {
               return false;
           }
       }

       @Override
       public String getQuery() {
           return this.queue.poll().query;
       }
       ...
}

[SOURCE]

Conclusion

In this blog post, I discussed the process in which PriorityKaizen harvester was introduced to loklak. This strategy is a flavour of Kaizen harvester which uses a priority queue to store queries that are to be processed. These changes were possible because of a previous patch which allowed extending of Kaizen harvester.

The changes were introduced in pull request loklak/loklak#1240 by @singhpratyush (me).

Resources

Create Scraper in Javascript for Loklak Scraper JS

Loklak Scraper JS is the latest repository in Loklak project. It is one of the interesting projects because of expected benefits of Javascript in web scraping. It has a Node Javascript engine and is used in Loklak Wok project as bundled package. It has potential to be used in different repositories and enhance them.

Scraping in Python is easy (at least for Pythonistas) as one needs to just import Request library and BeautifulSoup library (lxml as better option), write some lines of code using Request library to get webpage and some lines of bs4 to walk through html and scrape data. This sums up to about less than a hundred lines of coding, where as Javascript coding isn’t easily readable (at least to me) as compared to Python. But it has an advantage, it can easily deal with Javascript in the pages we are scraping. This is one of the motive, Loklak Scraper JS repository was created and we contributed and worked on it.

I recently coded a Javascript scraper in loklak_scraper_js repository. While coding, I found it’s libraries similar to the libraries, I use to code in Python. Therefore, this blog is for Pythonistas how they can start scraping in Javascript as they finish reading and also contribute to Loklak Scraper JS.

First, replace Python interpreter, Request and Beautifulsoup library with Node JS interpreter, Request and Cheerio JS library.

1) Node JS Interpreter: Node JS Interpreter is used to interpret Javascript files. This is different from Python as it deals with the project instead of a module in case of Python. The most compatible Node for most of the libraries is 6.0.0 , where as latest version available(as I checked) is 8.0.0

TIP: use `–save` with npm like here while installing a library.

2) Request Library :- This is used to load webpage to be processed. Similar to one in Python.

Request-promise library, a wrapper around Request with implementation of Bluebird library, improves readability and makes code cleaner (how?).

 

3) Cheerio Library:- A Pythonista (a rookie one) can call it twin of BeautifulSoup Library. But this is faster and is Javascript. It’s selector implementation is nearly identical to jQuery’s.

Let us code a basic Javascript scraper. I will take TimeAndDate scraper from loklak_scraper_js as example here. It inputs place and outputs its local time.

Step#1: fetching HTML from webpage with the help of Request library.

We input url to Request function to fetch the webpage and is saved to `html` variable. This scrapeTimeAndDate() function scrapes data from html

url = "http://www.timeanddate.com/worldclock/results.html?query=London";

request(url, function(error, response, body) {

 if(error) {

    console.log("Error: " + error);

    process.exit(-1);

 }

 html = body;

 scrapeTimeAndDate()

});

 

Step#2: To scrape important data from html using Cheerio JS

list of date and time of locations is embedded in table tag, So we will iterate through <td> and extract text.

  1. a) Load html to Cheerio as we do in beautifulsoup

In Python

soup = BeautifulSoup(html,'html5lib')

 

In Cheerio JS

$ = cheerio.load(html);

 

  1. b) This line finds first tr tag in table tag.

var htmlTime = $("table").find('tr');

 

  1. c) Iterate through td tags data by using each() function. This function acts as loop (in Python) iterating through list of elements in which data will be extracted.

htmlTime.each(function (index, element) {      

  // in python, we will use loop, `for element from elements:`

  tag = $(element).find("td");    // in python, `tag = soup.find_all('td')`

  if( tag.text() != "") {

    .

    .

    //EXTRACT DATA

    .

    .

  } else {

    //go to next td tag

    tag = tag.next();

  }

}

 

  1. d) To extract data

Cheerio JS loads html and uses DOM model traverse through. DOM model considers html is tree. So, go to the tag, and scrape data you want.

//extract location(text) enclosed in tag

location = tag.text();

//go to next tag

tag = tag.next();

//extract time(text) enclosed in tag

time = tag.text();

//save in dictionary like in python

loc_list["location"] = location;

loc_list["time"] = time;

 

Some other useful functions:-

1) $(selector, [context], [root])

returns object of selector(any tag) with class or id inside root

2) $(“table”).attr(name, value)

To get tag object having attribute having `value`

3) obj.html()

To get html enclosed in tags

For more just drop in here

Step#3: Execute scraper using command

node <scrapername>.js

 

Hoping that this blog is able to  how to scrape in Javascript by finding similarities with Python.

Resources:

Best Practices when writing Tests for loklak Server

Why do we write unit-tests? We write them to ensure that developers’ implementation doesn’t change the behaviour of parts of the project. If there is a change in the behaviour, unit-tests throw errors. This keep developers in ease during integration of the software and ensure lower chances of unexpected bugs.

After setting up the tests in Loklak Server, we were able to check whether there is any error or not in the test. Test failures didn’t mention the error and the exact test case at which they failed. It was YoutubeScraperTest that brought some of the best practices in the project. We modified the tests according to it.

The following are some of the best practices in 5 points that we shall follow while writing unit tests:

  1. Assert the assertions

There are many assert methods which we can use like assertNull, assertEquals etc. But we should use one which describes the error well (being more descriptive) so that developer’s effort is reduced while debugging.

Using these assertions related preferences help in getting to the exact errors on test fails, thus helping in easier debugging of the code.

Some examples can be:-

  • Using assertThat() over assertTrue

assertThat() give more descriptive errors over assertTrue(). Like:-

When assertTrue() is used:

java.lang.AssertionError: Expected: is <true> but: was <false> at org.loklak.harvester.TwitterScraperTest.testSimpleSearch(TwitterScraperTest.java:142) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at org.hamcr.......... 

 

When assertThat() is used:

java.lang.AssertionError:
Expected: is <true>
     but: was <false>
at org.loklak.harvester.TwitterScraperTest.testSimpleSearch(TwitterScraperTest.java:142)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at org.hamcr...........

 

NOTE:- In many cases, assertThat() is preferred over other assert method (read this), but in some cases other methods are used to give better descriptive output (like in next examples)

  • Using assertEquals() over assertThat()

For assertThat()

java.lang.AssertionError:

Expected: is "ar photo #test #car https://pic.twitter.com/vd1itvy8Mx"

but: was "car photo #test #car https://pic.twitter.com/vd1itvy8Mx"

at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)

at org.junit.Assert.assertThat(Ass........

 

For assertEquals()

org.junit.ComparisonFailure: expected:<[c]ar photo #test #car ...> but was:<[]ar photo #test #car ...>

at org.junit.Assert.assertEquals(Assert.java:115)

at org.junit.Assert.assertEquals(Assert.java:144)

at org.loklak.harvester.Twitter.........

 

We can clearly see that second example gives better error description than the first one.(An SO link)

  1. One Test per Behaviour

Each test shall be independent of other with none having mutual dependencies. It shall test only a specific behaviour of the module that is tested.

Have a look of this snippet. This test checks the method that creates the twitter url by comparing the output url method with the expected output url.

@Test

public void testPrepareSearchURL() {

    String url;

    String[] query = {

        "fossasia", "from:loklak_test",

        "spacex since:2017-04-03 until:2017-04-05"

    };

    String[] filter = {"video", "image", "video,image", "abc,video"};

    String[] out_url = {

        "https://twitter.com/search?f=tweets&vertical=default&q=fossasia&src=typd",

        "https://twitter.com/search?f=tweets&vertical=default&q=fossasia&src=typd",

    };

    // checking simple urls

    for (int i = 0; i < query.length; i++) {

        url = TwitterScraper.prepareSearchURL(query[i], "");


        //compare urls with urls created

        assertThat(out_url[i], is(url));

    }

}

 

This unit-test tests whether the method-under-test is able to create twitter link according to query or not.

  1. Selecting test cases for the test

We shall remember that testing is a very costly task in terms of processing. It takes time to execute. That is why, we need to keep the test cases precise and limited. In loklak server, most of the tests are based on connection to the respective websites and this step is very costly. That is why, in implementation, we must use least number of test cases so that all possible corner cases are covered.

  1. Test names

Descriptive test names that are short but give hint about their task which are very helpful. A comment describing what it does is a plus point. The following example is from YoutubeScraperTest. I added this point to my ‘best practices queue’ after reviewing the code (when this module was in review process).

/**

* When try parse video from input stream should check that video parsed.

* @throws IOException if some problem with open stream for reading data.

*/

@Test

public void whenTryParseVideoFromInputStreamShouldCheckThatJSONObjectGood() throws IOException {

    //Some tests related to method

}

 

AND the last one, accessing methods

This point shall be kept in mind. In loklak server, there are some tests that use Reflection API to access private and protected methods. This is the best example for reflection API.

In general, such changes to access specifiers are not allowed, that is why we shall resolve this issue with the help of:-

  •  Setters and Getters (if available, use it or else create them)
  •  Else use Reflection

If the getter methods are not available, using Reflection API will be the last resort to access the private and protected members of the class. Hereunder is a simple example of how a private method can be accessed using Reflection:

void getPrivateMethod() throws Exception {

    A ret = new A();

    Class<?> clazz = ret.getClass();

    Method method = clazz.getDeclaredMethod("changeValue", Integer.TYPE);

    method.setAccessible(true);

    System.out.println(method.invoke(ret, 2)); 
    //set null if method is static

}

 

I should end here. Try applying these practices, go through the links and get sync with these ‘Best Practices’ 🙂

Resources:

URL Unshortening in Java for loklak server

There are many URL shortening services on the internet. They are useful in converting really long URLs to shorter ones. But apart from redirecting to a longer URL, they are often used to track the people visiting those links.

One of the components of loklak server is its URL unshortening and redirect resolution service, which ensures that websites can’t track the users using those links and enhances the protection of privacy. How this service works in loklak.

Redirect Codes in HTTP

Various standards define 3XX status codes as an indication that the client must perform additional actions to complete the request. These response codes range from 300 to 308, based on the type of redirection.

To check the redirect code of a request, we must first make a request to some URL –

String urlstring = "http://tinyurl.com/8kmfp";
HttpRequestBase req = new HttpGet(urlstring);

Next, we will configure this request to disable redirect and add a nice Use-Agent so that websites do not block us as a robot –

req.setConfig(RequestConfig.custom().setRedirectsEnabled(false).build());
req.setHeader("User-Agent", "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.98 Safari/537.36");

Now we need a HTTP client to execute this request. Here, we will use Apache’s CloseableHttpClient

CloseableHttpClient httpClient = HttpClients.custom()
                                   .setConnectionManager(getConnctionManager(true))
                                   .setDefaultRequestConfig(defaultRequestConfig)
                                   .build();

The getConnctionManager returns a pooling connection manager that can reuse the existing TCP connections, making the requests very fast. It is defined in org.loklak.http.ClientConnection.

Now we have a client and a request. Let’s make our client execute the request and we shall get an HTTP entity on which we can work.

HttpResponse httpResponse = httpClient.execute(req);
HttpEntity httpEntity = httpResponse.getEntity();

Now that we have executed the request, we can check the status code of the response by calling the corresponding method –

if (httpEntity != null) {
   int httpStatusCode = httpResponse.getStatusLine().getStatusCode();
   System.out.println("Status code - " + httpStatusCode);
} else {
   System.out.println("Request failed");
}

Hence, we have the HTTP code for the requests we make.

Getting the Redirect URL

We can simply check for the value of the status code and decide whether we have a redirect or not. In the case of a redirect, we can check for the “Location” header to know where it redirects.

if (300 <= httpStatusCode && httpStatusCode <= 308) {
   for (Header header: httpResponse.getAllHeaders()) {
       if (header.getName().equalsIgnoreCase("location")) {
           redirectURL = header.getValue();
       }
   }
}

Handling Multiple Redirects

We now know how to get the redirect for a URL. But in many cases, the URLs redirect multiple times before reaching a final, stable location. To handle these situations, we can repeatedly fetch redirect URL for intermediate links until we saturate. But we also need to take care of cyclic redirects so we set a threshold on the number of redirects that we have undergone –

String urlstring = "http://tinyurl.com/8kmfp";
int termination = 10;
while (termination-- > 0) {
   String unshortened = getRedirect(urlstring);
   if (unshortened.equals(urlstring)) {
       return urlstring;
   }
   urlstring = unshortened;
}

Here, getRedirect is the method which performs single redirect for a URL and returns the same URL in case of non-redirect status code.

Redirect with non-3XX HTTP status – meta refresh

In addition to performing redirects through 3XX codes, some websites also contain a <meta http-equiv=”refresh” … > which performs an unconditional redirect from the client side. To detect these types of redirects, we need to look into the HTML content of a response and parse the URL from it. Let us see how –

String getMetaRedirectURL(HttpEntity httpEntity) throws IOException {
   StringBuilder sb = new StringBuilder();
   BufferedReader reader = new BufferedReader(new InputStreamReader(httpEntity.getContent()));
   String content = null;
   while ((content = reader.readLine()) != null) {
       sb.append(content);
   }
   String html = sb.toString();
   html = html.replace("\n", "");
   if (html.length() == 0)
       return null;
   int indexHttpEquiv = html.toLowerCase().indexOf("http-equiv=\"refresh\"");
   if (indexHttpEquiv < 0) {
       return null;
   }
   html = html.substring(indexHttpEquiv);
   int indexContent = html.toLowerCase().indexOf("content=");
   if (indexContent < 0) {
       return null;
   }
   html = html.substring(indexContent);
   int indexURLStart = html.toLowerCase().indexOf(";url=");
   if (indexURLStart < 0) {
       return null;
   }
   html = html.substring(indexURLStart + 5);
   int indexURLEnd = html.toLowerCase().indexOf("\"");
   if (indexURLEnd < 0) {
       return null;
   }
   return html.substring(0, indexURLEnd);
}

This method tries to find the URL from meta tag and returns null if it is not found. This can be called in case of non-redirect status code as a last attempt to fetch the URL –

String getRedirect(String urlstring) throws IOException {
   ...
   if (300 <= httpStatusCode && httpStatusCode <= 308) {
       ...
   } else {
       String metaURL = getMetaRedirectURL(httpEntity);
       EntityUtils.consumeQuietly(httpEntity);
       if (metaURL != null) {
           if (!metaURL.startsWith("http")) {
               URL u = new URL(new URL(urlstring), metaURL);
               return u.toString();
           }
           return metaURL;
       }
   return urlstring;
   }
   ...
}

In this implementation, we can see that there is a check for metaURL starting with http because there may be relative URLs in the meta tag. The java.net.URL library is used to create a final URL string from the relative URL. It can handle all the possibilities of a valid relative URL.

Conclusion

This blog post explains about the resolution of shortened and redirected URLs in Java. It explains about defining requests, executing them using an HTTP client and processing the resultant response to get a redirect URL. It also explains about how to perform these operations repeatedly to process multiple shortenings/redirects and finally to fetch the redirect URL from meta tag.

Loklak uses an inbuilt URL shortener to resolve redirects for a URL. If you find this blog post interesting, please take a look at the URL shortening service of loklak.

Improving Harvesting Decision for Kaizen Harvester in loklak server

About Kaizen Harvester

Kaizen is an alternative approach to do harvesting in loklak. It focuses on query and information collecting to generate more queries from collected timelines. It maintains a queue of query that is populated by extracting following information from timelines –

  1. Hashtags in Tweets
  2. User mentions in Tweets
  3. Tweets from areas near to each Tweet in timeline.
  4. Tweets older than oldest Tweet in timeline.

Further, it can also utilise Twitter API to get trending keywords from Twitter and get search suggestions from other loklak peers.

It was introduced by @yukiisbored in pull request loklak/loklak_server#960.

The Problem: Unbiased Harvesting Decision

The Kaizen harvester either searches for queries from the queue, or tries to grab trending queries (using Twitter API or from backend). In the previous version of KaizenHarvester, the decision of “harvesting vs. info-grabbing” was taken based on the value from a random boolean generator –

@Override
public int harvest() {
   if (!queries.isEmpty() && random.nextBoolean())
       return harvestMessages();

   grabSuggestions();

   return 0;
}

[SOURCE]

In sane situations, the Kaizen harvester is configured to use a fixed size queue and drops the queries which are requested to get added once the queue is full. And since the decision doesn’t take into account the amount to which queue is filled, it would often call the grabSuggestions() method.

But since the queue would be full, the grabbed suggestions would simply be lost. This would result in wastage of time and resources in fetching the suggestions (from backend or API). To overcome this, something better was to be done in this part.

The Solution: Making Decision Biased

To solve the problem of dumb harvesting decision, the harvester was triggered based on the following steps –

  1. Calculate the ratio of queue filled (q.size() / q.maxSize()).
  2. Generate a random floating point number between 0 and 1.
  3. If the number is less than the fraction, harvest. Otherwise get harvesting suggestions.

Why would this work?

Initially, when the queue is mostly empty, the ratio would be a small number. So, it would be highly probable that a random number generated between 0 and 1 would be greater than the ratio. And Kaizen would go for grabbing search suggestions.

If this ratio is large (i.e. the queue is almost full), it would be highly likely that the random number generated would be less than it, making it more likely to search for results instead of grabbing suggestions.

Graph?

The following graph shows how the harvester decision would change. It performs 10k iterations for a given queue ratio and plots the number of times harvesting decision was taken.

Change in code

The harvest() method was changed in loklak/loklak_server#1158 to take smart decision of harvesting vs. info-grabbing in following manner –

@Override
public int harvest() {
   float targetProb = random.nextFloat();
   float prob = 0.5F;
   if (QUERIES_LIMIT > 0) {
       prob = queries.size() / (float)QUERIES_LIMIT;
   }
   if (!queries.isEmpty() && targetProb < prob) {
       return harvestMessages();
   }

   grabSuggestions();

   return 0;
}

[SOURCE]

Conclusion

This change brought enhancement in the Kaizen harvester and made it more sensible to how fast its queue if filling. There are no more requests made to backend for suggestions whose queries are not added to the queue.

 

Resources

Writing Simple Unit-Tests with JUnit

In the Loklak Server project, we use a number of automation tools like the build testing tool ‘TravisCI’, automated code reviewing tool ‘Codacy’, and ‘Gemnasium’. We are also using JUnit, a java-based unit-testing framework for writing automated Unit-Tests for java projects. It can be used to test methods to check their behaviour whenever there is any change in implementation. These unit-tests are handy and are coded specifically for the project. In the Loklak Server project it is used to test the web-scrapers. Generally JUnit is used to check if there is no change in behaviour of the methods, but in this project, it also helps in keeping check if the website code has been modified, affecting the data that is scraped.

Let’s start with basics, first by setting up, writing a simple Unit-Tests and then Test-Runners. Here we will refer how unit tests have been implemented in Loklak Server to familiarize with the JUnit Framework.

Setting-UP

Setting up JUnit with gradle is easy, You have to do just 2 things:-

1) Add JUnit dependency in build.gradle

Dependencies {

. . .

. . .<other compile groups>. . .

compile group: 'com.twitter', name: 'jsr166e', version: '1.1.0'

compile group: 'com.vividsolutions', name: 'jts', version: '1.13'

compile group: 'junit', name: 'junit', version: '4.12'

compile group: 'org.apache.logging.log4j', name: 'log4j-1.2-api', version: '2.6.2'

compile group: 'org.apache.logging.log4j', name: 'log4j-api', version: '2.6.2'

. . .

. . .

}

 

2) Add source for ‘test’ task from where tests are built (like here).

Save all tests in test directory and keep its internal directory structure identical to src directory structure. Now set the path in build.gradle so that they can be compiled.

sourceSets.test.java.srcDirs = ['test']

 

Writing Unit-Tests

In JUnit FrameWork a Unit-Test is a method that tests a particular behaviour of a section of code. Test methods are identified by annotation @Test.

Unit-Test implements methods of source files to test their behaviour. This can be done by fetching the output and comparing it with expected outputs.

The following test tests if twitter url that is created is valid or not that is to be scraped.

/**

 * This unit-test tests twitter url creation

 */

@Test

public void testPrepareSearchURL() {

String url;

String[] query = {"fossasia", "from:loklak_test",

"spacex since:2017-04-03 until:2017-04-05"};

String[] filter = {"video", "image", "video,image", "abc,video"};

String[] out_url = {

"https://twitter.com/search?f=tweets&vertical=default&q=fossasia&src=typd",

"https://twitter.com/search?f=tweets&vertical=default&q=from%3Aloklak_test&src=typd",

"and other output url strings to be matched…..."

};


// checking simple urls

for (int i = 0; i < query.length; i++) {

url = TwitterScraper.prepareSearchURL(query[i], "");


//compare urls with urls created

assertThat(out_url[i], is(url));

}


// checking urls having filters

for (int i = 0; i < filter.length; i++) {

url = TwitterScraper.prepareSearchURL(query[0], filter[i]);


//compare urls with urls created

assertThat(out_url[i+3], is(url));

}

}

 

Testing the implementation of code is useless as it will either make code more difficult to change or tests useless  . So be cautious while writing tests and keep difference between Implementation and Behaviour in mind.

This is the perfect example for a simple Unit-Test. As we see there are some points, which needs to be observed like:-

1) There is a annotation @Test .

2) Input array of query which is fed to the method TwitterScraper.prepareSearchURL() .

3) Array of urls out_url[], which are the expected urls to output.

4) asserThat() to compare the expected url (in array out_url[]) and the output url (in variable ‘url’).

NOTE: assertEquals() could also be used here, but we prefer to use assert methods to get error message that is readable (We will discuss about this some time later)

And the TestRunner

When we are working on a project, It is not feasible to run tests using gradle as they are first built  (else verified whether tests are build-ready) and then executed. gradle test shall be used only for building and testing the tests. For testing the project, one shall set-up TestRunner. It allows to run specific set of tests, one wants to run.

TestRunners are built once using gradle (with other tests) and can be run whenever you want. Also it is easy to stack up the test classes you want to run in SuiteClasses and @RunWith to run SuiteClasses with the TestRunner.

In loklak server, TestRunner runs the web-scraper tests. They are used by developers to test the changes they have made.

This is a sample TestRunner, code link here .

package org.loklak;


// Library classes imported

import org.junit.runner.RunWith;

import org.junit.runners.Suite;

// Source files to be tested

import org.loklak.harvester.TwitterScraperTest;

import org.loklak.harvester.YoutubeScraperTest;


/*

* TestRunner for harvesters

*/

@RunWith(Suite.class)

@Suite.SuiteClasses({

TwitterScraperTest.class,

YoutubeScraperTest.class

})

public class TestRunner {

}

 

You can also add TestRunners for different sections of the project. Like here it is initialized only to test harvesters.

To run the TestRunner

Add classpath of the jar file of the project and run ‘JUnitCore’ with TestRunner to get output on terminal.

java -classpath .:build/libs/<yourProject>.jar:build/classes/test org.junit.runner.JUnitCore org.loklak.TestRunner

In the project we have set up a shell script to run the tests.

Few points

1) Build the project and tests separately. Build tests only when changed as they take time to be built and executed.

2) Whenever you are done with the coding part, run the tests using TestRunner.

3) Write unit-tests whenever you add a new feature to the project to keep it up-to-date.

Now lets end up here.

So for now, Code it, Test it and Repeat.

Resources: