Making loklak Server’s Kaizen Harvester Extendable

Harvesting strategies in loklak are something that the end users can’t see, but they play a vital role in deciding the way in which messages are collected with loklak. One of the strategies in loklak is defined by the Kaizen Harvester, which generates queries from collected messages.

The original strategy used a simple hash queue which drops new queries once it is full. This behaviour is not desirable, as important queries that come up late while harvesting get lost. To overcome this without losing important search queries, we needed to come up with new harvesting strategies that would provide a better approach for harvesting. In this blog post, I am discussing the changes made in the Kaizen harvester so it can be extended to create different flavors of harvesters.

What can be different in extended harvesters?

To make the Kaizen harvester extendable, we first needed to decide which parts of the original Kaizen harvester could be changed to make the strategy different (and probably better).

Since the way it stores queries to be processed is one of the most crucial parts of the Kaizen harvester, it was one of the most obvious things to change. Another thing that should be configurable across strategies is the decision of whether to harvest the queries from the query list or to grab fresh suggestions.

Query storage with KaizenQueries

To allow different methods of storing the queries, KaizenQueries class was introduced in loklak. It was configured to provide basic methods that would be required for a query storing technique to work. A query storing technique can be any data structure that we can use to store search queries for Kaizen harvester.

public abstract class KaizenQueries {

     public abstract boolean addQuery(String query);

     public abstract String getQuery();

     public abstract int getSize();

     public abstract int getMaxSize();

     public boolean isEmpty() {
         return this.getSize() == 0;
     }
}

[SOURCE]

Also, a default implementation of KaizenQueries was introduced for use in the original Kaizen harvester. It provides the same behaviour as the original queue that was used in the harvester, behind the new interface.
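
As a rough sketch (names and details here are illustrative, not necessarily the exact code in loklak), such a default implementation could simply wrap an insertion-ordered set with a fixed capacity –

public static KaizenQueries getDefaultKaizenQueries(int maxSize) {
    return new KaizenQueries() {
        // LinkedHashSet keeps insertion order and silently ignores duplicate queries
        private final Set<String> queries = new LinkedHashSet<>();

        @Override
        public boolean addQuery(String query) {
            if (query == null || this.queries.size() >= maxSize) {
                return false;  // behave like the original queue: drop queries once full
            }
            return this.queries.add(query);
        }

        @Override
        public String getQuery() {
            Iterator<String> it = this.queries.iterator();
            if (!it.hasNext()) {
                return null;
            }
            String query = it.next();
            it.remove();  // remove the query we are about to process
            return query;
        }

        @Override
        public int getSize() {
            return this.queries.size();
        }

        @Override
        public int getMaxSize() {
            return maxSize;
        }
    };
}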

Another constructor was introduced in the Kaizen harvester which allows setting the KaizenQueries instance for derived classes. It exposes KaizenQueries inside the Kaizen harvester so that any inherited strategy can use it –

private KaizenQueries queries = null;

public KaizenHarvester(KaizenQueries queries) {
    ...
    this.queries = queries;
    ...
}

public void someMethod() {
    ...
    this.queries.addQuery(someQuery);
    ...
}

[SOURCE]

With this in place, getting or adding queries became simple: we just need to use the getQuery() and addQuery() methods without worrying about the internal implementation.

Configurable decision for harvesting

As mentioned earlier, the decision taken for harvesting should also be configurable. For this, a protected method was implemented and used in harvest() method –

protected boolean shallHarvest() {
    float targetProb = random.nextFloat();
    float prob = 0.5F;
    if (this.queries.getMaxSize() > 0) {
        prob = queries.getSize() / (float)queries.getMaxSize();
    }
    return !this.queries.isEmpty() && targetProb < prob;
}

@Override
public int harvest() {
    if (this.shallHarvest()) {
        return harvestMessages();
    }

    grabSuggestions();

    return 0;
}

[SOURCE]

The shallHarvest() method can be overridden by derived classes to allow any type of harvesting decision that they want. For example, we can configure it in such a way that it harvests if there are any queries in the queue, or to use a different probability distribution instead of linear (maybe gaussian around 1).
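
For instance, a hypothetical subclass (purely an illustration, not part of the actual change) could bias the decision towards harvesting regardless of how full the queue is –

public class EagerKaizenHarvester extends KaizenHarvester {

    private final Random random = new Random();

    public EagerKaizenHarvester(KaizenQueries queries) {
        super(queries);
    }

    @Override
    protected boolean shallHarvest() {
        // harvest roughly 80% of the time, independent of the queue fill ratio
        return this.random.nextFloat() < 0.8f;
    }
}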

Conclusion

This blog post explained the changes made in loklak’s Kaizen harvester which allow other harvesting strategies to be built on top of it. It discussed the two components that were changed and how they make it easy to inherit from the original Kaizen harvester. These changes were proposed in PR loklak/loklak_server#1203 by @singhpratyush (me).


Adding React based World Mood Tracker to loklak Apps

loklak apps is a website that hosts various apps built using the loklak API. It uses static pages and angular.js to make API calls and show the results to users. As a part of my GSoC project, I had to introduce the World Mood Tracker app using loklak’s mood API. But since I had planned to work with React, I had to deviate from the typical app development track in loklak apps and integrate a React app into apps.loklak.org.

In this blog post, I will be discussing how I introduced a React based app to apps.loklak.org and solved the problem of country-wise visualisation of mood related data on a World map.

Setting up development environment inside apps.loklak.org

After following the steps to create a new app in apps.loklak.org, I needed to add proper tools and libraries for smooth development of the World Mood Tracker app. In this section, I’ll be explaining the basic configuration that made it possible for a React app to be functional in the angular environment.

Pre-requisites

The most obvious prerequisite for the project was Node.js. I used node v8.0.0 during development of the app. Instead of npm, I decided to go with yarn because of offline caching and Internet speed issues in India.

Webpack and Babel

To begin with, I initiated yarn in the app directory inside the project and added basic dependencies –

$ yarn init
$ yarn add webpack webpack-dev-server path
$ yarn add babel-loader babel-core babel-preset-es2015 babel-preset-react --dev

 

Next, I configured webpack to set an entry point and output path for the node project in webpack.config.js

const path = require('path');

module.exports = {
    entry: './js/index.js',
    output: {
        path: path.resolve('.'),
        filename: 'index_bundle.js'
    },
    ...
};

This tells webpack to use ./js/index.js as the entry point while bundling. Similarly, I configured babel for the es2015 and React presets –

{
  "presets":[
    "es2015", "react"
  ]
}

 

After this, I could define the module loaders in webpack.config.js. The loaders check for /\.js$/ and /\.jsx$/ files and assign them to babel-loader (with an exclusion of node_modules).
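
The corresponding part of webpack.config.js would look roughly like this (an illustrative sketch of the configuration described above; the exact options used in the app may differ) –

module.exports = {
    ...
    module: {
        loaders: [
            // Run .js and .jsx files through babel-loader, leaving node_modules untouched
            { test: /\.js$/, loader: 'babel-loader', exclude: /node_modules/ },
            { test: /\.jsx$/, loader: 'babel-loader', exclude: /node_modules/ }
        ]
    }
};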

React

After configuring the basic presets and loaders, I added React to dependencies of the project –

$ yarn add react react-dom

 

The React related files needed to be in the ./js/ directory so that webpack could bundle them. I used the entry file (./js/index.js) to create a simple React app –

import React from 'react';
import ReactDOM from 'react-dom';

ReactDOM.render(
    <div>World Mood Tracker</div>,
    document.getElementById('app')
);

 

At this stage, it was possible to use this app as a part of apps.loklak.org. But to do this, I first needed to compile and bundle these files so that the external app could use them.

Configuring the build target for webpack

In apps.loklak.org, each app needs to have a file named index.html in its root directory. Here, we also needed to place the bundled JS properly so it could be included in index.html at the app’s root.

HTML Webpack Plugin

Using html-webpack-plugin, I enabled automatic building of the project into the app’s root directory by using the following configuration in webpack.config.js –

...
const HtmlWebpackPlugin = require('html-webpack-plugin');
const HtmlWebpackPluginConfig = new HtmlWebpackPlugin({
    template: './js/index.html',
    filename: 'index.html',
    inject: 'body'
});
module.exports = {
    ...
    plugins: [HtmlWebpackPluginConfig]
};

 

This builds index.html at the app’s root, which is discoverable externally.

To enable bundling of the project using a simple yarn build command, the following lines were added to package.json –

{
  ..
  "scripts": {
    ..
    "build": "webpack -p"
  }
}

After a simple yarn build command, we can see the bundled js and html being created at the app root.

Using datamaps for visualization

Datamaps is a JS library which allows plotting of data on a map, using D3 as the backend. It provides a simple interface for creating visualizations and comes with a handy npm installation –

$ yarn add datamaps

Map declaration and usage as state

A map from datamaps was used as a state of the React component, which allows fast re-rendering of the map as the state of the React component changes –

export default class WorldMap extends React.Component {
    constructor() {
        super();
        this.state = {
            map: null
        };
    }
    render() {
        return (<div className={styles.container}>
                <div id="map-container"></div></div>)
    }
    componentDidMount() {
        this.setState({map: new Datamap({...})});
    }
    ...
}

 

The declaration of the map goes in the componentDidMount method because the map can only be drawn after the component has mounted; before that, the div with id="map-container" is not present in the DOM and the map initialisation would fail.

Defining data for countries

Data for every country had two components –

data = {
    positiveScore: someValue,
    negativeScore: someValue
}

 

This data is used to generate popups for the countries –

this.setState({
    map: new Datamap({
        ...
        geographyConfig: {
            ...
            popupTemplate: function (geo, data) {
                // Configure variables pScore so that it gives “No Data” when data.positiveScore is not set (similar for negative)
                return [
                    // Use pScore and nScore to generate results here
                    // geo.properties.name would give current country name
                ].join('');
            }
        }
    })
});

The result for countries with unknown data values looks something like this –

Conclusion

In this blog post, I explained how a React based app was introduced in apps.loklak.org’s angular based environment. I discussed the setup and bundling process of the project so that it becomes available from the project’s external HTTP server.

I also discussed using datamaps as a visualisation tool for data about Tweets. The app was first introduced in pull request fossasia/apps.loklak.org#189 and was improved step by step in subsequent patches.


Introducing Priority Kaizen Harvester for loklak server

In the previous blog post, I discussed the changes made in loklak’s Kaizen harvester so it could be extended and other harvesting strategies could be introduced. Those changes made it possible to introduce a new harvesting strategy as PriorityKaizen harvester which uses a priority queue to store the queries that are to be processed. In this blog post, I will be discussing the process through which this new harvesting strategy was introduced in loklak.

Background, motivation and approach

Before jumping into the changes, we first need to understand why we need this new harvesting strategy. Let us start by discussing the issue with the Kaizen harvester.

The producer-consumer imbalance in Kaizen harvester

Kaizen uses a simple hash queue to store queries. When the queue is full, new queries are dropped. But the number of queries produced after searching for one query is much higher than the consumption rate, i.e. the queries are bound to overflow and new queries that arrive get dropped. (See loklak/loklak_server#1156)

Learnings from attempt to add blocking queue for queries

As a solution to this problem, I first tried to use a blocking queue to store the queries. In this implementation, the producers would get blocked from putting queries in the queue if it is full, and would wait until there is space for more. This way, we would have a good balance between consumers and producers, as the producers would be waiting until the consumers free up space for them –

public class BlockingKaizenHarvester extends KaizenHarvester {
   ...
   public BlockingKaizenHarvester() {
       super(new KaizenQueries() {
           ...
           private BlockingQueue<String> queries = new ArrayBlockingQueue<>(maxSize);

           @Override
           public boolean addQuery(String query) {
               if (this.queries.contains(query)) {
                   return false;
               }
               try {
                   this.queries.offer(query, this.blockingTimeout, TimeUnit.SECONDS);
                   return true;
               } catch (InterruptedException e) {
                   DAO.severe("BlockingKaizen Couldn't add query: " + query, e);
                   return false;
               }
           }
           @Override
           public String getQuery() {
               try {
                   return this.queries.take();
               } catch (InterruptedException e) {
                   DAO.severe("BlockingKaizen Couldn't get any query", e);
                   return null;
               }
           }
           ...
       });
   }
}

[SOURCE, loklak/loklak_server#1210]

But there is an issue here. The consumers are themselves producers, at an even higher rate. When a search is performed, the resulting queries are requested to be appended to the harvester’s KaizenQueries instance (which here is backed by a blocking queue). Now let us consider the case where the queue is full and a thread takes a query from the queue and scrapes data. When the scraping is finished, many new queries are requested to be inserted, but most of them get blocked (because the queue would be full again after one query gets inserted).

Therefore, using a blocking queue in KaizenQueries is not a good thing to do.

Other considerations

After the failure of introducing the Blocking Kaizen harvester, we looked for other alternatives for storing queries. We came across multilevel queues, persistent disk queues and priority queues.

Multilevel queues sounded like a good idea at first, where we would have multiple queues for storing queries. But eventually, this would just boil down to how much total queue size we are allowing, and the queries would eventually get dropped.

Persistent disk queues would allow us to store a greater number of queries, but the major disadvantage was lookup time. It would be terribly slow to check if a query already exists in the disk queue when the queue is large. Also, since the number of queries would keep increasing in practice, the disk queue would also go out of hand at some point in time.

So by now, we were clear that dropping queries is unavoidable. What we had to do was use the limited size queue smartly so that we do not drop queries that are important.

Solution: Priority Queue

So a good solution to our problem was a priority queue. We could assign a higher score to queries that come from more popular Tweets; they would then sit higher in the queue and not drop off until even higher priority queries arrive in the queue.

Assigning score to a Tweet

Score for a tweet was decided using the following formula –

α = 5 * (retweet count) + (favourite count)

score = α / (α + 10 * exp(-0.01 * α))

This equation generates a score between zero and one from the retweet and favourite count of a Tweet. This normalisation of score would ensure we do not assign an insanely large score to Tweets with a high retweet and favourite count. You can see the behaviour for the second mentioned equation here.
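
As a rough sketch, the same computation can be written in Java as follows (the method and variable names are illustrative and not taken from the actual patch) –

private static double getScore(long retweetCount, long favouriteCount) {
    double alpha = 5 * retweetCount + favouriteCount;
    // normalises the result to a value between 0 and 1:
    // 0 for an unpopular Tweet, approaching 1 for a very popular one
    return alpha / (alpha + 10 * Math.exp(-0.01 * alpha));
}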


Changes required in existing Kaizen harvester

To take the score into account, it became necessary to extend the interface so that a score can also be passed as a parameter to the addQuery() method in KaizenQueries. Also, not all queries can have a score associated with them; for example, if we add a query that searches for Tweets older than the oldest one in the current timeline, giving it a score wouldn’t be possible as it is not associated with a single Tweet. To tackle this, a default score of 0.5 is given to such queries –

public abstract class KaizenQueries {

   public boolean addQuery(String query) {
       return this.addQuery(query, 0.5);
   }

   public abstract boolean addQuery(String query, double score);
   ...
}

[SOURCE]

Defining appropriate KaizenQueries object

The KaizenQueries object for a priority queue had to define a wrapper class that would hold the query and its score together so that they could be inserted in a queue as a single object.

ScoreWrapper and comparator

The ScoreWrapper is a simple class that stores score and query object together –

private class ScoreWrapper {

   private double score;
   private String query;

   ScoreWrapper(String m, double score) {
       this.query = m;
       this.score = score;
   }

}

[SOURCE]

In order to define a way to sort the ScoreWrapper objects in the priority queue, we need to define a Comparator for it –

private Comparator<ScoreWrapper> scoreComparator = (scoreWrapper, t1) -> Double.compare(t1.score, scoreWrapper.score);  // head of the queue = highest score

[SOURCE]

Putting things together

Now that we have all the ingredients to declare our priority queue, we can also declare the strategy to getQuery and putQuery in the corresponding KaizenQueries object –

public class PriorityKaizenHarvester extends KaizenHarvester {

   private static class PriorityKaizenQueries extends KaizenQueries {
       ...
       private Queue<ScoreWrapper> queue;
       private int maxSize;

       public PriorityKaizenQueries(int size) {
           this.maxSize = size;
           queue = new PriorityQueue<>(size, scoreComparator);
       }

       @Override
       public boolean addQuery(String query, double score) {
           ScoreWrapper sw = new ScoreWrapper(query, score);
           if (this.queue.contains(sw)) {
               return false;
           }
           try {
               this.queue.add(sw);
               return true;
           } catch (IllegalStateException e) {
               return false;
           }
       }

       @Override
       public String getQuery() {
           return this.queue.poll().query;
       }
       ...
}

[SOURCE]

Conclusion

In this blog post, I discussed the process in which PriorityKaizen harvester was introduced to loklak. This strategy is a flavour of Kaizen harvester which uses a priority queue to store queries that are to be processed. These changes were possible because of a previous patch which allowed extending of Kaizen harvester.

The changes were introduced in pull request loklak/loklak#1240 by @singhpratyush (me).


Fetching URL for Embedded Twitter Videos in loklak server

The primary web service that loklak scrapes is Twitter. Being a news and social networking service, Twitter allows its users to post videos directly to Twitter, and videos often convey more than text alone can. But for an automated scraper, getting the video links is not a simple task.

Let us see what problems we faced with videos and how we solved them in the loklak server project.

Previous setup and embedded videos

In the previous version of loklak server, the TwitterScraper searched for videos in 2 ways –

  1. Youtube links
  2. HTML5 video links

To fetch the video URL from an HTML5 video tag, the following snippet was used –

if ((p = input.indexOf("<source video-src")) >= 0 && input.indexOf("type=\"video/") > p) {
   String video_url = new prop(input, p, "video-src").value;
   videos.add(video_url);
   continue;
}

Here, input is the current line from raw HTML that is being processed and prop is a class defined in loklak that is useful in parsing HTML attributes. So in this way, the HTML5 videos were extracted.

The Problem – Embedded videos

Though the previous setup had no issues with plain HTML5 videos, it was of little use for Twitter’s own videos, as Twitter embeds them in an iFrame and therefore they can’t be fetched using simple HTML5 tag extraction.

If we take the following Tweet for example,

the requested HTML from the search page contains the video in the following format –

<iframe src="https://twitter.com/i/videos/tweet/881946694413422593?embed_source=clientlib&player_id=0&rpc_init=1" allowfullscreen="" id="player_tweet_881946694413422593" style="width: 100%; height: 100%; position: absolute; top: 0; left: 0;"></iframe>

So we needed to come up with a better technique to get those videos.

Parsing video URL from iFrame

The <div> which contains video is marked with AdaptiveMedia-videoContainer class. So if a Tweet has an iFrame containing video, it will also have the mentioned class.

Also, the source of iFrame is of the form https://twitter.com/i/videos/tweet/{Tweet-ID}. So now we can programmatically go to any Tweet’s video and parse it to get results.

Extracting video URL from iFrame source

Now that we have the source of iFrame, we can easily get the video source using the following flow –

public final static Pattern videoURL = Pattern.compile("video_url\\\":\\\"(.*?)\\\"");

private static String[] fetchTwitterIframeVideos(String iframeURL) {
   // Read from iframeURL line by line into BufferedReader br
   String line;
   while ((line = br.readLine()) != null) {
       int index;
       if ((index = line.indexOf("data-config=")) >= 0) {
           String jsonEscHTML = (new prop(line, index, "data-config")).value;
           String jsonUnescHTML = HtmlEscape.unescapeHtml(jsonEscHTML);
           Matcher m = videoURL.matcher(jsonUnescHTML);
           if (!m.find()) {
               return new String[]{};
           }
           String url = m.group(1);
           url = url.replace("\\/", "/");  // Clean URL
           /*
            * Play with url and return results
            */
       }
   }
   return new String[]{};  // no embedded video found
}

MP4 and M3U8 URLs

If we encounter mp4 URLs, we’re fine as it is the direct link to video. But if we encounter m3u8 URL, we need to process it further before we can actually get to the videos.

For Twitter, the hosted m3u8 files contain links to further m3u8 playlists of different resolutions. These m3u8 playlists in turn contain links to various .ts files that hold the actual video in segments of 3 seconds each, to support a better streaming experience on the web.

To resolve videos in such a setup, we need to recursively parse m3u8 files and collect all the .ts videos.

private static String[] extractM3u8(String url) {
   return extractM3u8(url, "https://video.twimg.com/");
}

private static String[] extractM3u8(String url, String baseURL) {
   List<String> links = new ArrayList<>();
   // Read from baseURL + url line by line into BufferedReader br
   String line;
   while ((line = br.readLine()) != null) {
       if (line.startsWith("#")) {  // Skip comments in m3u8
           continue;
       }
       String currentURL = (new URL(new URL(baseURL), line)).toString();
       if (currentURL.endsWith(".m3u8")) {
           String[] more = extractM3u8(currentURL, baseURL);  // Recursively add all
           Collections.addAll(links, more);
       } else {
           links.add(currentURL);
       }
   }
   return links.toArray(new String[links.size()]);
}

And then in fetchTwitterIframeVideos, we can return all the .ts URLs for the video –

if (url.endsWith(".mp4")) {
   return new String[]{url};
} else if (url.endsWith(".m3u8")) {
   return extractM3u8(url);
}

Putting things together

Finally, the TwitterScraper can discover the video links by tweaking a little –

if (input.indexOf("AdaptiveMedia-videoContainer") > 0) {
   // Fetch Tweet ID
   String tweetURL = props.get("tweetstatusurl").value;
   int slashIndex = tweetURL.lastIndexOf('/');
   if (slashIndex < 0) {
       continue;
   }
   String tweetID = tweetURL.substring(slashIndex + 1);
   String iframeURL = "https://twitter.com/i/videos/tweet/" + tweetID;
   String[] videoURLs = fetchTwitterIframeVideos(iframeURL);
   Collections.addAll(videos, videoURLs);
}

Conclusion

This blog post explained the process of extracting video URLs from Twitter and the problems faced. The discussed change enabled loklak to extract and serve URLs to videos for Tweets. It was introduced in PR loklak/loklak_server#1193 by me (@singhpratyush).

The service was further enhanced to collect single mp4 link for videos (see PR loklak/loklak_server#1206), which is discussed in another blog post.


Fetching URL for Complete Twitter Videos in loklak server

In the previous blog post, I discussed how to fetch the URLs for Twitter videos in parts (.ts extension). But getting a video in parts is not beneficial, as loklak users have to carry out the following tasks in order to make sense of it:

  • Placing the videos in correct order (the videos are divided into 3-second sections).
  • Having proper libraries and video player to play the .ts extension.

This would require fairly complex loklak clients and hence the requirement was to have complete video in a single link with a popular extension. In this blog post, I’ll be discussing how I managed to get links to complete Twitter videos.

Guests and Twitter Videos

Most of the content on Twitter is publicly accessible and we don’t need an account to access it. And this public content includes videos too. So, there should be some way in which Twitter would be handling guest users and serving them the videos. We needed to replicate the same flow in order to get links to those videos.

Problem with Twitter video and static HTML

In Twitter, the videos are not served with the static HTML of a page. It is generally rendered using a front-end JavaScript framework. Let us take an example of mobile.twitter.com website.

Let us consider the video from a tweet of @HiHonourIndia

We can see that the page is rendered using ReactJS and we also have the direct link for the video –

“So what’s the problem then? We can just request the web page and parse HTML to get video link, right?”

Wrong. As I mentioned earlier, the pages are rendered using React and when we initially request it, it looks something like this –

The HTML contains no link to video whatsoever, and keeping in mind that we would be getting the previously mentioned HTML, the scraper wouldn’t be getting any video link either.

We, therefore, need to mimic the flow which is followed internally in the web app to get the video link and play them.

Mimicking the flow of Twitter Mobile to get video links

After tracking the XHR requests made by the Twitter Mobile web app, one can come up with the following flow to get video URLs.

Mobile URL for a Tweet

Getting mobile URL for a tweet is very simple –

String mobileUrl = "https://mobile.twitter.com" + tweetUrl;

Here, tweetUrl is of the form /user/tweetID.

Guest Token and Bearer JS URL

The bearer JS is a file which contains the Bearer Token. Along with a Guest Token, it is used to authenticate against the Twitter API to get details about a conversation. The guest token and bearer script URL can be extracted from the static mobile page –

Pattern bearerJsUrlRegex = Pattern.compile("showFailureMessage\\(\\'(.*?main.*?)\\'\\);");
Pattern guestTokenRegex = Pattern.compile("document\\.cookie \\= decodeURIComponent\\(\\\"gt\\=([0-9]+);");
String bearerJsUrl = null;
String guestToken = null;
ClientConnection conn = new ClientConnection(mobileUrl);
BufferedReader br = new BufferedReader(new InputStreamReader(conn.inputStream, StandardCharsets.UTF_8));
String line;
while ((line = br.readLine()) != null) {
   if (bearerJsUrl != null && guestToken != null) {
       // Both the entities are found
       break;
   }
   if (line.length() == 0) {
       continue;
   }
   Matcher m = bearerJsUrlRegex.matcher(line);
   if (m.find()) {
       bearerJsUrl = m.group(1);
       continue;
   }
   m = guestTokenRegex.matcher(line);
   if (m.find()) {
       guestToken = m.group(1);
   }
}

[SOURCE]

Getting Bearer Token from Bearer JS URL

The following simple method can be used to fetch the Bearer Token from URL –

private static final Pattern bearerTokenRegex = Pattern.compile("BEARER_TOKEN:\\\"(.*?)\\\"");
private static String getBearerTokenFromJs(String jsUrl) throws IOException {
   ClientConnection conn = new ClientConnection(jsUrl);
   BufferedReader br = new BufferedReader(new InputStreamReader(conn.inputStream, StandardCharsets.UTF_8));
   String line = br.readLine();
   Matcher m = bearerTokenRegex.matcher(line);
   if (m.find()) {
       return m.group(1);
   }
   throw new IOException("Couldn't get BEARER_TOKEN");
}

[SOURCE]

Using the Guest Token and Bearer Token to get Video Links

The following method demonstrates the process of getting video links once we have all the required information –

private static String[] getConversationVideos(String tweetId, String bearerToken, String guestToken) throws IOException {
   String conversationApiUrl = "https://api.twitter.com/2/timeline/conversation/" + tweetId + ".json";
   CloseableHttpClient httpClient = getCustomClosableHttpClient(true);
   HttpGet req = new HttpGet(conversationApiUrl);
   req.setHeader("User-Agent", "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.98 Safari/537.36");
   req.setHeader("Authorization", "Bearer " + bearerToken);
   req.setHeader("x-guest-token", guestToken);
   HttpEntity entity = httpClient.execute(req).getEntity();
   String html = getHTML(entity);
   consumeQuietly(entity);
   try {
       JSONArray arr = (new JSONObject(html)).getJSONObject("globalObjects").getJSONObject("tweets")
               .getJSONObject(tweetId).getJSONObject("extended_entities").getJSONArray("media");
       JSONObject obj2 = (JSONObject) arr.get(0);
       JSONArray videos = obj2.getJSONObject("video_info").getJSONArray("variants");
       ArrayList<String> urls = new ArrayList<>();
       for (int i = 0; i < videos.length(); i++) {
           String url = ((JSONObject) videos.get(i)).getString("url");
           urls.add(url);
       }
       return urls.toArray(new String[urls.size()]);
   } catch (JSONException e) {
       // This is not an issue. Sometimes, there are videos in long conversations but other ones get media class
       //  div, so this fetching process is triggered.
   }
   return new String[]{};
}

[SOURCE]

Checking if a Tweet contains video

If a tweet contains a video, we can add the following lines to recognise it in TwitterScraper.java

if (input.indexOf("AdaptiveMedia-videoContainer") > 0) {
   // Do necessary things
}

[SOURCE]

Limitations

Though this method successfully extracts the video links to complete Twitter videos, it makes the scraping process very slow. This is because, for every tweet that contains a video, three HTTP requests are made in order to finalise the tweet. And keeping in mind that there are up to 20 Tweets per search from Twitter, we get instances where more than 10 of them are videos (30 HTTP requests). Also, there is a lot of JSON and regex processing involved which adds a little to the whole “slow down” thing.

Conclusion

This post explained how loklak server was improved to fetch links to complete videos from Twitter and the exact flow of requests required to achieve this. The changes were proposed in pull request loklak/loklak_server#1206.


URL Unshortening in Java for loklak server

There are many URL shortening services on the internet. They are useful in converting really long URLs to shorter ones. But apart from redirecting to a longer URL, they are often used to track the people visiting those links.

One of the components of loklak server is its URL unshortening and redirect resolution service, which ensures that websites can’t track users through those links and enhances privacy protection. Let us see how this service works in loklak.

Redirect Codes in HTTP

Various standards define 3XX status codes as an indication that the client must perform additional actions to complete the request. These response codes range from 300 to 308, based on the type of redirection.

To check the redirect code of a request, we must first make a request to some URL –

String urlstring = "http://tinyurl.com/8kmfp";
HttpRequestBase req = new HttpGet(urlstring);

Next, we will configure this request to disable redirects and add a nice User-Agent so that websites do not block us as a robot –

req.setConfig(RequestConfig.custom().setRedirectsEnabled(false).build());
req.setHeader("User-Agent", "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.98 Safari/537.36");

Now we need a HTTP client to execute this request. Here, we will use Apache’s CloseableHttpClient

CloseableHttpClient httpClient = HttpClients.custom()
                                   .setConnectionManager(getConnctionManager(true))
                                   .setDefaultRequestConfig(defaultRequestConfig)
                                   .build();

The getConnctionManager returns a pooling connection manager that can reuse the existing TCP connections, making the requests very fast. It is defined in org.loklak.http.ClientConnection.

Now we have a client and a request. Let’s make our client execute the request and we shall get an HTTP entity on which we can work.

HttpResponse httpResponse = httpClient.execute(req);
HttpEntity httpEntity = httpResponse.getEntity();

Now that we have executed the request, we can check the status code of the response by calling the corresponding method –

if (httpEntity != null) {
   int httpStatusCode = httpResponse.getStatusLine().getStatusCode();
   System.out.println("Status code - " + httpStatusCode);
} else {
   System.out.println("Request failed");
}

Hence, we have the HTTP code for the requests we make.

Getting the Redirect URL

We can simply check for the value of the status code and decide whether we have a redirect or not. In the case of a redirect, we can check for the “Location” header to know where it redirects.

if (300 <= httpStatusCode && httpStatusCode <= 308) {
   for (Header header: httpResponse.getAllHeaders()) {
       if (header.getName().equalsIgnoreCase("location")) {
           redirectURL = header.getValue();
       }
   }
}

Handling Multiple Redirects

We now know how to get the redirect for a URL. But in many cases, URLs redirect multiple times before reaching the final, stable location. To handle these situations, we can repeatedly fetch the redirect URL for intermediate links until the URL stops changing. But we also need to take care of cyclic redirects, so we set a threshold on the number of redirects we follow –

String urlstring = "http://tinyurl.com/8kmfp";
int termination = 10;
while (termination-- > 0) {
   String unshortened = getRedirect(urlstring);
   if (unshortened.equals(urlstring)) {
       return urlstring;
   }
   urlstring = unshortened;
}

Here, getRedirect is the method which performs a single redirect for a URL and returns the same URL in case of a non-redirect status code.
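
Putting the pieces above together, a minimal single-step getRedirect could look like the following sketch (assuming the httpClient built earlier; the actual loklak implementation additionally handles meta refresh, which is described next) –

String getRedirect(String urlstring) throws IOException {
    HttpRequestBase req = new HttpGet(urlstring);
    req.setConfig(RequestConfig.custom().setRedirectsEnabled(false).build());
    req.setHeader("User-Agent", "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.98 Safari/537.36");
    HttpResponse httpResponse = httpClient.execute(req);  // the CloseableHttpClient built earlier
    HttpEntity httpEntity = httpResponse.getEntity();
    if (httpEntity == null) {
        return urlstring;  // request failed, keep the original URL
    }
    int httpStatusCode = httpResponse.getStatusLine().getStatusCode();
    if (300 <= httpStatusCode && httpStatusCode <= 308) {
        for (Header header : httpResponse.getAllHeaders()) {
            if (header.getName().equalsIgnoreCase("location")) {
                EntityUtils.consumeQuietly(httpEntity);
                return header.getValue();  // single-step redirect target
            }
        }
    }
    EntityUtils.consumeQuietly(httpEntity);
    return urlstring;  // not a redirect (or no Location header)
}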

Redirect with non-3XX HTTP status – meta refresh

In addition to performing redirects through 3XX codes, some websites also contain a <meta http-equiv="refresh" … > which performs an unconditional redirect from the client side. To detect these types of redirects, we need to look into the HTML content of a response and parse the URL from it. Let us see how –

String getMetaRedirectURL(HttpEntity httpEntity) throws IOException {
   StringBuilder sb = new StringBuilder();
   BufferedReader reader = new BufferedReader(new InputStreamReader(httpEntity.getContent()));
   String content = null;
   while ((content = reader.readLine()) != null) {
       sb.append(content);
   }
   String html = sb.toString();
   html = html.replace("\n", "");
   if (html.length() == 0)
       return null;
   int indexHttpEquiv = html.toLowerCase().indexOf("http-equiv=\"refresh\"");
   if (indexHttpEquiv < 0) {
       return null;
   }
   html = html.substring(indexHttpEquiv);
   int indexContent = html.toLowerCase().indexOf("content=");
   if (indexContent < 0) {
       return null;
   }
   html = html.substring(indexContent);
   int indexURLStart = html.toLowerCase().indexOf(";url=");
   if (indexURLStart < 0) {
       return null;
   }
   html = html.substring(indexURLStart + 5);
   int indexURLEnd = html.toLowerCase().indexOf("\"");
   if (indexURLEnd < 0) {
       return null;
   }
   return html.substring(0, indexURLEnd);
}

This method tries to find the URL from meta tag and returns null if it is not found. This can be called in case of non-redirect status code as a last attempt to fetch the URL –

String getRedirect(String urlstring) throws IOException {
   ...
   if (300 <= httpStatusCode && httpStatusCode <= 308) {
       ...
   } else {
       String metaURL = getMetaRedirectURL(httpEntity);
       EntityUtils.consumeQuietly(httpEntity);
       if (metaURL != null) {
           if (!metaURL.startsWith("http")) {
               URL u = new URL(new URL(urlstring), metaURL);
               return u.toString();
           }
           return metaURL;
       }
        return urlstring;
   }
   ...
}

In this implementation, we can see that there is a check for metaURL starting with http because there may be relative URLs in the meta tag. The java.net.URL library is used to create a final URL string from the relative URL. It can handle all the possibilities of a valid relative URL.
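
For example, a relative target from a meta tag on a hypothetical page resolves against the original URL like this –

URL base = new URL("http://example.com/some/page.html");
URL resolved = new URL(base, "../other/target.html");
System.out.println(resolved);  // prints http://example.com/other/target.html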

Conclusion

This blog post explains the resolution of shortened and redirected URLs in Java. It covers defining requests, executing them using an HTTP client and processing the resultant response to get a redirect URL. It also explains how to perform these operations repeatedly to resolve multiple shortenings/redirects and, finally, how to fetch the redirect URL from the meta tag.

Loklak uses this inbuilt URL unshortening service to resolve redirects for a URL. If you find this blog post interesting, please take a look at the URL unshortening service of loklak.

Improving Harvesting Decision for Kaizen Harvester in loklak server

About Kaizen Harvester

Kaizen is an alternative approach to harvesting in loklak. It focuses on query and information collection to generate more queries from collected timelines. It maintains a queue of queries that is populated by extracting the following information from timelines –

  1. Hashtags in Tweets
  2. User mentions in Tweets
  3. Tweets from areas near each Tweet in the timeline.
  4. Tweets older than the oldest Tweet in the timeline.

Further, it can also utilise the Twitter API to get trending keywords from Twitter and get search suggestions from other loklak peers.

It was introduced by @yukiisbored in pull request loklak/loklak_server#960.

The Problem: Unbiased Harvesting Decision

The Kaizen harvester either searches for queries from the queue, or tries to grab trending queries (using Twitter API or from backend). In the previous version of KaizenHarvester, the decision of “harvesting vs. info-grabbing” was taken based on the value from a random boolean generator –

@Override
public int harvest() {
   if (!queries.isEmpty() && random.nextBoolean())
       return harvestMessages();

   grabSuggestions();

   return 0;
}

[SOURCE]

In sane situations, the Kaizen harvester is configured to use a fixed size queue and to drop the queries which are requested to be added once the queue is full. And since the decision doesn’t take into account how full the queue is, it would often call the grabSuggestions() method.

But since the queue would be full, the grabbed suggestions would simply be lost. This results in a waste of time and resources in fetching the suggestions (from the backend or the API). To overcome this, something better had to be done in this part.

The Solution: Making Decision Biased

To solve the problem of the dumb harvesting decision, the decision was changed so that it is taken using the following steps –

  1. Calculate the ratio of queue filled (q.size() / q.maxSize()).
  2. Generate a random floating point number between 0 and 1.
  3. If the number is less than the fraction, harvest. Otherwise get harvesting suggestions.

Why would this work?

Initially, when the queue is mostly empty, the ratio would be a small number. So, it would be highly probable that a random number generated between 0 and 1 would be greater than the ratio. And Kaizen would go for grabbing search suggestions.

If this ratio is large (i.e. the queue is almost full), it would be highly likely that the random number generated would be less than it, making it more likely to search for results instead of grabbing suggestions.


The following graph shows how the harvester decision would change. It performs 10k iterations for a given queue ratio and plots the number of times harvesting decision was taken.
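
Such a simulation is easy to reproduce with a few lines of Java (an illustrative sketch, not the code used to generate the original graph) –

Random random = new Random();
for (int step = 0; step <= 10; step++) {
    float ratio = step / 10.0f;  // fraction of the queue that is filled
    int harvested = 0;
    for (int i = 0; i < 10000; i++) {
        if (random.nextFloat() < ratio) {
            harvested++;  // the biased decision chose harvesting
        }
    }
    System.out.println("queue ratio " + ratio + " -> harvested " + harvested + " times out of 10000");
}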

Change in code

The harvest() method was changed in loklak/loklak_server#1158 to take a smart decision of harvesting vs. info-grabbing in the following manner –

@Override
public int harvest() {
   float targetProb = random.nextFloat();
   float prob = 0.5F;
   if (QUERIES_LIMIT > 0) {
       prob = queries.size() / (float)QUERIES_LIMIT;
   }
   if (!queries.isEmpty() && targetProb < prob) {
       return harvestMessages();
   }

   grabSuggestions();

   return 0;
}

[SOURCE]

Conclusion

This change enhanced the Kaizen harvester and made it more sensitive to how fast its queue is filling. There are no more requests made to the backend for suggestions whose queries would not get added to the queue anyway.

 


Using NodeBuilder to instantiate node based Elasticsearch client and Visualising data

As elastic.co mentions, Elasticsearch is a distributed, RESTful search and analytics engine capable of solving a growing number of use cases. But in many setups, it is not possible to manually install an Elasticsearch node on a machine. To handle these types of scenarios, Elasticsearch provides the NodeBuilder module, which can be used to spawn an Elasticsearch node programmatically. Let’s see how.

Getting Dependencies

In order to get the ES Java API, we need to add the following line to dependencies.

compile group: 'org.elasticsearch', name: 'securesm', version: '1.0'

The required packages will be fetched the next time we run gradle build.

Configuring Settings

In the Elasticsearch Java API, Settings are used to configure the node(s). To create a node, we first need to define its properties.

Settings.Builder settings = new Settings.Builder();

settings.put("cluster.name", "cluster_name");  // The name of the cluster

// Configuring HTTP details
settings.put("http.enabled", "true");
settings.put("http.cors.enabled", "true");
settings.put("http.cors.allow-origin", "https?:\/\/localhost(:[0-9]+)?/");  // Allow requests from localhost
settings.put("http.port", "9200");

// Configuring TCP and host
settings.put("transport.tcp.port", "9300");
settings.put("network.host", "localhost");

// Configuring node details
settings.put("node.data", "true");
settings.put("node.master", "true");

// Configuring index
settings.put("index.number_of_shards", "8");
settings.put("index.number_of_replicas", "2");
settings.put("index.refresh_interval", "10s");
settings.put("index.max_result_window", "10000");

// Defining paths
settings.put("path.conf", "/path/to/conf/");
settings.put("path.data", "/path/to/data/");
settings.put("path.home", "/path/to/data/");

settings.build();  // Build with the assigned configurations

There are many more settings that can be tuned in order to get desired node configuration.

Building the Node and Getting Clients

The Java API makes it very simple to launch an Elasticsearch node. This example will make use of settings that we just built.

Node elasticsearchNode = NodeBuilder.nodeBuilder().local(false).settings(settings).node();

A piece of cake. Isn’t it? Let’s get a client now, on which we can execute our queries.

Client elasticsearchClient = elasticsearchNode.client();
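
With a client in hand, we can run operations against the node. For example, a quick sanity check could index a document and read it back (the index and type names here are made up for illustration, and this uses the Java client API of that Elasticsearch generation) –

// index a document into a test index
elasticsearchClient.prepareIndex("messages", "message", "1")
        .setSource("{\"text\": \"hello from loklak\"}")
        .execute().actionGet();

// fetch the same document back
GetResponse response = elasticsearchClient.prepareGet("messages", "message", "1")
        .execute().actionGet();
System.out.println(response.getSourceAsString());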

Shutting Down the Node

elasticsearchNode.close();

A nice implementation of using the module can be seen at ElasticsearchClient.java in the loklak project. It uses the settings from a configuration file and builds the node using it.


Visualisation using elasticsearch-head

So by now, we have an Elasticsearch client which is capable of doing all sorts of operations on the node. But how do we visualise the data that is being stored? Writing code and running it every time to check results is tedious and significantly slows down the development/debugging cycle.

To overcome this, we have a web frontend called elasticsearch-head which lets us execute Elasticsearch queries and monitor the cluster.

To run elasticsearch-head, we first need to have grunt-cli installed –

$ sudo npm install -g grunt-cli

Next, we will clone the repository using git and install dependencies –

$ git clone git://github.com/mobz/elasticsearch-head.git
$ cd elasticsearch-head
$ npm install

Next, we simply need to run the server and go to indicated address on a web browser –

$ grunt server

At the top, enter the location at which elasticsearch-head can interact with the cluster and click Connect.

Upon connecting, the dashboard appears, showing the status of the cluster –

The dashboard shown above is from the loklak project (more about it below).

There are 5 major sections in the UI –
1. Overview: The above screenshot, gives details about the indices and shards of the cluster.
2. Index: Gives an overview of all the indices. Also allows adding new indices from the UI.
3. Browser: Gives a browser window for all the documents in the cluster. It looks something like this –

The left pane allows us to set the filter (index, type and field). The table listed is sortable. But we don’t always get what we are looking for manually. So, we have the following two sections.
4. Structured Query: Gives a dead simple UI that can be used to make a well structured request to Elasticsearch. This is what we need to search for to get Tweets from @gsoc that are indexed –

5. Any Request: Gives an advanced console that allows executing any query allowable by the Elasticsearch API.

A little about the loklak project and Elasticsearch

loklak is a server application which is able to collect messages from various sources, including twitter. The server contains a search index and a peer-to-peer index sharing interface. All messages are stored in an elasticsearch index.

Source: github/loklak/loklak_server

The project uses Elasticsearch to index all the data that it collects. It uses NodeBuilder to create an Elasticsearch node and process the index. It is flexible enough to join an existing cluster instead of creating a new one, just by changing the configuration file.

Conclusion

This blog post tries to explain how NodeBuilder can be used to create Elasticsearch nodes and how they can be configured using Elasticsearch Settings.

It also demonstrates the installation and basic usage of elasticsearch-head, which is a great library to visualise and check queries against an Elasticsearch cluster.

The official Elasticsearch documentation is a good source of reference for its Java API and all other aspects.