Introducing Priority Kaizen Harvester for loklak server

Post author:Pratyush
Post published:July 13, 2017
Post category:FOSSASIA GSoC loklak
Post comments:0 Comments

In the previous blog post, I discussed the changes made in loklak’s Kaizen harvester so it could be extended and other harvesting strategies could be introduced. Those changes made it possible to introduce a new harvesting strategy as PriorityKaizen harvester which uses a priority queue to store the queries that are to be processed. In this blog post, I will be discussing the process through which this new harvesting strategy was introduced in loklak.

Background, motivation and approach

Before jumping into the changes, we first need to understand that why do we need this new harvesting strategy. Let us start by discussing the issue with the Kaizen harvester.

The produce consumer imbalance in Kaizen harvester

Kaizen uses a simple hash queue to store queries. When the queue is full, new queries are dropped. But numbers of queries produced after searching for one query is much higher than the consumption rate, i.e. the queries are bound to overflow and new queries that arrive would get dropped. (See loklak/loklak_server#1156)

Learnings from attempt to add blocking queue for queries

As a solution to this problem, I first tried to use a blocking queue to store the queries. In this implementation, the producers would get blocked before putting the queries in the queue if it is full and would wait until there is space for more. This way, we would have a good balance between consumers and producers as the consumers would be waiting until producers can free up space for them –

public class BlockingKaizenHarvester extends KaizenHarvester {
   ...
   public BlockingKaizenHarvester() {
       super(new KaizenQueries() {
           ...
           private BlockingQueue<String> queries = new ArrayBlockingQueue<>(maxSize);

           @Override
           public boolean addQuery(String query) {
               if (this.queries.contains(query)) {
                   return false;
               }
               try {
                   this.queries.offer(query, this.blockingTimeout, TimeUnit.SECONDS);
                   return true;
               } catch (InterruptedException e) {
                   DAO.severe("BlockingKaizen Couldn't add query: " + query, e);
                   return false;
               }
           }
           @Override
           public String getQuery() {
               try {
                   return this.queries.take();
               } catch (InterruptedException e) {
                   DAO.severe("BlockingKaizen Couldn't get any query", e);
                   return null;
               }
           }
           ...
       });
   }
}

[SOURCE, loklak/loklak_server#1210]

But there is an issue here. The consumers themselves are producers of even higher rate. When a search is performed, queries are requested to be appended to the KaizenQueries instance for the object (which here, would implement a blocking queue). Now let us consider the case where queue is full and a thread requests a query from the queue and scrapes data. Now when the scraping is finished, many new queries are requested to be inserted to most of them get blocked (because the queue would be full again after one query getting inserted).

Therefore, using a blocking queue in KaizenQueries is not a good thing to do.

Other considerations

After the failure of introducing the Blocking Kaizen harvester, we looked for other alternatives for storing queries. We came across multilevel queues, persistent disk queues and priority queues.

Multilevel queues sounded like a good idea at first where we would have multiple queues for storing queries. But eventually, this would just boil down to how much queue size are we allowing and the queries would eventually get dropped.

Persistent disk queues would allow us to store greater number of queries but the major disadvantage was lookup time. It would terribly slow to check if a query already exists in the disk queue when the queue is large. Also, since the queries would always increase practically, the disk queue would also go out of hand at some point in time.

So by now, we were clear that not dropping queries is not an alternative. So what we had to use the limited size queue smartly so that we do not drop queries that are important.

Solution: Priority Queue

So a good solution to our problem was a priority queue. We could assign a higher score to queries that come from more popular Tweets and they would go higher in the queue and do not drop off until we have even higher priority queried in the queue.

Assigning score to a Tweet

Score for a tweet was decided using the following formula –

α= 5* (retweet count)+(favourite count)

score=α/(α+10*exp(-0.01*α))

This equation generates a score between zero and one from the retweet and favourite count of a Tweet. This normalisation of score would ensure we do not assign an insanely large score to Tweets with a high retweet and favourite count. You can see the behaviour for the second mentioned equation here.

Graph?

Changes required in existing Kaizen harvester

To take a score into account, it became necessary to add an interface to also provide a score as a parameter to the addQuery() method in KaizenQueries. Also, not all queries can have a score associated with it, for example, if we add a query that would search for Tweets older than the oldest in the current timeline, giving it a score wouldn’t be possible as it would not be associated with a single Tweet. To tackle this, a default score of 0.5 was given to these queries –

public abstract class KaizenQueries {

   public boolean addQuery(String query) {
       return this.addQuery(query, 0.5);
   }

   public abstract boolean addQuery(String query, double score);
   ...
}

[SOURCE]

Defining appropriate KaizenQueries object

The KaizenQueries object for a priority queue had to define a wrapper class that would hold the query and its score together so that they could be inserted in a queue as a single object.

ScoreWrapper and comparator

The ScoreWrapper is a simple class that stores score and query object together –

private class ScoreWrapper {

   private double score;
   private String query;

   ScoreWrapper(String m, double score) {
       this.query = m;
       this.score = score;
   }

}

[SOURCE]

In order to define a way to sort the ScoreWrapper objects in the priority queue, we need to define a Comparator for it –

private Comparator<ScoreWrapper> scoreComparator = (scoreWrapper, t1) -> (int) (scoreWrapper.score - t1.score);

[SOURCE]

Putting things together

Now that we have all the ingredients to declare our priority queue, we can also declare the strategy to getQuery and putQuery in the corresponding KaizenQueries object –

public class PriorityKaizenHarvester extends KaizenHarvester {

   private static class PriorityKaizenQueries extends KaizenQueries {
       ...
       private Queue<ScoreWrapper> queue;
       private int maxSize;

       public PriorityKaizenQueries(int size) {
           this.maxSize = size;
           queue = new PriorityQueue<>(size, scoreComparator);
       }

       @Override
       public boolean addQuery(String query, double score) {
           ScoreWrapper sw = new ScoreWrapper(query, score);
           if (this.queue.contains(sw)) {
               return false;
           }
           try {
               this.queue.add(sw);
               return true;
           } catch (IllegalStateException e) {
               return false;
           }
       }

       @Override
       public String getQuery() {
           return this.queue.poll().query;
       }
       ...
}

[SOURCE]

Conclusion

In this blog post, I discussed the process in which PriorityKaizen harvester was introduced to loklak. This strategy is a flavour of Kaizen harvester which uses a priority queue to store queries that are to be processed. These changes were possible because of a previous patch which allowed extending of Kaizen harvester.

The changes were introduced in pull request loklak/loklak#1240 by @singhpratyush (me).

Resources

Pull request for BlockingKaizen harvester – https://github.com/loklak/loklak_server/pull/1210.
Priority queue in Java documentation – https://docs.oracle.com/javase/7/docs/api/java/util/PriorityQueue.html.
Code for generating the score graph – https://gist.github.com/singhpratyush/fb0758294342b1a5870888b7a70f1355.

Fetching URL for Embedded Twitter Videos in loklak server

Post author:Pratyush
Post published:July 13, 2017
Post category:FOSSASIA GSoC loklak
Post comments:0 Comments

The primary web service that loklak scrapes is Twitter. Being a news and social networking service, Twitter allows its users to post videos directly to Twitter and they convey more thoughts than what text can. But for an automated scraper, getting the links is not a simple task.

Let us see that what were the problems we faced with videos and how we solved them in the loklak server project.

Previous setup and embedded videos

In the previous version of loklak server, the TwitterScraper searched for videos in 2 ways –

Youtube links
HTML5 video links

To fetch the video URL from HTML5 video, following snippet was used –

if ((p = input.indexOf("<source video-src")) >= 0 && input.indexOf("type=\"video/") > p) {
   String video_url = new prop(input, p, "video-src").value;
   videos.add
   continue;
}

Here, input is the current line from raw HTML that is being processed and prop is a class defined in loklak that is useful in parsing HTML attributes. So in this way, the HTML5 videos were extracted.

The Problem – Embedded videos

Though the previous setup had no issues, it was useless as Twitter embeds the videos in an iFrame and therefore, can’t be fetched using simple HTML5 tag extraction.

If we take the following Tweet for example,

the requested HTML from the search page contains video in following format –

<src="https://twitter.com/i/videos/tweet/881946694413422593?embed_source=clientlib&player_id=0&rpc_init=1" allowfullscreen="" id="player_tweet_881946694413422593" style="width: 100%; height: 100%; position: absolute; top: 0; left: 0;">

So we needed to come up with a better technique to get those videos.

Parsing video URL from iFrame

The <div> which contains video is marked with AdaptiveMedia-videoContainer class. So if a Tweet has an iFrame containing video, it will also have the mentioned class.

Also, the source of iFrame is of the form https://twitter.com/i/videos/tweet/{Tweet-ID}. So now we can programmatically go to any Tweet’s video and parse it to get results.

Extracting video URL from iFrame source

Now that we have the source of iFrame, we can easily get the video source using the following flow –

public final static Pattern videoURL = Pattern.compile("video_url\\\":\\\"(.*?)\\\"");

private static String[] fetchTwitterIframeVideos(String iframeURL) {
   // Read fron iframeURL line by line into BufferReader br
   while ((line = br.readLine()) != null ) {
       int index;
       if ((index = line.indexOf("data-config=")) >= 0) {
           String jsonEscHTML = (new prop(line, index, "data-config")).value;
           String jsonUnescHTML = HtmlEscape.unescapeHtml(jsonEscHTML);
           Matcher m = videoURL.matcher(jsonUnescHTML);
           if (!m.find()) {
               return new String[]{};
           }
           String url = m.group(1);
           url = url.replace("\\/", "/");  // Clean URL
           /*
            * Play with url and return results
            */
       }
   }
}

MP4 and M3U8 URLs

If we encounter mp4 URLs, we’re fine as it is the direct link to video. But if we encounter m3u8 URL, we need to process it further before we can actually get to the videos.

For Twitter, the hosted m3u8 videos contain link to further m3u8 videos which are of different resolution. These m3u8 videos again contain link to various .ts files that contain actual video in parts of 3 seconds length each to support better streaming experience on the web.

To resolve videos in such a setup, we need to recursively parse m3u8 files and collect all the .ts videos.

private static String[] extractM3u8(String url) {
   return extractM3u8(url, "https://video.twimg.com/");
}

private static String[] extractM3u8(String url, String baseURL) {
   // Read from baseURL + url line by line
   while ((line = br.readLine()) != null) {
       if (line.startsWith("#")) {  // Skip comments in m3u8
           continue;
       }
       String currentURL = (new URL(new URL(baseURL), line)).toString();
       if (currentURL.endsWith(".m3u8")) {
           String[] more = extractM3u8(currentURL, baseURL);  // Recursively add all
           Collections.addAll(links, more);
       } else {
           links.add(currentURL);
       }
   }
   return links.toArray(new String[links.size()]);
}

And then in fetchTwitterIframeVideos, we can return the all .ts URLs for the video –

if (url.endsWith(".mp4")) {
   return new String[]{url};
} else if (url.endsWith(".m3u8")) {
   return extractM3u8(url);
}

Putting things together

Finally, the TwitterScraper can discover the video links by tweaking a little –

if (input.indexOf("AdaptiveMedia-videoContainer") > 0) {
   // Fetch Tweet ID
   String tweetURL = props.get("tweetstatusurl").value;
   int slashIndex = tweetURL.lastIndexOf('/');
   if (slashIndex < 0) {
       continue;
   }
   String tweetID = tweetURL.substring(slashIndex + 1);
   String iframeURL = "https://twitter.com/i/videos/tweet/" + tweetID;
   String[] videoURLs = fetchTwitterIframeVideos(iframeURL);
   Collections.addAll(videos, videoURLs);
}

Conclusion

This blog post explained the process of extracting video URL from Twitter and the problem faced. The discussed change enabled loklak to extract and serve URLs to video for tweets. It was introduced in PR loklak/loklak_server#1193 by me (@singhpratyush).

The service was further enhanced to collect single mp4 link for videos (see PR loklak/loklak_server#1206), which is discussed in another blog post.

Resources

The m3u extension – https://www.lifewire.com/g00/m3u8-file-2621956.
Loklak’s TwitterScraper – https://github.com/loklak/loklak_server/blob/development/src/org/loklak/harvester/TwitterScraper.java.
iFrames – https://www.w3schools.com/tags/tag_iframe.asp.

Fetching URL for Complete Twitter Videos in loklak server

Post author:Pratyush
Post published:July 8, 2017
Post category:FOSSASIA loklak
Post comments:1 Comment

In the previous blog post, I discussed how to fetch the URLs for Twitter videos in parts (.ts extension). But getting a video in parts is not beneficial as the loklak users have to carry out the following task in order to make sense out of it:

Placing the videos in correct order (the videos are divided into 3-second sections).
Having proper libraries and video player to play the .ts extension.

This would require fairly complex loklak clients and hence the requirement was to have complete video in a single link with a popular extension. In this blog post, I’ll be discussing how I managed to get links to complete Twitter videos.

Guests and Twitter Videos

Most of the content on Twitter is publicly accessible and we don’t need an account to access it. And this public content includes videos too. So, there should be some way in which Twitter would be handling guest users and serving them the videos. We needed to replicate the same flow in order to get links to those videos.

Problem with Twitter video and static HTML

In Twitter, the videos are not served with the static HTML of a page. It is generally rendered using a front-end JavaScript framework. Let us take an example of mobile.twitter.com website.

Let us consider the video from a tweet of @HiHonourIndia –

We can see that the page is rendered using ReactJS and we also have the direct link for the video –

“So what’s the problem then? We can just request the web page and parse HTML to get video link, right?”

Wrong. As I mentioned earlier, the pages are rendered using React and when we initially request it, it looks something like this –

The HTML contains no link to video whatsoever, and keeping in mind that we would be getting the previously mentioned HTML, the scraper wouldn’t be getting any video link either.

We, therefore, need to mimic the flow which is followed internally in the web app to get the video link and play them.

Mimicking the flow of Twitter Mobile to get video links

After tracking the XHR requests made to by the Twitter Mobile web app, one can come up with the forthcoming mentioned flow to get video URLs.

Mobile URL for a Tweet

Getting mobile URL for a tweet is very simple –

String mobileUrl = "https://mobile.twitter.com" + tweetUrl;

Here, tweet URL is of the type /user/tweetID.

Guest Token and Bearer JS URL

The Bearer JS is a file which contains Bearer Token which along with a Guest Token is used to authenticate Twitter API to get details about a conversation. The guest token and bearer script URL can be extracted from the static mobile page –

Pattern bearerJsUrlRegex = Pattern.compile(“showFailureMessage\\(\\'(.*?main.*?)\\’\\);”);
Pattern guestTokenRegex = Pattern.compile(“document\\.cookie \\= decodeURIComponent\\(\\\”gt\\=([0-9]+);”);
ClientConnection conn = new ClientConnection(mobileUrl);
BufferedReader br = new BufferedReader(new InputStreamReader(conn.inputStream, StandardCharsets.UTF_8));
String line;
while ((line = br.readLine()) != null) {
   if (bearerJsUrl != null && guestToken != null) {
       // Both the entities are found
       break;
   }
   if (line.length() == 0) {
       continue;
   }
   Matcher m = bearerJsUrlRegex.matcher(line);
   if (m.find()) {
       bearerJsUrl = m.group(1);
       continue;
   }
   m = guestTokenRegex.matcher(line);
   if (m.find()) {
       guestToken = m.group(1);
   }
}

[SOURCE]

Getting Bearer Token from Bearer JS URL

The following simple method can be used to fetch the Bearer Token from URL –

private static final Pattern bearerTokenRegex = Pattern.compile(“BEARER_TOKEN:\\\”(.*?)\\\””);
private static String getBearerTokenFromJs(String jsUrl) throws IOException {
   ClientConnection conn = new ClientConnection(jsUrl);
   BufferedReader br = new BufferedReader(new InputStreamReader(conn.inputStream, StandardCharsets.UTF_8));
   String line = br.readLine();
   Matcher m = bearerTokenRegex.matcher(line);
   if (m.find()) {
       return m.group(1);
   }
   throw new IOException(“Couldn\’t get BEARER_TOKEN”);
}

[SOURCE]

Using the Guest Token and Bearer Token to get Video Links

The following method demonstrates the process of getting video links once we have all the required information –

private static String[] getConversationVideos(String tweetId, String bearerToken, String guestToken) throws IOException {
   String conversationApiUrl = “https://api.twitter.com/2/timeline/conversation/” + tweetId + “.json”;
   CloseableHttpClient httpClient = getCustomClosableHttpClient(true);
   HttpGet req = new HttpGet(conversationApiUrl);
   req.setHeader(“User-Agent”, “Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.98 Safari/537.36”);
   req.setHeader(“Authorization”, “Bearer “ + bearerToken);
   req.setHeader(“x-guest-token”, guestToken);
   HttpEntity entity = httpClient.execute(req).getEntity();
   String html = getHTML(entity);
   consumeQuietly(entity);
   try {
       JSONArray arr = (new JSONObject(html)).getJSONObject(“globalObjects”).getJSONObject(“tweets”)
               .getJSONObject(tweetId).getJSONObject(“extended_entities”).getJSONArray(“media”);
       JSONObject obj2 = (JSONObject) arr.get(0);
       JSONArray videos = obj2.getJSONObject(“video_info”).getJSONArray(“variants”);
       ArrayList<String> urls = new ArrayList<>();
       for (int i = 0; i < videos.length(); i++) {
           String url = ((JSONObject) videos.get(i)).getString(“url”);
           urls.add(url);
       }
       return urls.toArray(new String[urls.size()]);
   } catch (JSONException e) {
       // This is not an issue. Sometimes, there are videos in long conversations but other ones get media class
       //  div, so this fetching process is triggered.
   }
   return new String[]{};
}

[SOURCE]

Checking if a Tweet contains video

If a tweet contains a video, we can add the following lines to recognise it in TwitterScraper.java –

if (input.indexOf(“AdaptiveMedia-videoContainer”) > 0) {
   // Do necessary things
}

[SOURCE]

Limitations

Though this method successfully extracts the video links to complete Twitter videos, it makes the scraping process very slow. This is because, for every tweet that contains a video, three HTTP requests are made in order to finalise the tweet. And keeping in mind that there are up to 20 Tweets per search from Twitter, we get instances where more than 10 of them are videos (30 HTTP requests). Also, there is a lot of JSON and regex processing involved which adds a little to the whole “slow down” thing.

Conclusion

This post explained how loklak server was improved to fetch links to complete video URLs from Twitter and the exact flow of requests in order to achieve so. The changes were proposed in pull requests loklak/loklak_server#1206.

Resources

TwitterScraper.java after the patch – https://github.com/loklak/loklak_server/blob/7c513ae7949718399a7cf06bb19e80da362ef396/src/org/loklak/harvester/TwitterScraper.java.
Twitter API documentation – https://dev.twitter.com/docs.
Monitoring XHR requests in Chrome web browser – https://www.codexworld.com/how-to/monitor-ajax-requests-google-chrome/.
React dev tools (chrome extension) on Github – https://github.com/facebook/react-devtools.
A blog post explaining about utilising XHR requests to scrape – https://www.codementor.io/codementorteam/how-to-scrape-an-ajax-website-using-python-qw8fuitvi.

URL Unshortening in Java for loklak server

Post author:Pratyush
Post published:July 3, 2017
Post category:GSoC loklak
Post comments:0 Comments

There are many URL shortening services on the internet. They are useful in converting really long URLs to shorter ones. But apart from redirecting to a longer URL, they are often used to track the people visiting those links.

One of the components of loklak server is its URL unshortening and redirect resolution service, which ensures that websites can’t track the users using those links and enhances the protection of privacy. How this service works in loklak.

Redirect Codes in HTTP

Various standards define 3XX status codes as an indication that the client must perform additional actions to complete the request. These response codes range from 300 to 308, based on the type of redirection.

To check the redirect code of a request, we must first make a request to some URL –

String urlstring = "http://tinyurl.com/8kmfp";
HttpRequestBase req = new HttpGet(urlstring);

Next, we will configure this request to disable redirect and add a nice Use-Agent so that websites do not block us as a robot –

req.setConfig(RequestConfig.custom().setRedirectsEnabled(false).build());
req.setHeader("User-Agent", "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.98 Safari/537.36");

Now we need a HTTP client to execute this request. Here, we will use Apache’s CloseableHttpClient –

CloseableHttpClient httpClient = HttpClients.custom()
                                   .setConnectionManager(getConnctionManager(true))
                                   .setDefaultRequestConfig(defaultRequestConfig)
                                   .build();

The getConnctionManager returns a pooling connection manager that can reuse the existing TCP connections, making the requests very fast. It is defined in org.loklak.http.ClientConnection.

Now we have a client and a request. Let’s make our client execute the request and we shall get an HTTP entity on which we can work.

HttpResponse httpResponse = httpClient.execute(req);
HttpEntity httpEntity = httpResponse.getEntity();

Now that we have executed the request, we can check the status code of the response by calling the corresponding method –

if (httpEntity != null) {
   int httpStatusCode = httpResponse.getStatusLine().getStatusCode();
   System.out.println("Status code - " + httpStatusCode);
} else {
   System.out.println("Request failed");
}

Hence, we have the HTTP code for the requests we make.

Getting the Redirect URL

We can simply check for the value of the status code and decide whether we have a redirect or not. In the case of a redirect, we can check for the “Location” header to know where it redirects.

if (300 <= httpStatusCode && httpStatusCode <= 308) {
   for (Header header: httpResponse.getAllHeaders()) {
       if (header.getName().equalsIgnoreCase("location")) {
           redirectURL = header.getValue();
       }
   }
}

Handling Multiple Redirects

We now know how to get the redirect for a URL. But in many cases, the URLs redirect multiple times before reaching a final, stable location. To handle these situations, we can repeatedly fetch redirect URL for intermediate links until we saturate. But we also need to take care of cyclic redirects so we set a threshold on the number of redirects that we have undergone –

String urlstring = "http://tinyurl.com/8kmfp";
int termination = 10;
while (termination-- > 0) {
   String unshortened = getRedirect(urlstring);
   if (unshortened.equals(urlstring)) {
       return urlstring;
   }
   urlstring = unshortened;
}

Here, getRedirect is the method which performs single redirect for a URL and returns the same URL in case of non-redirect status code.

Redirect with non-3XX HTTP status – meta refresh

In addition to performing redirects through 3XX codes, some websites also contain a <meta http-equiv=”refresh” … > which performs an unconditional redirect from the client side. To detect these types of redirects, we need to look into the HTML content of a response and parse the URL from it. Let us see how –

String getMetaRedirectURL(HttpEntity httpEntity) throws IOException {
   StringBuilder sb = new StringBuilder();
   BufferedReader reader = new BufferedReader(new InputStreamReader(httpEntity.getContent()));
   String content = null;
   while ((content = reader.readLine()) != null) {
       sb.append(content);
   }
   String html = sb.toString();
   html = html.replace("\n", "");
   if (html.length() == 0)
       return null;
   int indexHttpEquiv = html.toLowerCase().indexOf("http-equiv=\"refresh\"");
   if (indexHttpEquiv < 0) {
       return null;
   }
   html = html.substring(indexHttpEquiv);
   int indexContent = html.toLowerCase().indexOf("content=");
   if (indexContent < 0) {
       return null;
   }
   html = html.substring(indexContent);
   int indexURLStart = html.toLowerCase().indexOf(";url=");
   if (indexURLStart < 0) {
       return null;
   }
   html = html.substring(indexURLStart + 5);
   int indexURLEnd = html.toLowerCase().indexOf("\"");
   if (indexURLEnd < 0) {
       return null;
   }
   return html.substring(0, indexURLEnd);
}

This method tries to find the URL from meta tag and returns null if it is not found. This can be called in case of non-redirect status code as a last attempt to fetch the URL –

String getRedirect(String urlstring) throws IOException {
   ...
   if (300 <= httpStatusCode && httpStatusCode <= 308) {
       ...
   } else {
       String metaURL = getMetaRedirectURL(httpEntity);
       EntityUtils.consumeQuietly(httpEntity);
       if (metaURL != null) {
           if (!metaURL.startsWith("http")) {
               URL u = new URL(new URL(urlstring), metaURL);
               return u.toString();
           }
           return metaURL;
       }
   return urlstring;
   }
   ...
}

In this implementation, we can see that there is a check for metaURL starting with http because there may be relative URLs in the meta tag. The java.net.URL library is used to create a final URL string from the relative URL. It can handle all the possibilities of a valid relative URL.

Conclusion

This blog post explains about the resolution of shortened and redirected URLs in Java. It explains about defining requests, executing them using an HTTP client and processing the resultant response to get a redirect URL. It also explains about how to perform these operations repeatedly to process multiple shortenings/redirects and finally to fetch the redirect URL from meta tag.

Loklak uses an inbuilt URL shortener to resolve redirects for a URL. If you find this blog post interesting, please take a look at the URL shortening service of loklak.

Improving Harvesting Decision for Kaizen Harvester in loklak server

Post author:Pratyush
Post published:July 3, 2017
Post category:GSoC loklak
Post comments:0 Comments

About Kaizen Harvester

Kaizen is an alternative approach to do harvesting in loklak. It focuses on query and information collecting to generate more queries from collected timelines. It maintains a queue of query that is populated by extracting following information from timelines –

Hashtags in Tweets
User mentions in Tweets
Tweets from areas near to each Tweet in timeline.
Tweets older than oldest Tweet in timeline.

Further, it can also utilise Twitter API to get trending keywords from Twitter and get search suggestions from other loklak peers.

It was introduced by @yukiisbored in pull request loklak/loklak_server#960.

The Problem: Unbiased Harvesting Decision

The Kaizen harvester either searches for queries from the queue, or tries to grab trending queries (using Twitter API or from backend). In the previous version of KaizenHarvester, the decision of “harvesting vs. info-grabbing” was taken based on the value from a random boolean generator –

@Override
public int harvest() {
   if (!queries.isEmpty() && random.nextBoolean())
       return harvestMessages();

   grabSuggestions();

   return 0;
}

[SOURCE]

In sane situations, the Kaizen harvester is configured to use a fixed size queue and drops the queries which are requested to get added once the queue is full. And since the decision doesn’t take into account the amount to which queue is filled, it would often call the grabSuggestions() method.

But since the queue would be full, the grabbed suggestions would simply be lost. This would result in wastage of time and resources in fetching the suggestions (from backend or API). To overcome this, something better was to be done in this part.

The Solution: Making Decision Biased

To solve the problem of dumb harvesting decision, the harvester was triggered based on the following steps –

Calculate the ratio of queue filled (q.size() / q.maxSize()).
Generate a random floating point number between 0 and 1.
If the number is less than the fraction, harvest. Otherwise get harvesting suggestions.

Why would this work?

Initially, when the queue is mostly empty, the ratio would be a small number. So, it would be highly probable that a random number generated between 0 and 1 would be greater than the ratio. And Kaizen would go for grabbing search suggestions.

If this ratio is large (i.e. the queue is almost full), it would be highly likely that the random number generated would be less than it, making it more likely to search for results instead of grabbing suggestions.

Graph?

The following graph shows how the harvester decision would change. It performs 10k iterations for a given queue ratio and plots the number of times harvesting decision was taken.

Change in code

The harvest() method was changed in loklak/loklak_server#1158 to take smart decision of harvesting vs. info-grabbing in following manner –

@Override
public int harvest() {
   float targetProb = random.nextFloat();
   float prob = 0.5F;
   if (QUERIES_LIMIT > 0) {
       prob = queries.size() / (float)QUERIES_LIMIT;
   }
   if (!queries.isEmpty() && targetProb < prob) {
       return harvestMessages();
   }

   grabSuggestions();

   return 0;
}

[SOURCE]

Conclusion

This change brought enhancement in the Kaizen harvester and made it more sensible to how fast its queue if filling. There are no more requests made to backend for suggestions whose queries are not added to the queue.

Resources

Current state of Kaizen Harvester – https://github.com/loklak/loklak_server/blob/development/src/org/loklak/harvester/strategy/KaizenHarvester.java.
Kaizen Harvester usage guide – https://github.com/loklak/loklak_server/blob/development/docs/kaizen.md.
Code used to generate the graph – https://gist.github.com/singhpratyush/8292b6fc815e5a18311848f635724f99.

Using NodeBuilder to instantiate node based Elasticsearch client and Visualising data

Post author:Pratyush
Post published:May 10, 2017
Post category:GSoC loklak Tutorial
Post comments:0 Comments

As elastic.io mentions, Elasticsearch is a distributed, RESTful search and analytics engine capable of solving a growing number of use cases. But in many setups, it is not possible to manually install an Elasticsearch node on a machine. To handle these type of scenarios, Elasticsearch provides the NodeBuilder module, which can be used to spawn Elasticsearch node programmatically. Let’s see how.

Getting Dependencies

In order to get the ES Java API, we need to add the following line to dependencies.

compile group: 'org.elasticsearch', name: 'securesm', version: '1.0'

The required packages will be fetched the next time we gradle build.

Configuring Settings

In the Elasticsearch Java API, Settings are used to configure the node(s). To create a node, we first need to define its properties.

Settings.Builder settings = new Settings.Builder();

settings.put("cluster.name", "cluster_name");  // The name of the cluster

// Configuring HTTP details
settings.put("http.enabled", "true");
settings.put("http.cors.enabled", "true");
settings.put("http.cors.allow-origin", "https?:\/\/localhost(:[0-9]+)?/");  // Allow requests from localhost
settings.put("http.port", "9200");

// Configuring TCP and host
settings.put("transport.tcp.port", "9300");
settings.put("network.host", "localhost");

// Configuring node details
settings.put("node.data", "true");
settings.put("node.master", "true");

// Configuring index
settings.put("index.number_of_shards", "8");
settings.put("index.number_of_replicas", "2");
settings.put("index.refresh_interval", "10s");
settings.put("index.max_result_window", "10000");

// Defining paths
settings.put("path.conf", "/path/to/conf/");
settings.put("path.data", "/path/to/data/");
settings.put("path.home", "/path/to/data/");

settings.build();  // Buid with the assigned configurations

There are many more settings that can be tuned in order to get desired node configuration.

Building the Node and Getting Clients

The Java API makes it very simple to launch an Elasticsearch node. This example will make use of settings that we just built.

Node elasticsearchNode = NodeBuilder.nodeBuilder().local(false).settings(settings).node();

A piece of cake. Isn’t it? Let’s get a client now, on which we can execute our queries.

Client elasticsearhClient = elasticsearchNode.client();

Shutting Down the Node

elasticsearchNode.close();

A nice implementation of using the module can be seen at ElasticsearchClient.java in the loklak project. It uses the settings from a configuration file and builds the node using it.

Visualisation using elasticsearch-head

So by now, we have an Elasticsearch client which is capable of doing all sorts of operations on the node. But how do we visualise the data that is being stored? Writing code and running it every time to check results is a lengthy thing to do and significantly slows down development/debugging cycle.

To overcome this, we have a web frontend called elasticsearch-head which lets us execute Elasticsearch queries and monitor the cluster.

To run elasticsearch-head, we first need to have grunt-cli installed –

$ sudo npm install -g grunt-cli

Next, we will clone the repository using git and install dependencies –

$ git clone git://github.com/mobz/elasticsearch-head.git
$ cd elasticsearch-head
$ npm install

Next, we simply need to run the server and go to indicated address on a web browser –

$ grunt server

At the top, enter the location at which elasticsearch-head can interact with the cluster and Connect.

Upon connecting, the dashboard appears telling about the status of cluster –

The dashboard shown above is from the loklak project (will talk more about it).

There are 5 major sections in the UI –
1. Overview: The above screenshot, gives details about the indices and shards of the cluster.
2. Index: Gives an overview of all the indices. Also allows to add new from the UI.
3. Browser: Gives a browser window for all the documents in the cluster. It looks something like this –

The left pane allows us to set the filter (index, type and field). The table listed is sortable. But we don’t always get what we are looking for manually. So, we have the following two sections.
4. Structured Query: Gives a dead simple UI that can be used to make a well structured request to Elasticsearch. This is what we need to search for to get Tweets from @gsoc that are indexed –

5. Any Request: Gives an advance console that allows executing any query allowable by Elasticsearch API.

A little about the loklak project and Elasticsearch

loklak is a server application which is able to collect messages from various sources, including twitter. The server contains a search index and a peer-to-peer index sharing interface. All messages are stored in an elasticsearch index.

Source: github/loklak/loklak_server

The project uses Elasticsearch to index all the data that it collects. It uses NodeBuilder to create Elasticsearch node and process the index. It is flexible enough to join an existing cluster instead of creating a new one, just by changing the configuration file.

Conclusion

This blog post tries to explain how NodeBuilder can be used to create Elasticsearch nodes and how they can be configured using Elasticsearch Settings.

It also demonstrates the installation and basic usage of elasticsearch-head, which is a great library to visualise and check queries against an Elasticsearch cluster.

The official Elasticsearch documentation is a good source of reference for its Java API and all other aspects.