Data Indexing in Loklak Server

Loklak Server is a data-scraping system that indexes all the scraped data to optimize retrieval. The data fetched by different users is stored as a cache, so recurring queries can be answered directly from it. When users search for the same queries, the load on Loklak Server is reduced because indexed data is returned instead of scraping again, thus optimizing the operations.

Application

Loklak Server depends on Elasticsearch for indexing the cached data (as JSON). When users search for queries that have already been fetched, the indexed data is returned instead of scraping the same data again, which reduces the load on the server.

When is data indexing done?

The indexing of data is done when:

1) Data is scraped:

When data is scraped, it is indexed concurrently while the TwitterTweet data object is being cleaned. For this task, the addScheduler static method of IncomingMessageBuffer is used, which acts as an abstraction layer between scraping the data and storing/indexing it.

The following is the implementation from TwitterScraper (from here). Here writeToIndex is a boolean input that controls whether the data is indexed or not.

if (this.writeToIndex) IncomingMessageBuffer.addScheduler(this, this.user, true);

2) Data is fetched from backend:

When data is fetched from the backend, it is indexed in the Timeline iterator, which calls the above method to index data concurrently.

The following is the definition of the writeToIndex() method from Timeline.java (from here). When writeToIndex() is called, the fetched data is indexed.

public void writeToIndex() {
    IncomingMessageBuffer.addScheduler(this, true);
}

How?

When the addScheduler static method of IncomingMessageBuffer is called, a thread is started that indexes all the data. Indexing proceeds as long as the message queue data structure contains messages.
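As a rough, illustrative sketch of this idea (the real IncomingMessageBuffer differs; everything here apart from addScheduler and DAO.writeMessageBulk is an assumption):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Illustrative only: scrapers enqueue messages, a background thread indexes them in bulk.
public class MessageBufferSketch {

    private static final BlockingQueue<MessageWrapper> queue = new LinkedBlockingQueue<>();

    // called by scrapers/timelines instead of writing to the index directly
    public static void addScheduler(MessageWrapper message) {
        queue.offer(message);
    }

    // background worker: waits for messages and writes them in batches
    public static void startIndexerThread() {
        Thread indexer = new Thread(() -> {
            List<MessageWrapper> batch = new ArrayList<>();
            while (!Thread.currentThread().isInterrupted()) {
                try {
                    batch.add(queue.take());       // block until at least one message arrives
                    queue.drainTo(batch, 200);     // collect a few more for a bulk write
                    DAO.writeMessageBulk(batch);   // dump + index, see below
                    batch.clear();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        });
        indexer.setDaemon(true);
        indexer.start();
    }
}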

See here. The DAO method writeMessageBulk is called there to write the data. The data is then written to the following streams:

1) Dump: The fetched data is dumped into a file in the Import directory. This dump can also be fetched by other peers.

2) Index: The fetched data is checked against the index, and any data that isn’t already indexed is indexed.

public static Set<String> writeMessageBulk(Collection<MessageWrapper> mws) {
    List<MessageWrapper> noDump = new ArrayList<>();
    List<MessageWrapper> dump = new ArrayList<>();
    for (MessageWrapper mw: mws) {
        if (mw.t == null) continue;
        if (mw.dump) dump.add(mw);
        else noDump.add(mw);
    }

    Set<String> createdIDs = new HashSet<>();
    createdIDs.addAll(writeMessageBulkNoDump(noDump));
    createdIDs.addAll(writeMessageBulkDump(dump));

    // Does also do an writeMessageBulkNoDump internally
    return createdIDs;
}

 

The above code snippet is from DAO.java. The method call writeMessageBulkNoDump(noDump) indexes the data into Elasticsearch. The definition of this method can be seen here.

For dumping the data, writeMessageBulkDump(dump) is called. It is defined here.
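As a rough illustration of what the indexing half could look like with the Elasticsearch bulk API (this is a hedged sketch, not the actual writeMessageBulkNoDump; the accessors on MessageWrapper and the index type are assumptions):

// Hypothetical sketch only: index every message that carries a tweet.
// The real method additionally checks whether an ID is already indexed.
private static Set<String> writeMessageBulkNoDumpSketch(Collection<MessageWrapper> mws) {
    Set<String> createdIDs = new HashSet<>();
    BulkRequestBuilder bulk = elasticsearchClient.prepareBulk();
    for (MessageWrapper mw : mws) {
        if (mw.t == null) continue;
        String id = mw.t.getIdStr();                        // assumed accessor
        bulk.add(elasticsearchClient.prepareIndex("messages", "record", id)
                                    .setSource(mw.t.toJSON().toString())); // assumed accessor
        createdIDs.add(id);
    }
    if (bulk.numberOfActions() > 0) bulk.get();             // execute the bulk request
    return createdIDs;
}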

Resources:


Some Other Services in Loklak Server

Loklak Server isn’t just a scraping system; it provides numerous other services that perform interesting functions like link unshortening (the reverse of link shortening) and video fetching, as well as administrative tasks like fetching the status of a Loklak deployment (used for analysis during Loklak development) and many more. Some of these are implemented internally, and the rest can be used through HTTP endpoints. There are also some services which aren’t complete and are still in the development stage.

Let’s go through some of them to know a bit about them and how they can be used.

1) VideoUrlService

This service extracts the video from a website that has a streaming video and outputs the video file link. It is still in the development stage but is already functional: presently it can fetch Twitter video links and output them in different video qualities.

Endpoint: /api/videoUrlService.json

Implementation Example:

curl 'https://api.loklak.org/api/videoUrlService.json?id=https://twitter.com/EXOGlobal/status/886182766970257409&id=https://twitter.com/KMbappe/status/885963850708865025'

2) Link Unshortening Service

This service is used to unshorten links. Shortened URLs are often used by websites to track Internet users. To prevent this, the link unshortening service expands the link and returns the final, untracked URL to the user.

Currently this service is used in TwitterScraper to unshorten the fetched URLs. It has other methods to get the redirect link and to get the final URL from a chain of shortened links.
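The core idea of unshortening is to follow the HTTP redirect chain until a non-redirecting URL is reached. A minimal, self-contained sketch (not the actual RedirectUnshortener implementation) could look like this:

import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

public static String unshorten(String shortUrl) {
    try {
        String current = shortUrl;
        for (int i = 0; i < 5; i++) {                        // limit the redirect depth
            HttpURLConnection conn = (HttpURLConnection) new URL(current).openConnection();
            conn.setInstanceFollowRedirects(false);          // inspect redirects ourselves
            conn.setRequestMethod("HEAD");
            int code = conn.getResponseCode();
            String location = conn.getHeaderField("Location");
            conn.disconnect();
            if (code >= 300 && code < 400 && location != null) {
                current = location;                          // keep following the chain (assumes absolute Location headers)
            } else {
                break;                                       // final, non-redirecting URL
            }
        }
        return current;
    } catch (IOException e) {
        return shortUrl;                                     // fall back to the original link
    }
}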

Implementation Example from TwitterScraper.java [LINK]:

Matcher m = timeline_link_pattern.matcher(text);

if (m.find()) {
    String expanded = RedirectUnshortener.unShorten(m.group(2));
    text = m.replaceFirst(" " + expanded);
    continue;
}

 

Further, it can be exposed as a service and used directly. New features, like fetching the featured image from links, can be added to it. These ideas are still under discussion, and enthusiastic contributions are most welcome.

3) StatusService

This service outputs all data related to a Loklak Server deployment’s configuration. To access this configuration, the status.json API endpoint is used.

It outputs the following data:

a) The number of messages it scrapes per second, minute, hour, day, etc.

b) The configuration of the server, like RAM, assigned memory, used memory, number of CPU cores, CPU load, etc.

c) Other configurations related to the application, like the size and specifications of the Elasticsearch shards, client request headers, number of running threads, etc.

Endpoint: /api/status.json

Implementation Example:

curl https://api.loklak.org/api/status.json

Resources:

 


Using Elasticsearch Aggregations to Analyse Classifier Data in loklak Server

Loklak uses Elasticsearch to index Tweets and other social media entities. It also houses a classifier that classifies Tweets based on emotion, profanity and language. But earlier, this data was available only with the search API and there was no way to get aggregated data out of it. So as a part of my GSoC project, I proposed to introduce a new API endpoint which would allow users to access aggregated data from these classifiers.

In this blog post, I will be discussing how aggregations are performed on the Elasticsearch index of Tweets in the loklak server.

Structure of index

The ES index for Twitter is called messages and it has 3 fields related to classifiers –

  1. classifier_emotion
  2. classifier_language
  3. classifier_profanity

With each of these classifiers, we also have a probability attached, which represents the confidence of the classifier in the class assigned to a Tweet. The names of these fields are given by suffixing the classifier field with _probability (e.g. classifier_emotion_probability).

Since I will also be discussing aggregation based on countries in this blog post, there is also a field named place_country_code which saves the ISO 3166-1 alpha-2 code for the country of creation of Tweet.

Requesting aggregations using Elasticsearch Java API

Elasticsearch comes with a simple Java API which can be used to perform any desired task. To work with data, we need an ES client, which can be built from an ES Node (if creating a cluster) or directly as a transport client (if connecting remotely to a cluster) –

// Transport client
TransportClient tc = TransportClient.builder()
                        .settings(mySettings)
                        .build();

// From a node
Node elasticsearchNode = NodeBuilder.nodeBuilder()
                            .local(false).settings(mySettings)
                            .node();
Client nc = elasticsearchNode.client();

[SOURCE]

Once we have a client, we can use ES AggregationBuilder to get aggregations from an index –

SearchResponse response = elasticsearchClient.prepareSearch(indexName)
                            .setSearchType(SearchType.QUERY_THEN_FETCH)
                            .setQuery(QueryBuilders.matchAllQuery())  // Consider every row
                            .setFrom(0).setSize(0)  // 0 offset, 0 result size (do not return any rows)
                            .addAggregation(aggr)  // aggr is an AggregationBuilder object
                            .execute().actionGet();  // Execute and get results

[SOURCE]

AggregationBuilders are objects that define the properties of an aggregation task using ES’s Java API. This code snippet is applicable for any type of aggregation that we wish to perform on an index, given that we do not want to fetch any rows as a response.

Performing simple aggregation for a classifier

In this section, I will discuss the process to get results from a given classifier in loklak’s ES index. Here, we will be targeting a class-wise count of rows and stats (average and sum) of probabilities.

Writing AggregationBuilder

An AggregationBuilder for this task will be a Terms AggregationBuilder, which dynamically generates buckets for all the distinct values of a given field in the index –

AggregationBuilder getClassifierAggregationBuilder(String classifierName) {
    String probabilityField = classifierName + "_probability";
    return AggregationBuilders.terms("by_class").field(classifierName)
        .subAggregation(
            AggregationBuilders.avg("avg_probability").field(probabilityField)
        )
        .subAggregation(
            AggregationBuilders.sum("sum_probability").field(probabilityField)
        );
}

[SOURCE]

Here, the name of the aggregation is passed as by_class. This will be used while processing the results of this aggregation task. Also, sub-aggregations are used to get the average and sum of probabilities, by the names avg_probability and sum_probability respectively. There is no need to explicitly request the row count, as this is done by default.

Processing results

Once we have executed the aggregation task and received the SearchResponse as sr (say), we can use the name of the top-level aggregation to get the Terms aggregation object –

Terms aggrs = sr.getAggregations().get("by_class");

After that, we can iterate through the buckets to get results –

for (Terms.Bucket bucket : aggrs.getBuckets()) {
    String key = bucket.getKeyAsString();
    long docCount = bucket.getDocCount();  // Number of rows
    // Use names of sub-aggregations to get results
    Sum sum = bucket.getAggregations().get("sum_probability");
    Avg avg = bucket.getAggregations().get("avg_probability");
    // Do something with key, docCount, sum and avg
}

[SOURCE]

So in this manner, results from aggregation response can be processed.
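For instance, if the endpoint needs to return these numbers as JSON, the buckets could be folded into a JSONObject roughly along these lines (a sketch only; the exact response format of the loklak API is not shown here):

// Collect per-class stats into an org.json JSONObject
JSONObject classStats = new JSONObject();
for (Terms.Bucket bucket : aggrs.getBuckets()) {
    Sum sum = bucket.getAggregations().get("sum_probability");
    Avg avg = bucket.getAggregations().get("avg_probability");
    JSONObject stats = new JSONObject();
    stats.put("count", bucket.getDocCount());
    stats.put("sum_probability", sum.getValue());
    stats.put("avg_probability", avg.getValue());
    classStats.put(bucket.getKeyAsString(), stats);  // e.g. "joy": { ... }
}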

Performing nested aggregations for different countries

The previous section described the process to perform aggregation over a single field. For this section, we’ll aim to get results for each country present in the index given a classifier field.

Writing a nested aggregation builder

To get the required aggregation, the AggregationBuilder from the previous section can be added as a sub-aggregation of an AggregationBuilder for the country code field –

AggregationBuilder aggrs = AggregationBuilders.terms("by_country").field("place_country_code")
    .subAggregation(
        AggregationBuilders.terms("by_class").field(classifierName)
            .subAggregation(
                AggregationBuilders.avg("avg_probability").field(probabilityField)
            )
            .subAggregation(
                AggregationBuilders.sum("sum_probability").field(probabilityField)
            );
    );

[SOURCE]

Processing the results

Again, we can get the results by processing the aggregations by name in a top-to-bottom fashion –

Terms aggrs = response.getAggregations().get("by_country");
for (Terms.Bucket bucket : aggrs.getBuckets()) {
    String countryCode = bucket.getKeyAsString();
    Terms classAggrs = bucket.getAggregations().get("by_class");
    for (Terms.Bucket classBucket : classAggrs.getBuckets()) {
        String key = classBucket.getKeyAsString();
        long docCount = classBucket.getDocCount();
        Sum sum = classBucket.getAggregations().get("sum_probability");
        Avg avg = classBucket.getAggregations().get("avg_probability");
        ...
    }
    ...
}

[SOURCE]

And we have the classifier data for each country present in the index.

Conclusion

This blog post explained Elasticsearch aggregations and their usage in the loklak server project. The changes discussed here were introduced over a series of patches to ElasticsearchClient.java by @singhpratyush (me).

Resources


Lazy loading images in Loklak Search

Loklak Search delivers media-rich content to its users, most of it in the form of images. In earlier versions of loklak search, these images were delivered unconditionally, whether the user needed them or not. This consumed bandwidth and slowed down the initial load of the app, as a large amount of data had to be fetched before the page was ready. Moreover, 404 errors were not handled, giving the feel of a broken UI.

So we required a mechanism to control this loading process and tap into its various aspects to handle the edge cases. This, on the whole, required a few new web standard APIs to enable the smooth working of this feature. These APIs are

  • IntersectionObserver API
  • Fetch API

 

As the details of this feature are involved and build on new API standards, I have divided this into two posts: one with the basics of the above-mentioned APIs, the outline approach to the module and its subcomponents, and the difficulties we faced. The second post will mostly cover the details of the code which went into making this feature and how we tackled the corner cases along the way.

Overview

Our goal here was to create a component which can lazily load images and provide UI feedback to the user in case of any load or parse error. As mentioned above, the development of this feature depends on relatively new web standard APIs, so it is important to understand how these APIs work before seeing how they become an intrinsic part of our LazyImgComponent.

Intersection Observer

Historically, detecting the intersection of two elements on the web in a performant way has been very hard, because it required constantly polling the DOM for the ClientRect of the element we want to check. Since these operations run on the main thread, this kind of polling has always been a source of bottlenecks in application performance.

The Intersection Observer API is a web standard for detecting when two DOM elements intersect with each other. It lets us configure a callback that runs whenever an element, called the target, intersects with another element (the root) or the viewport.

Creating an intersection observer is a simple task: we just have to create a new instance of the observer.

var observer = new IntersectionObserver(callback, options);

Here the callback is the function to run whenever an observed element intersects with the element supplied to the observer. That element is configured in the options object passed to the Intersection Observer.

var options = {
    root: document.querySelector('#root'), // Defaults to viewport if null
    rootMargin: '0px', // The margin around root within which the callback is triggered
    threshold: 1.0
}

The target element whose intersection with the root element is to be tested can be set up using the observe method of the observer.

var target = document.querySelector('#target');
observer.observe(target);

After this setup, whenever the target element intersects with the root element, the callback is fired. This provides an easy, standard mechanism for getting callbacks on intersections.

How this is used in Loklak Search?

Our goal here is to load the images only if required, i.e. we should load the images lazily only if they are in (or near) the viewport. The task of checking whether an element is near the viewport is done in a performant way using the Intersection Observer standard.

Fetch API

The Fetch API provides an interface for fetching resources. It gives us generic Request and Response interfaces, which can be used for streaming responses and for requests from a Service Worker or the Cache API. This is a very lightweight API, providing the flexibility and power to make AJAX requests from anywhere, irrespective of the context of the thread or the worker. It is also a Promise-driven API, thus providing a remedy for callback hell.

Basic fetch requests are really easy to set up; all they require is the path to the resource to fetch, and they return a promise on which we can apply promise chaining to transform the response into the desired form. A very simple example illustrating the Fetch API is as follows.

fetch('someResource.xyz')
    .then(function(response) {
        return response.blob();
    })
    .then(function(respBlob) {
        doSomethingWithBlob(respBlob);
    });

The role of the Fetch API in the lazy loading of images is pretty straightforward, i.e. to actually fetch the images when requested.

Lazy Image Component

We start designing our lazy image component by structuring its API. The API our component provides to the outside world should be similar to the native image element in terms of attributes and methods. So our component’s class should look something like this, with attributes src, alt, title, width and height. Also, we have hooked up a load event which the host can listen to for the loading success or failure of the image.

@Component({
  selector: 'app-lazy-img',
  templateUrl: './lazy-img.component.html',
  styleUrls: ['./lazy-img.component.scss'],
})
export class LazyImgComponent {
  @Input() src: string;
  @Input() width: number;
  @Input() height: number;
  @Input() alt: string;
  @Input() title: string;
  @Output() load: EventEmitter<boolean> = new EventEmitter<boolean>();
}

This basic API provides us with our custom <app-lazy-img> tag which we can use with the attributes, instead of standard <img> element to load the images when needed.

So our component is basically a wrapper around the standard img element to provide lazy loading behaviour.

This is the basic outline of how our component will be working.

  • The component registers with an Intersection observer, which notifies the element when it comes near the viewport.
  • Upon this notification, the component fetches the resource image.
  • When the fetch is complete, the retrieved binary stream is fed to the native image element which eventually renders the image on the screen.

This is the basic setup and working logic behind our lazy image component. In the next post, I will discuss the details of how this logic is achieved inside the LazyImgComponent, and how we solve the problems that arise when a large number of elements are rendered at once in the application.

Resources and Links

  • Intersection observer API
  • Intersection Observer polyfill for the browsers which don’t support Intersection Observer
  • Fetch API documentation
  • Fetch API polyfill for the browsers which don’t support fetch.
  • Loklak Search Repo

Updating Page Titles Dynamically in Loklak Search

Page titles are native to the web platform and are a primary way to identify any page; they have been part of the platform for ages. They tell browsers, web scrapers and search engines about the page content in a word or two. Since titles are used for a wide variety of things, from the presentation of the page and history references to, most importantly, categorization by search engines, it becomes very important for any web application to update the title of the page appropriately. In the earlier implementation of loklak search, the page title was a constant and was never updated, which was not good from a presentation and SEO perspective.

Problem of page titles with SPA

Since loklak search is a single page application, there are a few differences in the page title implementation compared to a server-served multi-page application. In a multi-page application, the whole application is divided into pages and the server knows which page it is serving, so it can easily set the title of the page while rendering the template. A simple example is a base Django template which holds a title block for the application.

<!-- base.html -->

<title>{% block title %} Lokalk Search {% endblock %}</title>

<!-- Other application blocks -->

Now, for any other page extending this base.html, it is very simple to update the title block by simply replacing it with its own title.

<!-- home.html -->

{% extends "base.html" %}

{% block title %} Home Page - Loklak Search {% endblock %}

<!-- Other page blocks -->

When the above template is rendered by the templating engine, it replaces the title block of base.html with the title block specific to the page. Thus, for each page, the server is able to update the page title appropriately at render time.

But in an SPA, the server just acts as a set of REST endpoints, and all the templating is done on the client side. Thus in an SPA the page title never changes automatically from the server, as only the client knows which page (route) it is showing. It therefore becomes the duty of the client side to update the title of the page appropriately, and this issue of static, non-informative page titles is often overlooked.

Updating page titles in Loklak Search

Before solving the issue of updating the page titles, it is important to understand at which points in the application we need to update the page title.

  • Whenever the route in the application changes.
  • Whenever new query is fetched from the server.

These two are the most important places where we definitely want to update the titles. We achieve this using the Angular Title service. The Title service is a platform-browser service from Angular which abstracts the workflow of updating the title. There are two main methods in this service, for getting and setting the title, which can be used to achieve our desired behaviour. What the Title service does is abstract away the extraction of the title node and the getting and setting of its value.

To update the title for each page that is loaded, we simply attach an ngOnInit lifecycle hook to the parent component of that page, and in ngOnInit we use the Title service to update the title accordingly.

@Component({
  selector: 'app-home',
  templateUrl: './home.component.html',
})
export class HomeComponent implements OnInit, OnDestroy {
  constructor(
    private titleService: Title
  ) { }

  ngOnInit() {
    this.titleService.setTitle('Loklak Search');

    // Other initialization attributes and methods
  }
}

Similarly, other pages update the page title according to their context using the Title service. This solves the basic case of updating the title when the actual route path changes; the component’s ngOnInit lifecycle hook is the best place to change the title of the page.


But when the actual route path doesn’t change and we want to update the title according to the searched query, it is not possible to do this using the component’s lifecycle hooks. Fortunately, we are using ngrx effects in our application, which makes this task much simpler. In this situation we hook up a title change effect to the SearchCompleteSuccess and SearchCompleteFail actions, and there we change the title accordingly.

@Effect({ dispatch: false })
resetTitleAfterSearchSuccess$: Observable<void> = this.actions$
    .ofType(apiAction.ActionTypes.SEARCH_COMPLETE_SUCCESS,
            apiAction.ActionTypes.SEARCH_COMPLETE_FAIL)
    .withLatestFrom(this.store$)
    .map(([action, state]) => {
        const displayString = state.query.displayString;
        let title = `${displayString} - Loklak Search`;
        if (action.type === apiAction.ActionTypes.SEARCH_COMPLETE_FAIL) {
            title += ' - No Results';
        }
        this.titleService.setTitle(title);
    });

Now if we look closely, this effect is somewhat different from all the other effects. Firstly, the effect observable is of type void instead of type Action, which is the case with the other effects, and there is a { dispatch: false } argument passed to the decorator. Both these things are important for our reset-title effect. As our effect has no action to dispatch on its execution, the observable is of type void, and we never want to dispatch an effect whose type is not an Action, so we set dispatch to false. The rest of the code for the effect is fairly simple: we filter the actions, taking only the SearchSuccess and SearchFail actions, then we get the latest value of the query display string from the store, and we use our Title service to reset the title accordingly.

Conclusion

Titles are an important part of the web platform and are used by browsers and search engines to present the page and rank its relevance. While developing an SPA it is even more important to maintain an updated title tag, as it is the only thing about the page which actually changes according to its context. In loklak search, the Title service is now used to update the titles of the page according to the search results.

Resources and Links


Search Engine Optimization and Meta Tags in Loklak Search

Ranking higher in search results is very important for any website’s productivity and reach. Modern search engines use algorithms that scrape sites for relevant data points, which are then processed into a relevance number, aka the page ranking. Although today’s algorithms are very advanced and can derive the context of a page from its content, there are still some key points developers should follow to achieve a higher ranking in search results, along with a better presentation of the page in those results. Loklak Search is a web application too, so it is very important to get these crucial points right.

The first thing search engines see on a website is its landing or index page. This page tells the search engine what the data on the site is about, and the search engine then follows links to crawl the details of the site. So the landing page should be able to provide the exact context for the site. The page can provide this context using meta tags, which store metadata about the page that can be used by search engines and social sites.

Meta Charset

  • The first and foremost important tag is the charset tag. It specifies the character set being used on the page and declares the page’s character encoding, which is used by the browser’s decoding algorithm.
  • It is very important to determine a correct and compatible charset for security and presentation reasons.
  • Thus the most widely used charset, UTF-8, is used in loklak search.
<meta charset="utf-8">

Meta Viewport

  • Mobile browsers often load the page in a representative viewport which is usually larger than the actual screen of the device, so that the content of the page is not crammed into a small space.
  • This makes the user pan and zoom around the page to reach the desired content, an approach that is often undesirable for most mobile websites.
  • For this at loklak search we use
<meta name="viewport" content="width=device-width, initial-scale=1">

 

This specifies the relation between CSS pixels and device pixels. Here the relationship is actually computed by the browser itself, but this meta tag says that the width used for calculating that ratio should be equal to the device width and the initial scale should be 1 (i.e. no zoom).

Meta Description

  • The meta description tag is the most important tag for SEO, as this is the description used by search engines and social media sites when showing a summary of the page
<meta name="description"
   content="Search social media on Loklak Search. Our mission is to make the world’s social media information openly accessible and useful generating open knowledge for all">

 

  • This is how the description tag is used by Google in Google Search results

Social Media meta tags

  • The social media meta tags are important for the presentation of the page’s content when a link to it is shared on these sites.
  • As link sharing is fairly simple on social media, a link should always be able to show some context about the page in a visual form without the need to actually click it.
  • For this purpose, Facebook and Twitter read special kinds of meta tags to display information regarding the page whenever a link is shared.
  • For Twitter card data we have used
<!-- Twitter Card data -->
 <meta name="twitter:card" content="summary">
 <meta name="twitter:site" content="@fossasia">
 <meta name="twitter:title" content="Loklak Search">
 <meta name="twitter:description" content="Search social media on Loklak Search. Our mission is to make the world’s social media information openly accessible and useful generating open knowledge for all">
 <meta name="twitter:image" content="http://loklak.net/assets/images/cow_400x466.png">

 

This specifies the card to be of type summary. There are many types of cards provided by Twitter, and they can be found here.

  • Facebook also uses a similar approach but with different meta tag names for its Open Graph API.
 <!-- Open Graph data -->
 <meta property="og:title" content="Loklak Search">
 <meta property="og:type" content="website">
 <meta property="og:url" content="http://loklak.net">
 <meta property="og:image" content="http://loklak.net/assets/images/cow_400x466.png">
 <meta property="og:description" content="Search social media on Loklak Search. Our mission is to make the world’s social media information openly accessible and useful generating open knowledge for all">
 <meta property="og:site_name" content="Loklak Search">

Conclusion

Presentation of content on the web plays a crucial role in the reach and usage of an application. Wider reach and high usage are core goals of the Loklak Search project, so that open data reaches the masses. To achieve this we are slowly moving towards a better sharing experience for the loklak pages using standard techniques, and we hope this will help us reach a wider audience, delivering open content to the masses.

Resources and Links


Making loklak Server’s Kaizen Harvester Extendable

Harvesting strategies in loklak are something that the end users can’t see, but they play a vital role in deciding the way in which messages are collected with loklak. One of the strategies in loklak is defined by the Kaizen Harvester, which generates queries from collected messages.

The original strategy used a simple hash queue which drops queries once it is full. This behaviour is not desirable, as we tend to lose important queries if they come up late while harvesting. To overcome this without losing important search queries, we needed to come up with new harvesting strategies that would provide a better approach. In this blog post, I discuss the changes made to the Kaizen harvester so that it can be extended to create different flavours of harvesters.

What can be different in extended harvesters?

To make the Kaizen harvester extendable, we first needed to decide which parts of the original Kaizen harvester can be changed to make the strategy different (and probably better).

Since one of the most crucial parts of the Kaizen harvester is the way it stores queries to be processed, it was one of the most obvious things to change. Another thing that should be configurable across strategies is the decision of whether to harvest the queries from the query list.

Query storage with KaizenQueries

To allow different methods of storing the queries, the KaizenQueries class was introduced in loklak. It declares the basic methods required for a query storing technique to work. A query storing technique can be any data structure that we can use to store search queries for the Kaizen harvester.

public abstract class KaizenQueries {

     public abstract boolean addQuery(String query);

     public abstract String getQuery();

     public abstract int getSize();

     public abstract int getMaxSize();

     public boolean isEmpty() {
         return this.getSize() == 0;
     }
}

[SOURCE]

Also, a default type of KaizenQueries was introduced for use in the original Kaizen harvester. It provides the same behaviour as the original queue used in the harvester, behind the new interface.
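For illustration, a simple queue-backed implementation of KaizenQueries could look like the following sketch (this is not the exact default implementation used in loklak):

import java.util.LinkedList;
import java.util.Queue;

// Illustrative sketch: a bounded, duplicate-free queue of search queries
public class BasicKaizenQueries extends KaizenQueries {

    private final Queue<String> queue = new LinkedList<>();
    private final int maxSize;

    public BasicKaizenQueries(int maxSize) {
        this.maxSize = maxSize;
    }

    @Override
    public boolean addQuery(String query) {
        if (this.queue.size() >= this.maxSize || this.queue.contains(query)) {
            return false;               // drop the query if the queue is full or it is a duplicate
        }
        return this.queue.offer(query);
    }

    @Override
    public String getQuery() {
        return this.queue.poll();       // null if empty
    }

    @Override
    public int getSize() {
        return this.queue.size();
    }

    @Override
    public int getMaxSize() {
        return this.maxSize;
    }
}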

Another constructor was introduced in the Kaizen harvester which allows setting the KaizenQueries for an instance of its derived classes. It solves the problem of providing a KaizenQueries interface inside the Kaizen harvester which can be used by any inherited strategy –

private KaizenQueries queries = null;

public KaizenHarvester(KaizenQueries queries) {
    ...
    this.queries = queries;
    ...
}

public void someMethod() {
    ...
    this.queries.addQuery(someQuery);
    ...
}

[SOURCE]

With this added, getting or adding new queries becomes simple. We just need to use the getQuery() and addQuery() methods without worrying about the internal implementation.

Configurable decision for harvesting

As mentioned earlier, the decision taken for harvesting should also be configurable. For this, a protected method was implemented and used in harvest() method –

protected boolean shallHarvest() {
    float targetProb = random.nextFloat();
    float prob = 0.5F;
    if (this.queries.getMaxSize() > 0) {
        prob = queries.getSize() / (float)queries.getMaxSize();
    }
    return !this.queries.isEmpty() && targetProb < prob;
}

@Override
public int harvest() {
    if (this.shallHarvest()) {
        return harvestMessages();
    }

    grabSuggestions();

    return 0;
}

[SOURCE]

The shallHarvest() method can be overridden by derived classes to implement any type of harvesting decision they want. For example, we can configure it so that it harvests whenever there are any queries in the queue, or use a different probability distribution instead of a linear one (maybe Gaussian around 1).
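A hypothetical flavour doing exactly that could be sketched as follows (illustrative only; since the queries field of KaizenHarvester is private, the subclass keeps its own reference):

// Sketch of an extended harvester that harvests whenever any query is queued
public class EagerKaizenHarvester extends KaizenHarvester {

    private final KaizenQueries myQueries;

    public EagerKaizenHarvester(KaizenQueries queries) {
        super(queries);
        this.myQueries = queries;   // keep our own reference; the parent's field is private
    }

    @Override
    protected boolean shallHarvest() {
        return !this.myQueries.isEmpty();  // harvest as long as there is work queued
    }
}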

Conclusion

This blog post explained the changes made in loklak’s Kaizen harvester which allow other harvesting strategies to be built on top of it. It discussed the two components changed and how they make it easy to inherit from the original Kaizen harvester. These changes were proposed in PR loklak/loklak_server#1203 by @singhpratyush (me).

Resources


Accessing Child Component’s API in Loklak Search

Loklak search, being an Angular application, is composed of components. Components give us a way to organize the application more consistently, along with the ability to reuse code. Each component has two types of API: public and private. The public API is what the component exposes to the outside world for manipulating its behaviour, while the private API is local to the component and cannot be accessed directly from outside. Now that this distinction is clear, it is important to state why these APIs are needed in loklak search.

Components can never live in isolation, i.e. they have to communicate with their parents to function properly. The same is the case with the components of loklak search: they have to interact with each other to make the application work. So what does this interaction look like?

The rule of thumb here is, data flows down, events flow up. This is the core idea of all the SPA frameworks of modern times, unidirectional data flow, and these interactions can be seen everywhere in loklak search.

<feed-header
   [query]="query"
   (searchEvent)="doSearch($event)"></feed-header>

This is how a simple component’s API looks in loklak search. Here our component is FeedHeader and it exposes some of its API as inputs and outputs.

export class FeedHeaderComponent {

 @Input() query: string;

 @Output() searchEvent: EventEmitter<string> = new EventEmitter<string>();

  // Other methods and properties of the component
}

The FeedHeaderComponent’s class defines some inputs which it takes. These inputs are the data given to the component. Here the input is a simple query property, and the parent, at the time of instantiating the component, passes the value to its child as [query]="query". This enables one direction of the API, from parent to child. We also need a way for the parent to listen to events generated by the child on interaction with the user. For example, here we need a way to tell the parent to perform a search whenever the user presses the search button. For this the Output property searchEvent is used. The search event can be emitted by the child component independently, while the parent, if it wants to listen to the child, simply binds to the event and runs a corresponding function whenever the event is emitted: (searchEvent)="doSearch($event)". Here the event the parent listens to is searchEvent, and whenever such an event is emitted by the child, the function doSearch is run by the parent. This completes the event flow, from child to parent.

It is worth noticing that all these inputs for data and outputs for events are provided by the child component itself. They are the API of the child, and the parent’s job is just to bind to these inputs and outputs in order to pass data and listen to events. This allows component interaction in both directions.

@ViewChild and triggering child’s methods

The inputs are important for carrying data from the parent to the child declaratively, but sometimes it is necessary for the parent to access the public API of its child more directly, especially the API methods that trigger an action. These methods require a way for the parent to access its child component. This is done with the @ViewChild decorator: the parent declares the child component it wants access to as one of its attributes. In our example, the FeedHeaderComponent needs access to its child component SuggestBoxComponent, to show or hide the suggest box as and when required. So the feed header component gets access to its child using the @ViewChild decorator.

export class FeedHeaderComponent {

  @ViewChild('suggestBox') suggestBox: SuggestBoxComponent;

  // Other properties and methods

  toggleSuggestBox() {
    this.suggestBox.toggle();
  }
}

The SuggestBoxComponent here has a public method toggle() which toggles the visibility state of the suggest box. This method is available as a component’s public API method. The parent of this component calls this method using the @ViewChild reference which it grabbed at the time of view instantiation.

export class SuggestBoxComponent {

  private suggestBoxVisible = true;

  public toggle() {
    this.suggestBoxVisible = !this.suggestBoxVisible;
  }
}

Resources and Links

  • Angular document pages
  • Basic Usage in Angular tour of heroes tutorial
  • In depth usage blog for Inputs and Outputs SitePoint Tutorial
  • Loklak Search Repo

Posting Scraped Tweets to Loklak server from Loklak Wok Android

Loklak Wok Android is a peer harvester that posts collected messages to the Loklak Server. The suggestions for tweet searches are fetched using the suggest API endpoint. Using the suggestion queries, tweets are scraped. The scraped tweets are shown in a RecyclerView and simultaneously posted to the loklak server using the push API endpoint. Let’s see how this is implemented.

Adding Dependencies to the project

This feature heavily uses Retrofit2, Reactive extensions (RxJava2, RxAndroid and the Retrofit RxJava adapter) and RetroLambda (for Java lambda support in Android).

In app/build.gradle:

apply plugin: 'com.android.application'
apply plugin: 'me.tatarka.retrolambda'

android {
   ...
   packagingOptions {
       exclude 'META-INF/rxjava.properties'
   }
}

dependencies {
   ...
   compile 'com.google.code.gson:gson:2.8.1'

   compile 'com.squareup.retrofit2:retrofit:2.3.0'
   compile 'com.squareup.retrofit2:converter-gson:2.3.0'
   compile 'com.squareup.retrofit2:adapter-rxjava2:2.3.0'

   compile 'io.reactivex.rxjava2:rxjava:2.0.5'
   compile 'io.reactivex.rxjava2:rxandroid:2.0.1'
}

 

In build.gradle project level:

dependencies {
   classpath 'com.android.tools.build:gradle:2.3.3'
   classpath 'me.tatarka:gradle-retrolambda:3.2.0'
}

 

Implementation

The suggest and push API endpoint is defined in LoklakApi interface

public interface LoklakApi {

   @GET("/api/suggest.json")
   Observable<SuggestData> getSuggestions(@Query("q") String query, @Query("count") int count);

   @POST("/api/push.json")
   @FormUrlEncoded
   Observable<Push> pushTweetsToLoklak(@Field("data") String data);
}

 

The POJOs (Plain Old Java Objects) for suggestions and for posting tweets are generated using jsonschema2pojo; Gson uses these POJOs to convert JSON to Java objects.
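For illustration, the generated POJO for the push response presumably has roughly this shape (only the records field is shown because it is the one used later via push.getRecords(); the real generated class contains more fields):

import com.google.gson.annotations.SerializedName;

// Rough sketch of a jsonschema2pojo-style POJO for the push API response
public class Push {

    @SerializedName("records")
    private int records;        // number of messages accepted by the server

    public int getRecords() {
        return records;
    }
}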

The REST client is created by Retrofit2 and is implemented in the RestClient class. The Gson converter and the RxJava adapter for Retrofit are added in the Retrofit builder. The create method is called to generate the API methods (Retrofit implements the LoklakApi interface).

public class RestClient {

    // Field declarations added here so the snippet is self-contained; the actual
    // values live elsewhere in the real RestClient class (base URL is illustrative).
    private static final String BASE_URL = "https://api.loklak.org/";
    private static final Gson gson = new Gson();
    private static Retrofit sRetrofit;

    private RestClient() {
    }

    private static void createRestClient() {
        sRetrofit = new Retrofit.Builder()
                .baseUrl(BASE_URL)
                // gson converter
                .addConverterFactory(GsonConverterFactory.create(gson))
                // retrofit adapter for rxjava
                .addCallAdapterFactory(RxJava2CallAdapterFactory.create())
                .build();
    }

    private static Retrofit getRetrofitInstance() {
        if (sRetrofit == null) {
            createRestClient();
        }
        return sRetrofit;
    }

    public static <T> T createApi(Class<T> apiInterface) {
        // create method to generate API methods
        return getRetrofitInstance().create(apiInterface);
    }

}

 

The suggestions are fetched by calling getSuggestions after the LoklakApi interface is implemented. getSuggestions returns an Observable of type SuggestData, which contains the suggestions in a List. For scraping tweets, only a single query needs to be passed to LiquidCore, so flatMap is used to transform the observable and the fromIterable operator then emits single queries as strings to LiquidCore, which scrapes the tweets, as implemented in fetchSuggestions.

private Observable<String> fetchSuggestions() {
   LoklakApi loklakApi = RestClient.createApi(LoklakApi.class);
   Observable<SuggestData> observable = loklakApi.getSuggestions("", 2);
   return observable.flatMap(suggestData -> {
       List<Query> queryList = suggestData.getQueries();
       List<String> queries = new ArrayList<>();
       for (Query query : queryList) {
           queries.add(query.getQuery());
       }
       return Observable.fromIterable(queries);
   });
}

 

As LiquidCore uses callbacks to create a connection between the NodeJS instance and Android, to maintain a flow of observables a custom observable is created using the create operator, which encapsulates the callbacks inside it. For a detailed understanding of how LiquidCore event handling works, please go through the example. This is how it is implemented in getScrapedTweets:

private Observable<ScrapedData> getScrapedTweets(final String query) {
   final String LC_TWITTER_URI = "android.resource://org.loklak.android.wok/raw/twitter";
   URI uri = URI.create(LC_TWITTER_URI);

   return Observable.create(emitter -> { // custom observable creation
       EventListener startEventListener = (service, event, payload) -> {
            service.emit(LC_QUERY_EVENT, query);
            service.emit(LC_FETCH_TWEETS_EVENT);
       };

       EventListener getTweetsEventListener = (service, event, payload) -> {
           ScrapedData scrapedData = mGson.fromJson(payload.toString(), ScrapedData.class);
           emitter.onNext(scrapedData); // data emitted using ObservableEmitter
       };

       MicroService.ServiceStartListener serviceStartListener = (service -> {
           service.addEventListener(LC_START_EVENT, startEventListener);
           service.addEventListener(LC_GET_TWEETS_EVENT, getTweetsEventListener);
       });

       MicroService microService = new MicroService(getActivity(), uri, serviceStartListener);
       microService.start();
   });
}

 

Now that we are getting suggestions and using them to get scraped tweets, this needs to be done periodically so that tweets are pushed continuously to the loklak server. For this, the interval operator is used. A List is maintained which contains the suggestion queries based on which tweets are to be scraped. Once scraping is done, a suggestion query is removed from the list when its tweets are displayed in the RecyclerView, and only if the list is empty is a new set of suggestions fetched.

Observable.interval(4, TimeUnit.SECONDS)
       .flatMap(this::getSuggestionsPeriodically)
       .flatMap(query -> {
           mSuggestionQuerries.add(query); // query added to list
           return getScrapedTweets(query);
       });

 

A method reference is used to maintain modularity, so the logic of periodically fetching suggestions is implemented in getSuggestionsPeriodically.

private Observable<String> getSuggestionsPeriodically(Long time) {
   if (mSuggestionQuerries.isEmpty()) { // checks if list is empty
       mInnerCounter = 0;
       return fetchSuggestions(); // new suggestions
   } else { // wait for a previous request to complete
       mInnerCounter++;
       if (mInnerCounter > 3) { // if some strange error occurs
           mSuggestionQuerries.clear();
       }
        return Observable.never(); // no object is emitted to the subscriber, so the subscriber waits
   }
}

 

Now it’s time to display the fetched tweets and then push them to the loklak server. When periodic fetching of suggestions was implemented, we used the interval operator and then flatMap to transform observables, i.e. chaining network requests.

Up to this point, the observables we were creating were cold observables. Cold observables only emit values when a subscription is made. Here we need to display the scraped tweets and then push them, i.e. one source of observables and two (multiple) subscribers. Intuitively, the observable should be subscribed to two times, for example:

Observable observable = Observable.interval(4, TimeUnit.SECONDS)
       .flatMap(this::getSuggestionsPeriodically)
       .flatMap(query -> {
           mSuggestionQuerries.add(query);
           return getScrapedTweets(query);
       });

// first time subscription
observable
       .subscribeOn(Schedulers.io())
       .observeOn(AndroidSchedulers.mainThread())
       .subscribe(
	// display in RecyclerView
       );


// second time subscription
observable
        .flatMap( /* transformations to push data to server */ )
       .subscribeOn(Schedulers.io())
       .observeOn(AndroidSchedulers.mainThread())
       .subscribe(
	// display the number of tweets pushed
       );

 

But the source observable is a cold observable, i.e. it emits objects only when it is subscribed to, due to which there will be two different network calls, one for the first subscription and one for the second. So the two subscriptions will receive different data, which is not what is desired. The expected result is a single network call whose data is both displayed and pushed to the loklak server.

For this, hot Observables are used. Hot observables start emitting objects the moment they are created, irrespective of whether they are subscribed or not.

A cold observable can be converted to a hot observable by using the publish operator, and it starts emitting objects when the connect operator is invoked. This is implemented in displayAndPostScrapedData:

private void displayAndPostScrapedData() {
    ConnectableObservable<ScrapedData> observable = Observable.interval(4, TimeUnit.SECONDS)
            .flatMap(this::getSuggestionsPeriodically)
            .flatMap(query -> {
                mSuggestionQuerries.add(query);
                return getScrapedTweets(query);
            })
            .retry(2)
            .publish();

   // first time subscription to display scraped data
   Disposable viewDisposable = observable
           .subscribeOn(Schedulers.io())
           .observeOn(AndroidSchedulers.mainThread())
           .subscribe(
                   this::displayScrapedData,
                   this::setNetworkErrorView
           );
   mCompositeDisposable.add(viewDisposable);
  
   // second time subscription for pushing data to loklak
   Disposable pushDisposable = observable
           .flatMap(this::pushScrapedData) // scraped data transformed for pushing
           .subscribeOn(Schedulers.io())
           .observeOn(AndroidSchedulers.mainThread())
           .subscribe(
                   push -> {
                       mHarvestedTweets += push.getRecords();
                       harvestedTweetsCountTextView.setText(String.valueOf(mHarvestedTweets));
                   },
                   throwable -> {}
           );
   mCompositeDisposable.add(pushDisposable);

   Disposable publishDisposable = observable.connect(); // hot observable starts emitting
   mCompositeDisposable.add(publishDisposable);
}

 

The two subscriptions are made before the connect operator is invoked because the hot observable starts emitting objects as soon as the network calls succeed, and network calls can’t be made on the MainThread (UI thread). Subscribing beforehand channels the network calls to a background thread.

The scraped data objects are converted to JSON using Gson, the JSON is converted to a string, and then it is posted to the loklak server using the push API endpoint. This is implemented in the pushScrapedData method, which is used in the second subscription via a method reference.

private Observable<Push> pushScrapedData(ScrapedData scrapedData) throws Exception {
    LoklakApi loklakApi = RestClient.createApi(LoklakApi.class);
    List<Status> statuses = scrapedData.getStatuses();
    String data = mGson.toJson(statuses);
    JSONArray jsonArray = new JSONArray(data);
    JSONObject jsonObject = new JSONObject();
    jsonObject.put("statuses", jsonArray);
    return loklakApi.pushTweetsToLoklak(jsonObject.toString());
}

 

Method references to the displayScrapedData and setNetworkErrorView methods are used to display the scraped data and handle unsuccessful network requests.

Only 80 tweets are preserved in the RecyclerView. If the number of tweets exceeds 80, old tweets are removed.

private void displayScrapedData(ScrapedData scrapedData) {
   String query = scrapedData.getQuery();
   List<Status> statuses = scrapedData.getStatuses();
   mSuggestionQuerries.remove(query);
   if (mHarvestedTweetAdapter.getItemCount() > 80) {
       mHarvestedTweetAdapter.clearAdapter(); // old tweets removed
   }
   mHarvestedTweetAdapter.addHarvestedTweets(statuses);
   int count = mHarvestedTweetAdapter.getItemCount() - 1;
   recyclerView.scrollToPosition(count);
}

 

In case of a network error, the visibility of the RecyclerView and the TextView (which shows the number of tweets pushed) is changed to GONE, and a message is displayed indicating a network error.

private void setNetworkErrorView(Throwable throwable) {
   Log.e(LOG_TAG, throwable.toString());
   // recyclerView and TextView showing count of harvested tweets are hidden
   ButterKnife.apply(networkViews, GONE);
   // network error message displayed
   networkErrorTextView.setVisibility(View.VISIBLE);
}

References

Resources


Realm database in Loklak Wok Android for Persistent view

Loklak Wok Android provides suggestions for tweet searches. The suggestions are stored in a local database to provide a persistent view, resulting in a better user experience. The local database used here is the Realm database instead of sqlite3, which is supported by the Android SDK. The proper way to use an sqlite3 database is to first create a contract where the schema of the database is defined, then a database helper class which extends SQLiteOpenHelper where the schema is created, i.e. the tables, and finally a ContentProvider so that you don’t have to write long SQL queries every time a database operation needs to be performed. That is a lot of hard work: it involves many steps, and debugging is difficult. A solution could be an ORM that provides a simple API over sqlite3, but the currently available ORMs lack in performance; they are too slow. A reliable solution to this problem is the Realm database, which is faster than raw sqlite3 and has a really simple API for database operations. This blog explains the use of the Realm database for storing tweet search suggestions.

Adding Realm database to Android project

In project level build.gradle

buildscript {
   repositories {
       jcenter()
   }
   dependencies {
       classpath 'com.android.tools.build:gradle:2.3.3'
       classpath "io.realm:realm-gradle-plugin:3.3.1"

       // NOTE: Do not place your application dependencies here; they belong
       // in the individual module build.gradle files
   }
}

 

And at the top of app/build.gradle, apply plugin: 'realm-android' is added.

Using Realm Database

Let’s start with a simple example. We have a Student class that has only two attributes, name and age. To create the model for the database, the Student class simply extends RealmObject.

public class Student extends RealmObject {

    private String name;
    private int age;

    // Realm requires a public no-argument constructor in model classes
    public Student() {
    }

    public Student(String name, int age) {
        this.name = name;
        this.age = age;
    }

    // getters and setters
}

 

To push data to the database, Java objects are created, a transaction is initialized, then the copyToRealm method is used to push the data, and finally the transaction is committed. But before all this, the database is initialized and a Realm instance is obtained.

Realm.init(context); // Database initialized
Realm realm = Realm.getDefaultInstance(); // realm instance obtained
      
Student student = new Student("Rahul Dravid", 22); // Simple java object created
realm.beginTransaction(); // initialization of transaction
realm.copyToRealm(student); // pushed to database
realm.commitTransaction(); // transaction committed

 

copyToRealm takes only a single parameter; the parameter can be an object or an Iterable. Of course, the passed objects should extend RealmObject. A List of Student objects can be passed to copyToRealm to push multiple records into the database, as shown below.
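For example, several objects can be persisted in a single transaction (a small illustrative snippet):

// Inserting several objects at once: copyToRealm also accepts an Iterable
List<Student> students = Arrays.asList(
        new Student("Rahul Dravid", 22),
        new Student("Anil Kumble", 24));

realm.beginTransaction();
realm.copyToRealm(students);   // all objects are persisted in one transaction
realm.commitTransaction();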

The above way of inserting data is synchronous. Realm also supports asynchronous transactions; you guessed it right, you don’t have to depend on AsyncTaskLoader. The same operation can be performed asynchronously as

realm.executeTransactionAsync(new Realm.Transaction() {
    @Override
    public void execute(Realm realm) {
        Student student = new Student("Rahul Dravid", 22);
        realm.copyToRealm(student);
    }
});

 

Now, querying the database is as easy as inserting.

RealmResults<Student> studentList = realm.where(Student.class).findAll();

 

No transaction is required, as we are not manipulating the database; data is just read from it. RealmResults extends Java’s List, so List methods which don’t manipulate the list can be used on studentList, e.g. get(int index) to obtain the object at an index.

The result can also be a filtered one, for example filtering students who are 22 years old.

RealmResults<Student> studentList = realm.where(Student.class).equalTo("age", 22).findAll();

 

Now, removing data from the database: the deleteAllFromRealm method can be executed on the obtained RealmResults, or, to completely remove the data of a model class, the delete(Model.class) method is invoked on the Realm instance. The operations should be enclosed between beginTransaction and commitTransaction if synchronous behaviour is required; for asynchronous behaviour, the operations are done in the execute method of an anonymous Realm.Transaction object, as above.

studentList.deleteAllFromRealm(); // removes the filtered result
realm.delete(Student.class); // removes all data of model class

 

Storing Tweet Search Suggestions in Loklak Wok Android for Persistent view

Loklak Wok Android uses Retrofit2 for sending network requests, for which POJO classes are already created so that it becomes easy to parse the JSON obtained from network requests. Because of this, using the Realm database becomes even easier: the defined POJOs can simply extend RealmObject to become model classes for the database, e.g. the Query class extends RealmObject, and one of its attributes is the suggestion query, i.e. mQuery.

The database is initialized in LoklakWokApplication, the application class. This way the database is initialized only once and persists throughout the app lifecycle.

@Override
public void onCreate() {
   super.onCreate();
   Realm.init(this);
   RealmConfiguration realmConfiguration = new RealmConfiguration.Builder()
           .name(Realm.DEFAULT_REALM_NAME)
           .deleteRealmIfMigrationNeeded()
           .build();
   Realm.setDefaultConfiguration(realmConfiguration);
}

 

deleteRealmIfMigrationNeeded removes the old Realm database and creates a new one if any of the model classes change, i.e. an attribute is removed, added or simply renamed. This is done because we are not storing user-generated data; we are just using the database to provide a persistent view. So the previously kept data is not important.

The database is closed in the onTerminate callback of the application.

@Override
public void onTerminate() {
   Realm.getDefaultInstance().close();
   super.onTerminate();
}

 

Now that the database is initialized, we fetch the previously stored data and display it in the RecyclerView. If the network request is successful, the queries from the database are replaced by the queries fetched in the network request; otherwise the queries from the database are displayed, providing a persistent view. This is how it is implemented in onCreateView of SuggestFragment.

mRealm = Realm.getDefaultInstance();
...
// old queries obtained from database
RealmResults<Query> queryRealmResults = mRealm.where(Query.class).findAll();
List<Query> queries = mRealm.copyFromRealm(queryRealmResults);
// RecyclerView adapter created with old queries
mSuggestAdapter = new SuggestAdapter(queries, this);
tweetSearchSuggestions.setLayoutManager(new LinearLayoutManager(getActivity()));
tweetSearchSuggestions.setAdapter(mSuggestAdapter);

 

The old queries are replaced in onSuccessfulRequest

private void onSuccessfulRequest(SuggestData suggestData) {
   // suggestData contains suggestion queries
   if (suggestData != null) {
       // old queries replaced with new ones
       mSuggestAdapter.setQueries(suggestData.getQueries());
   }
   setAfterRefreshingState();
}

 

Now the suggestion queries need to be inserted into the database. Only the latest suggestions are inserted, i.e. the queries present when the onStop lifecycle method of the fragment is called, and as the previous queries are not needed anymore, they are deleted. The operation is performed synchronously.

@Override
public void onStop() {
   ...
   mRealm.beginTransaction();
   // old queries deleted
   mRealm.delete(Query.class);
   // new queries inserted
   mRealm.copyToRealm(mSuggestAdapter.getQueries());
   mRealm.commitTransaction();
   // fragment lifecycle called i.e. a new fragment/activity opens
   super.onStop();
}

 

Conclusion: Sqlite3 and Realm comparison

Operation        Sqlite3           Realm
Table creation   CREATE TABLE …    extends RealmObject
Inserting data   INSERT INTO …     copyToRealm
Searching data   SELECT …          realm.where(Model.class)
Deleting data    DELETE FROM …     realmResults.deleteAllFromRealm() or realm.delete(Model.class)

Resources:

 

 
