Using Elasticsearch Aggregations to Analyse Classifier Data in loklak Server
Loklak uses Elasticsearch to index Tweets and other social media entities. It also houses a classifier that classifies Tweets based on emotion, profanity and language. But earlier, this data was available only with the search API and there was no way to get aggregated data out of it. So as a part of my GSoC project, I proposed to introduce a new API endpoint which would allow users to access aggregated data from these classifiers.
In this blog post, I will be discussing how aggregations are performed on the Elasticsearch index of Tweets in the loklak server.
Structure of index
The ES index for Twitter is called messages and it has 3 fields related to classifiers –
- classifier_emotion
- classifier_language
- classifier_profanity
With each of these classifiers, we also have a probability attached which represents the confidence of the classifier for assigned class to a Tweet. The name of these fields is given by suffixing the emotion field by _probability (e.g. classifier_emotion_probability).
Since I will also be discussing aggregation based on countries in this blog post, there is also a field named place_country_code which saves the ISO 3166-1 alpha-2 code for the country of creation of Tweet.
Requesting aggregations using Elasticsearch Java API
Elasticsearch comes with a simple Java API which can be used to perform any desired task. To work with data, we need an ES client which can be built from a ES Node (if creating a cluster) or directly as a transport client (if connecting remotely to a cluster) –
// Transport client TransportClient tc = TransportClient.builder() .settings(mySettings) .build(); // From a node Node elasticsearchNode = NodeBuilder.nodeBuilder() .local(false).settings(mySettings) .node(); Client nc = elasticsearchNode.client();
[SOURCE]
Once we have a client, we can use ES AggregationBuilder to get aggregations from an index –
SearchResponse response = elasticsearchClient.prepareSearch(indexName) .setSearchType(SearchType.QUERY_THEN_FETCH) .setQuery(QueryBuilders.matchAllQuery()) // Consider every row .setFrom(0).setSize(0) // 0 offset, 0 result size (do not return any rows) .addAggregation(aggr) // aggr is a AggregatoinBuilder object .execute().actionGet(); // Execute and get results
[SOURCE]
AggregationBuilders are objects that define the properties of an aggregation task using ES’s Java API. This code snippet is applicable for any type of aggregation that we wish to perform on an index, given that we do not want to fetch any rows as a response.
Performing simple aggregation for a classifier
In this section, I will discuss the process to get results from a given classifier in loklak’s ES index. Here, we will be targeting a class-wise count of rows and stats (average and sum) of probabilities.
Writing AggregationBuilder
An AggregationBuilder for this task will be a Terms AggregationBuilder which would dynamically generate buckets for all the different values of fields for a given field in index –
AggregationBuilder getClassifierAggregationBuilder(String classifierName) { String probabilityField = classifierName + "_probability"; return AggregationBuilders.terms("by_class").field(classifierName) .subAggregation( AggregationBuilders.avg("avg_probability").field(probabilityField) ) .subAggregation( AggregationBuilders.sum("sum_probability").field(probabilityField) ); }
[SOURCE]
Here, the name of aggregation is passed as by_class. This will be used while processing the results for this aggregation task. Also, sub-aggregation is used to get average and sum probability by the name of avg_probability and sum_probability respectively. There is no need to specify to count rows as this is done by default.
Processing results
Once we have executed the aggregation task and received the SearchResponse as sr (say), we can use the name of top level aggregation to get Terms aggregation object –
Terms aggrs = sr.getAggregations().get("by_class");
After that, we can iterate through the buckets to get results –
for (Terms.Bucket bucket : aggrs.getBuckets()) { String key = bucket.getKeyAsString(); long docCount = bucket.getDocCount(); // Number of rows // Use name of sub aggregations to get results Sum sum = bucket.getAggregations().get("sum_probability"); Avg avg = bucket.getAggregations().get("avg_probability"); // Do something with key, docCount, sum and avg }
[SOURCE]
So in this manner, results from aggregation response can be processed.
Performing nested aggregations for different countries
The previous section described the process to perform aggregation over a single field. For this section, we’ll aim to get results for each country present in the index given a classifier field.
Writing a nested aggregation builder
To get the aggregation required, AggregationBuilder from previous section can be added as a sub-aggregation for the AggregationBuilder for country code field –
AggregationBuilder aggrs = AggregationBuilders.terms("by_country").field("place_country_code") .subAggregation( AggregationBuilders.terms("by_class").field(classifierName) .subAggregation( AggregationBuilders.avg("avg_probability").field(probabilityField) ) .subAggregation( AggregationBuilders.sum("sum_probability").field(probabilityField) ); );
[SOURCE]
Processing the results
Again, we can get the results by processing the AggregationBuilders by name in a top-to-bottom fashion –
Terms aggrs = response.getAggregations().get("by_country"); for (Terms.Bucket bucket : aggr.getBuckets()) { String countryCode = bucket.getKeyAsString(); Terms classAggrs = bucket.getAggregations().get("by_class"); for (Terms.Bucket classBucket : classAggr.getBuckets()) { String key = classBucket.getKeyAsString(); long docCount = classBucket.getDocCount(); Sum sum = classBucket.getAggregations().get("sum_probability"); Avg avg = classBucket.getAggregations().get("avg_probability"); ... } ... }
[SOURCE]
And we have the data about classifier for each country present in the index.
Conclusion
This blog post explained about Elasticsearch aggregations and their usage in the loklak server project. The changes discussed here were introduced over a series of patches to ElasticsearchClient.java by @singhpratyush (me).
Resources
- Elasticsearch aggregations with Java API – https://www.elastic.co/guide/en/elasticsearch/client/java-api/current/java-aggs.html.
- Classifier servlet in loklak server – https://github.com/loklak/loklak_server/blob/development/src/org/loklak/api/aggregation/ClassifierServlet.java.
- Counting in ElasticSearch – http://tech.opentable.co.uk/blog/2013/09/11/counting-in-elastic-search/
- An Introduction to Elasticsearch Aggregations – https://qbox.io/blog/elasticsearch-aggregations