Caching Elasticsearch Aggregation Results in loklak Server
To provide aggregated data for various classifiers, loklak uses Elasticsearch aggregations. Aggregated data conveys far more about a dataset than a few individual instances can. But performing these aggregations on every request is very resource consuming, so we needed a way to reduce this load.
In this post, I will be discussing how I came up with a caching model for the aggregated data from the Elasticsearch index.
Fields to Consider while Caching
At the classifier endpoint, aggregations can be requested based on the following fields –
- Classifier Name
- Classifier Classes
- Countries
- Start Date
- End Date
But to cache results, we can ignore the cases where only a few classes or countries are requested, and instead store aggregations for all of them. So the fields that define the cache entry to look for are –
- Classifier Name
- Start Date
- End Date
Type of Cache
The data structure used for caching is Java’s HashMap. It maps a string key, built as described in the next section, to a wrapper object around the cached data, discussed in a later section.
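A minimal sketch of how this map might be declared inside the servlet (the field name cacheMap matches the snippets below; the exact declaration details are an assumption) –

import java.util.HashMap;
import java.util.Map;

// Maps the composed string key (described below) to a cached, expirable result
private static final Map<String, JSONObjectWrapper> cacheMap = new HashMap<>();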
Key
The key is built using the fields mentioned previously –
private static String getKey(String index, String classifier, String sinceDate, String untilDate) {
    return index + "::::" + classifier + "::::"
        + (sinceDate == null ? "" : sinceDate) + "::::"
        + (untilDate == null ? "" : untilDate);
}
In this way, we can serve a request for every class there is without running the expensive aggregation job every time. The key for such a request is the same as for any other request over the same index, classifier and date range, since class and country are not part of the key.
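For example (the index and classifier names here are purely illustrative), the class and country filters never enter the key, so requests differing only in those filters share one cache entry –

// Illustrative values: class and country filters are not arguments to getKey,
// so a request for all classes and a request for a single class hit the same entry
String key = getKey("messages", "emotion", "2017-06-01", "2017-06-30");
// key is "messages::::emotion::::2017-06-01::::2017-06-30"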
Value
The object used as the value in the HashMap is a wrapper containing the following –
- json – It is a JSONObject containing the actual data.
- expiry – It is the time, in milliseconds since the Unix epoch, at which the object expires.
class JSONObjectWrapper {
    private JSONObject json;
    private long expiry;
    ...
}
Timeout
The timeout associated with a cache entry is defined in the configuration file of the project as “classifierservlet.cache.timeout”. It defaults to 5 minutes (300000 ms) and is used to set the expiry of a cached JSONObject –
class JSONObjectWrapper {
    ...
    private static long timeout = DAO.getConfig("classifierservlet.cache.timeout", 300000);

    JSONObjectWrapper(JSONObject json) {
        this.json = json;
        this.expiry = System.currentTimeMillis() + timeout;
    }
    ...
}
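The lookup code below also relies on an isExpired method on the wrapper; a minimal sketch of what it could look like (the exact implementation is an assumption) –

boolean isExpired() {
    // The entry is stale once the current time passes the stored expiry timestamp
    return System.currentTimeMillis() > this.expiry;
}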
Cache Hit
To search the cache, the key described earlier is composed from the parameters of the user’s request. Checking for a cache hit can then be done in the following manner –
String key = getKey(index, classifier, sinceDate, untilDate);
if (cacheMap.containsKey(key)) {
    JSONObjectWrapper jw = cacheMap.get(key);
    if (!jw.isExpired()) {
        // Cache hit: do something with jw
    }
}
// Cache miss or stale entry: calculate the aggregations
...
But since jw here contains the data for all classes and countries, we would need to filter out the ones that were not requested.
Filtering Results
To retain only the information the user requested, we can perform a simple pass over the cached result and exclude the parts that are not needed.
Since the number of fields to filter out, i.e. classes and countries, is small, this pass is not resource intensive. At the same time, it saves us from running a heavy aggregation task for every request.
Since the data about classes is nested inside the respective country field, we need to perform two levels of filtering –
JSONObject retJson = new JSONObject(true);
for (String key : json.keySet()) {
    JSONArray value = filterInnerClasses(json.getJSONArray(key), classes);
    if ("GLOBAL".equals(key) || countries.contains(key)) {
        retJson.put(key, value);
    }
}
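Here, filterInnerClasses performs the inner, class-level pass. Its implementation is not shown in this post; a minimal sketch, assuming each array element is a JSONObject with a "class" field naming its class (both the structure and the field name are assumptions), could look like this –

import java.util.List;
import org.json.JSONArray;
import org.json.JSONObject;

// Hypothetical helper: keeps only the entries whose class is among the requested ones
private static JSONArray filterInnerClasses(JSONArray classData, List<String> classes) {
    JSONArray filtered = new JSONArray();
    for (int i = 0; i < classData.length(); i++) {
        JSONObject entry = classData.getJSONObject(i);
        // An empty filter means every class was requested
        if (classes.isEmpty() || classes.contains(entry.optString("class"))) {
            filtered.put(entry);
        }
    }
    return filtered;
}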
Cache Miss
In the case of a cache miss, helper functions in ElasticsearchClient.java are called to get the results. These results are then converted from a HashMap to a JSONObject and stored in the cache for future use.
JSONObject freshCache = getFromElasticsearch(index, classifier, sinceDate, untilDate);
cacheMap.put(key, new JSONObjectWrapper(freshCache));
The getFromElasticsearch method finds all the possible classes and calls the appropriate method in ElasticsearchClient, fetching data for all classes and all countries.
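Putting the hit and miss paths together, the overall flow could be composed as follows (this composition, along with the getJson accessor and the filterResults helper, is an assumption for illustration, not the verbatim servlet code) –

import java.util.List;
import org.json.JSONObject;

// Hypothetical composition of the lookup, miss, and filtering steps
JSONObject getAggregations(String index, String classifier, String sinceDate, String untilDate,
                           List<String> countries, List<String> classes) {
    String key = getKey(index, classifier, sinceDate, untilDate);
    JSONObjectWrapper jw = cacheMap.get(key);
    if (jw == null || jw.isExpired()) {
        // Miss (or stale entry): run the expensive aggregation once
        // and cache the full result for subsequent requests
        JSONObject freshCache = getFromElasticsearch(index, classifier, sinceDate, untilDate);
        jw = new JSONObjectWrapper(freshCache);
        cacheMap.put(key, jw);
    }
    // Serve the requested subset from the cached superset
    return filterResults(jw.getJson(), countries, classes);
}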
Conclusion
In this blog post, I discussed the need for caching of aggregations and the way it is achieved in the loklak server. This feature was introduced in pull request loklak/loklak_server#1333 by @singhpratyush (me).
Resources
- Using Elasticsearch Aggregations to Analyse Classifier Data in loklak Server – https://blog.fossasia.org/using-elasticsearch-aggregations-to-analyse-classifier-data-in-loklak-server/
- High-Cardinality Memory Implications in Elasticsearch aggregations – https://www.elastic.co/guide/en/elasticsearch/guide/current/aggregations-and-analysis.html#_high_cardinality_memory_implications
- An Introduction to Elasticsearch Aggregations – https://qbox.io/blog/elasticsearch-aggregations
- Classifier servlet in loklak server – https://github.com/loklak/loklak_server/blob/development/src/org/loklak/api/aggregation/ClassifierServlet.java