Introducing Priority Kaizen Harvester for loklak server
In the previous blog post, I discussed the changes made in loklak’s Kaizen harvester so it could be extended and other harvesting strategies could be introduced. Those changes made it possible to introduce a new harvesting strategy as PriorityKaizen harvester which uses a priority queue to store the queries that are to be processed. In this blog post, I will be discussing the process through which this new harvesting strategy was introduced in loklak. Background, motivation and approach Before jumping into the changes, we first need to understand that why do we need this new harvesting strategy. Let us start by discussing the issue with the Kaizen harvester. The produce consumer imbalance in Kaizen harvester Kaizen uses a simple hash queue to store queries. When the queue is full, new queries are dropped. But numbers of queries produced after searching for one query is much higher than the consumption rate, i.e. the queries are bound to overflow and new queries that arrive would get dropped. (See loklak/loklak_server#1156) Learnings from attempt to add blocking queue for queries As a solution to this problem, I first tried to use a blocking queue to store the queries. In this implementation, the producers would get blocked before putting the queries in the queue if it is full and would wait until there is space for more. This way, we would have a good balance between consumers and producers as the consumers would be waiting until producers can free up space for them - public class BlockingKaizenHarvester extends KaizenHarvester { ... public BlockingKaizenHarvester() { super(new KaizenQueries() { ... private BlockingQueue<String> queries = new ArrayBlockingQueue<>(maxSize); @Override public boolean addQuery(String query) { if (this.queries.contains(query)) { return false; } try { this.queries.offer(query, this.blockingTimeout, TimeUnit.SECONDS); return true; } catch (InterruptedException e) { DAO.severe("BlockingKaizen Couldn't add query: " + query, e); return false; } } @Override public String getQuery() { try { return this.queries.take(); } catch (InterruptedException e) { DAO.severe("BlockingKaizen Couldn't get any query", e); return null; } } ... }); } } [SOURCE, loklak/loklak_server#1210] But there is an issue here. The consumers themselves are producers of even higher rate. When a search is performed, queries are requested to be appended to the KaizenQueries instance for the object (which here, would implement a blocking queue). Now let us consider the case where queue is full and a thread requests a query from the queue and scrapes data. Now when the scraping is finished, many new queries are requested to be inserted to most of them get blocked (because the queue would be full again after one query getting inserted). Therefore, using a blocking queue in KaizenQueries is not a good thing to do. Other considerations After the failure of introducing the Blocking Kaizen harvester, we looked for other alternatives for storing queries. We came across multilevel queues, persistent disk queues and priority queues. Multilevel queues sounded like a good idea at first where we would have multiple queues for storing queries. But eventually, this would just boil down to how…
