Multithreading implementation in Loklak Server
Loklak Server is a near-realtime system. It performs a large number of tasks, many of which are costly in terms of resources. Its basic function is to scrape data from websites and output it at the endpoint. In addition to scraping, it also has to perform other tasks such as refining and cleaning the data. For this reason, multiple threads are instantiated. They perform tasks such as:
- Refining data and extracting more data
The fetched data needs to be cleaned and refined before it is output. Some examples are:
a) Removal of HTML tags from tweet text:
After the text is extracted from the HTML data and fed to a TwitterTweet object, threads run concurrently to remove all HTML from the text.
b) Unshortening of URL links:
Shortened URLs embedded in the tweet text can be used to track users. To prevent this, a thread is instantiated to unshorten the links concurrently while the tweet text is being cleaned.
- Indexing all JSON output data to ElasticSearch
While the JSON output is being produced, a method in Timeline.java indexes the data to ElasticSearch.
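The two refinement steps described above can be sketched as two tasks submitted to a small thread pool. This is only an illustration, not Loklak's code: the class name `TweetRefinementSketch`, the regex-based tag removal, and the `KNOWN_REDIRECTS` table (a stand-in for following real HTTP redirects) are all assumptions made for the example.

```java
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class TweetRefinementSketch {

    // Hypothetical stand-in for following HTTP redirects; a real unshortener
    // would issue network requests instead of consulting a fixed table.
    static final Map<String, String> KNOWN_REDIRECTS =
            Map.of("https://short.example/abc", "https://example.org/full-article");

    // Naive tag removal: drop anything that looks like <...>, then collapse whitespace.
    static String stripHtml(String raw) {
        return raw.replaceAll("<[^>]*>", "").replaceAll("\\s+", " ").trim();
    }

    // Return the redirect target if one is known, otherwise the link unchanged.
    static String unshorten(String url) {
        return KNOWN_REDIRECTS.getOrDefault(url, url);
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            // Both refinement steps are submitted at once and run concurrently.
            Future<String> text = pool.submit(() -> stripHtml("<p>Hello <b>world</b></p>"));
            Future<String> link = pool.submit(() -> unshorten("https://short.example/abc"));

            // get() blocks until the corresponding task has finished.
            System.out.println(text.get());
            System.out.println(link.get());
        } finally {
            pool.shutdown();
        }
    }
}
```

A regex is enough for a sketch, but it is not a robust way to remove HTML; production code would more likely rely on an HTML parser.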
Managing multithreading
To manage multithreading, Loklak Server uses the following:
1. ExecutorService
An ExecutorService object is used to handle large numbers of threads. A fixed-size pool caps how many threads exist at once, which helps keep the JVM from running out of resources, and it reuses its worker threads, which reduces the cost of thread creation. It also gives control over the lifecycle of the threads. A good example of ExecutorService usage is in TwitterScraper:
```java
// ...
public class TwitterScraper {

    // Fixed pool: at most 40 threads run at a time
    public static final ExecutorService executor = Executors.newFixedThreadPool(40);

    // ...

        // Feed the TwitterTweet object with data
        TwitterTweet tweet = new TwitterTweet(
                user.getScreenName(),
                Long.parseLong(tweettimems.value),
                props.get("tweettimename").value,
                props.get("tweetstatusurl").value,
                props.get("tweettext").value,
                Long.parseLong(tweetretweetcount.value),
                Long.parseLong(tweetfavouritecount.value),
                imgs, vids, place_name, place_id,
                user, writeToIndex, writeToBackend
        );

        // Start a thread to refine the TwitterTweet data
        if (tweet.willBeTimeConsuming()) {
            executor.execute(tweet);
        }

        // ...
```
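Since the excerpt above depends on the rest of the scraper, here is a self-contained sketch of the same pattern: a fixed thread pool, a task that decides whether it is worth dispatching, and an orderly shutdown. `RefineTask`, its link-based heuristic, and the choice to run cheap tasks on the calling thread are assumptions for illustration; the excerpt does not show what Loklak does when `willBeTimeConsuming()` returns false.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class FixedPoolSketch {

    // Counts finished tasks; atomic because several worker threads update it.
    static final AtomicInteger processed = new AtomicInteger();

    // Hypothetical task type standing in for TwitterTweet.
    static class RefineTask implements Runnable {
        private final String text;

        RefineTask(String text) { this.text = text; }

        // Illustrative heuristic only: treat text containing a link as expensive.
        boolean willBeTimeConsuming() { return text.contains("http"); }

        @Override
        public void run() { processed.incrementAndGet(); }
    }

    // Runs every task, using the pool only for the expensive ones,
    // and returns how many tasks completed.
    static int processAll(String[] texts, int maxThreads) {
        processed.set(0);
        ExecutorService executor = Executors.newFixedThreadPool(maxThreads);
        for (String t : texts) {
            RefineTask task = new RefineTask(t);
            if (task.willBeTimeConsuming()) {
                executor.execute(task);   // queued; runs on one of maxThreads workers
            } else {
                task.run();               // assumption: cheap work stays on the caller
            }
        }
        executor.shutdown();              // accept no new tasks, finish queued ones
        try {
            if (!executor.awaitTermination(10, TimeUnit.SECONDS)) {
                executor.shutdownNow();
            }
        } catch (InterruptedException e) {
            executor.shutdownNow();
            Thread.currentThread().interrupt();
        }
        return processed.get();
    }

    public static void main(String[] args) {
        String[] texts = {"plain text", "see http://example.org", "more http://example.org/a"};
        System.out.println(processAll(texts, 2));
    }
}
```

Tasks submitted beyond the pool size wait in the executor's queue instead of spawning additional threads, which is what bounds resource usage.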
2. Basic Thread class
The Thread class can be used instead of ExecutorService in cases where resources are not a constraint. However, ExecutorService is generally the recommended choice because of its benefits. A Thread can also be implemented as an anonymous class.
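As a minimal sketch of the anonymous-class form mentioned above (the class and method names here are invented for the example and are not taken from Loklak):

```java
class AnonymousThreadSketch {

    // Starts one background thread, waits for it, and returns what it produced.
    static String runInBackground() {
        StringBuilder result = new StringBuilder();

        // Anonymous subclass of Thread: the work goes in the overridden run().
        Thread worker = new Thread() {
            @Override
            public void run() {
                result.append("done on ").append(getName());
            }
        };
        worker.setName("worker-1");
        worker.start();               // start() spawns the thread; calling run() would not

        try {
            worker.join();            // wait for completion; also makes the write visible
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return result.toString();
    }

    public static void main(String[] args) {
        System.out.println(runInBackground());
    }
}
```

Each such Thread is created and discarded for a single piece of work, which is acceptable for occasional tasks but is the cost a pooled ExecutorService avoids.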
3. Runnable interface
The Runnable interface can be implemented by an anonymous class, or by a named class that does more than run a single task concurrently. In Loklak Server, TwitterScraper concurrently indexes the data to ElasticSearch, unshortens links, and cleans the data. See the TwitterScraper class for the implementation.
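A named Runnable that carries its own state and performs several steps in `run()` might look like the following sketch. The `RefineAndStore` class and the in-memory list used in place of a search index are hypothetical and only illustrate the shape of such a class.

```java
import java.util.ArrayList;
import java.util.List;

class MultiStepRunnableSketch {

    // Hypothetical task: one Runnable that performs several steps in order.
    static class RefineAndStore implements Runnable {
        private final String rawText;
        private final List<String> store;   // stand-in for a search index

        RefineAndStore(String rawText, List<String> store) {
            this.rawText = rawText;
            this.store = store;
        }

        @Override
        public void run() {
            // Step 1: clean the text.
            String cleaned = rawText.replaceAll("<[^>]*>", "").trim();
            // Step 2: store it; ArrayList is not thread-safe, so guard the write.
            synchronized (store) {
                store.add(cleaned);
            }
        }
    }

    // Runs one RefineAndStore per input on its own thread and waits for all of them.
    static List<String> refineAll(List<String> rawTexts) {
        List<String> store = new ArrayList<>();
        List<Thread> threads = new ArrayList<>();
        for (String raw : rawTexts) {
            Thread t = new Thread(new RefineAndStore(raw, store));
            threads.add(t);
            t.start();
        }
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return store;
    }

    public static void main(String[] args) {
        // The order of the stored entries depends on thread scheduling.
        System.out.println(refineAll(List.of("<b>first</b>", " second ")));
    }
}
```

Because the task is a plain Runnable, the same object could be passed to `ExecutorService.execute` instead of `new Thread(...)` without any change to the class.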
Resources:
- Loklak Server: https://github.com/loklak/loklak_server
- ExecutorService Class: https://docs.oracle.com/javase/7/docs/api/java/util/concurrent/ExecutorService.html
- MultiThreading: https://en.wikipedia.org/wiki/Multithreading_(computer_architecture)
- RedirectUnshortener: https://github.com/loklak/loklak_server/blob/0f055ea6d2d768ea13b29c6fee20ab95902d70ab/src/org/loklak/harvester/RedirectUnshortener.java
- Threads vs ExecutorService: https://stackoverflow.com/questions/26938210/executorservice-vs-casual-thread-spawner