Developing LoklakWordCloud app for Loklak apps site

LoklakWordCloud is an app that visualises data returned by Loklak in the form of a word cloud. The app is presently hosted on the Loklak apps site. Word clouds provide a simple yet effective way to analyse and visualise data. This app allows users to create a word cloud out of Twitter data via the Loklak API. The app is presently at a very early stage of development and more work is left to be done. It consists of an input field where the user can enter a query word; on pressing the search button, a word cloud is generated from the words related to the entered query. The Loklak API is used to fetch all the tweets which contain the query word, and these tweets are processed to generate the word cloud.

Related issue: https://github.com/fossasia/apps.loklak.org/pull/279
Live app: http://apps.loklak.org/LoklakWordCloud/

Developing the app

The main challenge in developing this app is implementing its prime feature: generating the word cloud. How do we get a dynamic word cloud which can be easily generated from the word the user has entered? Here comes in JQCloud, an awesome lightweight jQuery plugin for generating word clouds. All we need to do is provide a list of words along with their weights.

Let us see step by step how this app (first version) works. First we require all the tweets which contain the entered word. For this we use the Loklak search service. Once we get all the tweets, we can parse the tweet bodies to create a list of words along with their frequencies.

```javascript
var url = "http://35.184.151.104/api/search.json?callback=JSON_CALLBACK&count=100&q=" + query;
$http.jsonp(url)
    .then(function (response) {
        $scope.createWordCloudData(response.data.statuses);
        $scope.tweet = null;
    });
```

Once we have all the tweets, we need to extract the tweet texts and create a list of valid words. What are valid words?
Words like 'the', 'is', 'a', 'for', 'of' and 'then' do not provide us with any important information and will not help us in any kind of analysis, so there is no use including them in our word cloud. Such words are called stop words, and we need to get rid of them. For this we are using a list of commonly used stop words; such lists can very easily be found on the internet. Here is the list which we are using. Once we are able to extract the text from the tweets, we filter out stop words and insert the valid words into a list.

```javascript
tweet = data[i];
tweetWords = tweet.text.replace(", ", " ").split(" ");
for (var j = 0; j < tweetWords.length; j++) {
    word = tweetWords[j];
    word = word.trim();
    if (word.startsWith("'") || word.startsWith('"') || word.startsWith("(") || word.startsWith("[")) {
        word = word.substring(1);
    }
    if (word.endsWith("'") || word.endsWith('"') || word.endsWith(")") || word.endsWith("]") ||
        word.endsWith("?") || word.endsWith(".")) {
        word = word.substring(0, word.length - 1);
    }
    if (stopwords.indexOf(word.toLowerCase()) !== -1) {
        continue;
    }
    if (word.startsWith("#") || word.startsWith("@")) {
        continue;
    }
    if (word.startsWith("http") || word.startsWith("https")) {
        continue;
    }
    $scope.filteredWords.push(word);
}
```
…
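The excerpt stops before the frequency-counting step, but once the filtered word list is built, JQCloud needs `{text, weight}` pairs. A minimal sketch of that step, assuming a flat word list as input (the helper name `countFrequencies` is hypothetical, not from the app's code):

```javascript
// Hypothetical helper (not the app's actual code): turn a flat list of
// filtered words into JQCloud's expected [{text, weight}] format.
function countFrequencies(filteredWords) {
    var counts = {};
    filteredWords.forEach(function (word) {
        var key = word.toLowerCase();
        counts[key] = (counts[key] || 0) + 1;
    });
    return Object.keys(counts).map(function (word) {
        return { text: word, weight: counts[word] };
    });
}

var cloudData = countFrequencies(["Loklak", "API", "loklak", "tweets"]);
// e.g. one entry per distinct word, "loklak" carrying weight 2
```

The resulting list can be handed straight to JQCloud as its word array.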


Advanced functionality in SUSI Tweetbot

SUSI AI is integrated with Twitter (blog). During the initial phase, the SUSI Tweetbot had a basic UI and functionalities like plain-text replies. Twitter provides many more features, like quick replies (presenting the user with some choices to choose from) or visiting the SUSI server repository by just clicking buttons during the chat. All these features enhance the user experience with our chatbot on Twitter. This blog post walks you through adding these functionalities to the SUSI Tweetbot:

- Quick replies
- Buttons

Quick replies: This feature provides options for the user to choose from. The user doesn't need to type the next query but can rather select a quick reply from the available options. This speeds up the process and makes it easy for the user. It also lets developers know all the possible queries which can come next from the user, which helps in coding efficient handlers for those queries.

In the SUSI Tweetbot this feature is used to welcome a new user to SUSI AI's chat window, as shown in the image above. The user can select either of the options "Get started" and "Start chatting". The "Get started" option is basically an introduction of SUSI AI to the user, while "Start chatting", when clicked, shows the user what queries they can try.

Let's come to the code part: how to show these options, and what happens when a user selects one of them. To show the welcome message, we call the SUSI API with the query string "Welcome" and store the reply in the message variable.
The code snippet used:

```javascript
var queryUrl = 'http://api.susi.ai/susi/chat.json?q=Welcome';
var message = '';
request({ url: queryUrl, json: true }, function (err, response, data) {
    if (!err && response.statusCode === 200) {
        message = data.answers[0].actions[0].expression;
    } else {
        // handle error
    }
});
```

To show options with the message:

```javascript
var msg = {
    "welcome_message": {
        "message_data": {
            "text": message,
            "quick_reply": {
                "type": "options",
                "options": [
                    { "label": "Get started", "metadata": "external_id_1" },
                    { "label": "Start chatting", "metadata": "external_id_2" }
                ]
            }
        }
    }
};

T.post('direct_messages/welcome_messages/new', msg, sent);
```

The line T.post() makes a POST request to the Twitter API to register the welcome message with Twitter for our chatbot. The return value of this request includes a welcome message id corresponding to this welcome message. We then set up a welcome message rule using that id; setting up the rule makes this welcome message the default one shown to new users. Twitter also provides custom welcome messages, information about which can be found in the official docs. The welcome message rule is set up by sending the welcome message id as a key in the request body:

```javascript
var welcomeId = data.welcome_message.id;
var welcomeRule = {
    "welcome_message_rule": {
        "welcome_message_id": welcomeId
    }
};

T.post('direct_messages/welcome_messages/rules/new', welcomeRule, sent);
```

Now we are all set to show new users a welcome message. Buttons: Let's go a bit…
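When the user taps one of the registered options, Twitter delivers the choice back to the bot with the `metadata` string of the selected option, so the bot can branch on it. A sketch of such a dispatcher, assuming the two metadata ids registered above (the handler itself and the reply texts are hypothetical, not from the bot's code):

```javascript
// Hypothetical handler: route a user's quick-reply choice by its metadata.
// The metadata strings match the options registered in the welcome message.
function handleQuickReply(metadata) {
    switch (metadata) {
        case "external_id_1": // "Get started"
            return "SUSI AI is an open-source personal assistant by FOSSASIA.";
        case "external_id_2": // "Start chatting"
            return "Try asking: 'tell me a joke' or 'what is the weather?'";
        default:
            return "Sorry, I didn't get that.";
    }
}

var reply = handleQuickReply("external_id_1");
```

The returned text would then be sent back as a direct message, just like any other SUSI reply.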


Scraping Concurrently with Loklak Server

At present, the SearchScraper in Loklak Server uses numerous threads to scrape the Twitter website. The fetched data is cleaned and more data is extracted from it. But scraping only Twitter underuses the system's capacity. Concurrent scraping of other websites like Quora, Youtube, Github, etc. can be added to diversify the application. In this way, the single endpoint search.json can serve multiple services. As this feature is under refinement, we will discuss only the basic structure of the system with the new changes. I tried to implement a more abstract way of scraping by:

1) Fetching the input data in SearchServlet

Instead of selecting individual input get-parameters and referencing them, the complete Map object is now referenced, which makes it possible to add more functionality based on the input get-parameters. The dataArray object (a JSONArray) is fetched from the DAO.scrapeLoklak method and is embedded in the output with the key results.

```java
// start a scraper
inputMap.put("query", query);
DAO.log(request.getServletPath() + " scraping with query: " + query + " scraper: " + scraper);
dataArray = DAO.scrapeLoklak(inputMap, true, true);
```

2) Scraping the selected scrapers concurrently

In DAO.java, the useful get-parameters of inputMap are fetched and cleaned. They are used to choose the scrapers that shall be run, using the getScraperObjects() method.

```java
Timeline2.Order order = getOrder(inputMap.get("order"));
Timeline2 dataSet = new Timeline2(order);
List<String> scraperList = Arrays.asList(inputMap.get("scraper").trim().split("\\s*,\\s*"));
```

Threads are created to fetch data from the different scrapers according to the size of the list of scraper objects fetched. The input map is passed as an argument to the scrapers, so that they can read further get-parameters related to them and shape their output accordingly.
```java
List<BaseScraper> scraperObjList = getScraperObjects(scraperList, inputMap);
ExecutorService scraperRunner = Executors.newFixedThreadPool(scraperObjList.size());
try {
    for (BaseScraper scraper : scraperObjList) {
        scraperRunner.execute(() -> {
            dataSet.mergePost(scraper.getData());
        });
    }
} finally {
    scraperRunner.shutdown();
    try {
        scraperRunner.awaitTermination(24L, TimeUnit.HOURS);
    } catch (InterruptedException e) {
    }
}
```

3) Fetching the selected scraper objects in DAO.java

Here a variable of the abstract class BaseScraper (the superclass of all search scrapers) is used to build the list of scrapers to run. Each scraper's constructor is fed the input map so it can scrape accordingly.

```java
List<BaseScraper> scraperObjList = new ArrayList<BaseScraper>();
BaseScraper scraperObj = null;
if (scraperList.contains("github") || scraperList.contains("all")) {
    scraperObj = new GithubProfileScraper(inputMap);
    scraperObjList.add(scraperObj);
}
// ...
```

References:
- Best practices of multithreading in Java: https://stackoverflow.com/questions/17018507/java-multithreading-best-practice
- ExecutorService vs casual thread spawner: https://stackoverflow.com/questions/26938210/executorservice-vs-casual-thread-spawner
- Basic data structures in Java: https://www.eduonix.com/blog/java-programming-2/learn-to-implement-data-structures-in-java/
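The fan-out/merge pattern in step 2 — run every selected scraper concurrently, then merge all their results into one data set — is language-agnostic. Here is an analogous sketch in JavaScript with stand-in scraper functions (none of this is loklak code; it only illustrates the pattern):

```javascript
// Analogous fan-out/merge sketch: run several scraper functions
// concurrently and merge their result lists into one data set.
function runScrapers(scrapers) {
    return Promise.all(scrapers.map(function (scrape) { return scrape(); }))
        .then(function (results) {
            // Merge each scraper's posts into a single flat list.
            return results.reduce(function (merged, posts) {
                return merged.concat(posts);
            }, []);
        });
}

// Stand-in scrapers that resolve immediately with fake posts.
runScrapers([
    function () { return Promise.resolve(["tweet1", "tweet2"]); },
    function () { return Promise.resolve(["video1"]); }
]).then(function (dataSet) {
    // dataSet contains all three posts merged together
});
```

In the Java version, the ExecutorService plays the role of `Promise.all` and `Timeline2.mergePost` plays the role of the reducer.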


Sharing Images on Twitter from Phimpme Android App Using twitter4j

Sharing an image to a social media platform is an important feature in Phimpme Android. In my previous blog, I explained how to authenticate the Android application with Twitter. In this blog, I will discuss how to upload an image directly to Twitter from the application after successfully logging in to Twitter.

First we check whether the application is authenticated with Twitter. When the application is successfully authenticated, Twitter issues a token which tells the application that it is connected to Twitter. In LoginActivity.java the function isActive returns a boolean value: true if the Twitter token was successfully issued, false otherwise.

```java
public static boolean isActive(Context ctx) {
    SharedPreferences sharedPrefs = ctx.getSharedPreferences(AppConstant.SHARED_PREF_NAME, Context.MODE_PRIVATE);
    return sharedPrefs.getString(AppConstant.SHARED_PREF_KEY_TOKEN, null) != null;
}
```

We call the isActive function from the LoginActivity class to check whether the application is authenticated with Twitter. We call it before using the share function in sharingActivity:

```java
if (LoginActivity.isActive(context)) {
    try {
        // Send Image function
    } catch (Exception ex) {
        Toast.makeText(context, "ERROR", Toast.LENGTH_SHORT).show();
    }
}
```

We have saved the image in the internal storage of the device and use saveFilePath for the path of the saved image. In Phimpme we use the HelperMethod class, where our share function resides; while the image is being shared, an alert dialog box with a spinner pops up on the screen.

Sending the image to the HelperMethod class

First, we need to get the image and convert it into a Bitmap. Since an image captured by the phone camera is usually too large to upload quickly, we need to compress the Bitmap first. BitmapFactory.decodeFile(filePath) is used to fetch the file and convert it into a bitmap.
To write the data we use a FileOutputStream pointing at the path of the file (the image in this case). The Bitmap.compress method is used to compress the image to the desired quality and format; in Phimpme we convert it to PNG.

```java
Bitmap bmp = BitmapFactory.decodeFile(saveFilePath);
String filename = Environment.getExternalStorageDirectory().toString() + File.separator + "1.png";
Log.d("BITMAP", filename);
FileOutputStream out = new FileOutputStream(saveFilePath);
bmp.compress(Bitmap.CompressFormat.PNG, 90, out);
HelperMethods.postToTwitterWithImage(context, ((Activity) context), saveFilePath, caption, new HelperMethods.TwitterCallback() {
    @Override
    public void onFinsihed(Boolean response) {
        mAlertBuilder.dismiss();
        Snackbar.make(parent, R.string.tweet_posted_on_twitter, Snackbar.LENGTH_LONG).show();
    }
});
```

Post image function

To post the image on Twitter we use the ConfigurationBuilder class. We create a new object of the class and then attach the Twitter consumer key, consumer secret key, Twitter access token, and token secret. setOAuthConsumerKey() sets the consumer key, which is generated by Twitter when creating the application in the Twitter development environment. Similarly, setOAuthConsumerSecret() sets the consumer secret key. The token generated after successfully connecting to Twitter is specified via setOAuthAccessToken(), and the token secret via setOAuthAccessTokenSecret().
```java
ConfigurationBuilder configurationBuilder = new ConfigurationBuilder();
configurationBuilder.setOAuthConsumerKey(context.getResources().getString(R.string.twitter_consumer_key));
configurationBuilder.setOAuthConsumerSecret(context.getResources().getString(R.string.twitter_consumer_secret));
configurationBuilder.setOAuthAccessToken(LoginActivity.getAccessToken(context));
configurationBuilder.setOAuthAccessTokenSecret(LoginActivity.getAccessTokenSecret(context));
Configuration configuration = configurationBuilder.build();
final Twitter twitter = new TwitterFactory(configuration).getInstance();
```

Sending the image to Twitter: The image is uploaded to Twitter using the StatusUpdate class specified in the Twitter4j API. Pass the image…


Integrating Twitter Authentication using Twitter4j in Phimpme Android Application

We have used the Twitter4j API to authenticate Twitter in the Phimpme application. Below are the steps for setting up the Twitter4j API in Phimpme and logging in to Twitter from the Phimpme Android application.

Setting up the environment

Download the Twitter4j package from http://twitter4j.org/en/. For sharing images we only need the twitter4j-core-3.0.5.jar and twitter4j-media-support-3.0.5.jar files. Copy these files into the libs folder of the application. Go to build.gradle and add the following to the dependencies:

```gradle
dependencies {
    compile files('libs/twitter4j-core-3.0.5.jar')
    compile files('libs/twitter4j-media-support-3.0.5.jar')
}
```

Adding the Phimpme application on the Twitter development page

Go to https://dev.twitter.com/ -> My apps -> Create new app. A "Create an application" window opens where we have to fill in all the necessary details about the application; all fields are mandatory. If you are making an Android application, anything can be filled in the website field (for example www.google.com), but it must not be left empty. After filling in all the details, click on the "Create your Twitter application" button.

Adding the Twitter consumer key and secret key

This generates the Twitter consumer key and Twitter secret key, which we need to add to our strings.xml file:

```xml
<string name="twitter_consumer_key">ry1PDPXM6rwFVC1KhQ585bJPy</string>
<string name="twitter_consumer_secret">O3qUqqBLinr8qrRvx3GXHWBB1AN10Ax26vXZdNlYlEBF3vzPFt</string>
```

Twitter authentication

Make a new Java class, say LoginActivity, where we first fetch the Twitter consumer key and Twitter secret key.
```java
private static Twitter twitter;
private static RequestToken requestToken;

@Override
protected void onCreate(Bundle savedInstanceState) {
    super.onCreate(savedInstanceState);
    setContentView(R.layout.activity_twitter_login);
    twitterConsumerKey = getResources().getString(R.string.twitter_consumer_key);
    twitterConsumerSecret = getResources().getString(R.string.twitter_consumer_secret);
```

We are using a web view to interact with the Twitter login page.

```java
twitterLoginWebView = (WebView) findViewById(R.id.twitterLoginWebView);
twitterLoginWebView.setBackgroundColor(Color.TRANSPARENT);
twitterLoginWebView.setWebViewClient(new WebViewClient() {
    @Override
    public boolean shouldOverrideUrlLoading(WebView view, String url) {
        if (url.contains(AppConstant.TWITTER_CALLBACK_URL)) {
            Uri uri = Uri.parse(url);
            LoginActivity.this.saveAccessTokenAndFinish(uri);
            return true;
        }
        return false;
    }
```

If the access token is already saved, the user is already signed in; otherwise the app sends the Twitter consumer key and secret key to gain an access token. The ConfigurationBuilder class is used to set the consumer key and consumer secret key.

```java
ConfigurationBuilder configurationBuilder = new ConfigurationBuilder();
configurationBuilder.setOAuthConsumerKey(twitterConsumerKey);
configurationBuilder.setOAuthConsumerSecret(twitterConsumerSecret);
Configuration configuration = configurationBuilder.build();
twitter = new TwitterFactory(configuration).getInstance();
```

This is followed by the following Runnable thread, which checks whether the request token has been received. If authentication fails, an error Toast message pops up.
```java
new Thread(new Runnable() {
    @Override
    public void run() {
        try {
            requestToken = twitter.getOAuthRequestToken(AppConstant.TWITTER_CALLBACK_URL);
        } catch (Exception e) {
            final String errorString = e.toString();
            LoginActivity.this.runOnUiThread(new Runnable() {
                @Override
                public void run() {
                    mAlertBuilder.cancel();
                    Toast.makeText(LoginActivity.this, errorString, Toast.LENGTH_SHORT).show();
                    finish();
                }
            });
            return;
        }
        LoginActivity.this.runOnUiThread(new Runnable() {
            @Override
            public void run() {
                twitterLoginWebView.loadUrl(requestToken.getAuthenticationURL());
            }
        });
    }
}).start();
```

Conclusion

Twitter4j offers seamless integration of Twitter into any application, making authentication easier without leaving the actual application. Further, it is used to upload photos to Twitter directly from the Phimpme Android application and to fetch the profile picture and username.

Github: https://github.com/fossasia/phimpme-android

Resources:
- To create an app: https://dev.twitter.com/
- To download the Twitter4j package: http://twitter4j.org/en/
- Youtube tutorial: https://www.youtube.com/watch?v=_IsBi3cpvio
- Blog post: http://www.theappguruz.com/blog/android-twitter-integration-tutorial


Visualising Tweet Statistics in MultiLinePlotter App for Loklak Apps

The MultiLinePlotter app is now a part of the Loklak apps site. This app can be used to compare aggregations of tweets containing particular query words and visualise the data for better comparison. Recently a new feature was added to the app: tweet statistics such as the maximum number of tweets containing the given query word (along with the date) and the average number of tweets over a period of time. These statistics are visualised for all the query words for better comparison.

Related issue: https://github.com/fossasia/apps.loklak.org/issues/236

Obtaining the maximum and average number of tweets

Before visualising the statistics we need to obtain them. For this we simply process the aggregations returned by the Loklak API. Let us start with the maximum number of tweets containing the given keyword. What we actually require is the highest number of tweets posted on a single day that contained the user-given keyword, and the date on which that maximum occurred. For this we can use a function which iterates over all the aggregations and returns the largest value along with its date.

```javascript
$scope.getMaxTweetNumAndDate = function (aggregations) {
    var maxTweetDate = null;
    var maxTweetNum = -1;
    for (date in aggregations) {
        if (aggregations[date] > maxTweetNum) {
            maxTweetNum = aggregations[date];
            maxTweetDate = date;
        }
    }
    return { date: maxTweetDate, count: maxTweetNum };
}
```

The above function maintains two variables, one for the maximum number of tweets and another for the date. We iterate over all the aggregations, and for each aggregation we compare the number of tweets with the value stored in maxTweetNum. If the current value is larger, we update the variable and keep track of the date. Finally we return an object containing both the maximum number of tweets and the corresponding date. Next we need to obtain the average number of tweets.
We can do this by summing up all the tweet frequencies and dividing by the number of aggregations.

```javascript
$scope.getAverageTweetNum = function (aggregations) {
    var sum = 0;
    for (date in aggregations) {
        sum += aggregations[date];
    }
    return parseInt(sum / Object.keys(aggregations).length);
}
```

The above function calculates the average number of tweets in the way described before the snippet. Next, for every query we need to store these values in a format which can easily be understood by morris.js. For this we use a list that stores the statistics of the individual query words as objects, which is later passed as a parameter to Morris.

```javascript
var maxStat = $scope.getMaxTweetNumAndDate(aggregations);
var avg = $scope.getAverageTweetNum(aggregations);
$scope.tweetStat.push({
    tweet: $scope.tweet,
    maxTweetCount: maxStat.count,
    maxTweetOn: maxStat.date,
    averageTweetsPerDay: avg,
    aggregationsLength: Object.keys(aggregations).length
});
```

We maintain a list called tweetStat whose objects store the query word and the corresponding values. Apart from plotting these statistics, the app also displays them when the user clicks on an individual record in the search record section. For this we filter the tweetStat list mentioned above, get the object corresponding to the query word the user selected, and bind it to angular…
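The two statistics helpers can be exercised standalone, outside the Angular scope, with a small sample aggregation (the sample data below is invented for illustration):

```javascript
// Standalone versions of the two statistics helpers shown above.
function getMaxTweetNumAndDate(aggregations) {
    var maxTweetDate = null;
    var maxTweetNum = -1;
    for (var date in aggregations) {
        if (aggregations[date] > maxTweetNum) {
            maxTweetNum = aggregations[date];
            maxTweetDate = date;
        }
    }
    return { date: maxTweetDate, count: maxTweetNum };
}

function getAverageTweetNum(aggregations) {
    var sum = 0;
    for (var date in aggregations) {
        sum += aggregations[date];
    }
    return parseInt(sum / Object.keys(aggregations).length);
}

// Sample aggregation: date -> tweet count.
var sample = { "2017-07-03": 3, "2017-07-04": 9, "2017-07-05": 12 };
var max = getMaxTweetNumAndDate(sample); // { date: "2017-07-05", count: 12 }
var avg = getAverageTweetNum(sample);    // (3 + 9 + 12) / 3 = 8
```

This makes it easy to sanity-check the computations before wiring them into the scope.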


Developing MultiLinePlotter App for Loklak

MultiLinePlotter is a web application which uses the Loklak API under the hood to plot tweet aggregations for several user-provided query words in the same graph. The user can give several query words, and multiple lines for the different queries are plotted in the same graph. In this way users can compare the tweet distribution for various keywords and visualise the comparison. All the searched queries are shown under the search record section. Clicking on a record opens a dialogue box displaying the individual tweets related to that query word. Users can also remove a series from the plot dynamically by pressing the Remove button beside the query word in the record section. The app is presently hosted on the Loklak apps site.

Related issue: https://github.com/fossasia/apps.loklak.org/issues/225

Getting started with the app

Let us delve into the working of the app. The app uses the Loklak aggregation API to get the data. A call to the API looks something like this:

http://api.loklak.org/api/search.json?q=fossasia&source=cache&count=0&fields=created_at

A small snippet of the aggregation returned by the above API request is shown below.

```json
"aggregations": {
    "created_at": {
        "2017-07-03": 3,
        "2017-07-04": 9,
        "2017-07-05": 12,
        "2017-07-06": 8
    }
}
```

The API provides a nice date vs. number-of-tweets aggregation. Now we need to plot this. For plotting, morris.js has been used: a lightweight JavaScript library for visualising data. One of the main features of this app is the dynamic addition and removal of multiple series from the graph. How do we achieve that? This can be done by manipulating the morris.js data list whenever a new query is made. Let us understand this in steps. At first, the data is fetched using the Angular $http service.
```javascript
$http.jsonp('http://api.loklak.org/api/search.json?callback=JSON_CALLBACK',
    {params: {q: $scope.tweet, source: 'cache', count: '0', fields: 'created_at'}})
    .then(function (response) {
        $scope.getData(response.data.aggregations.created_at);
        $scope.plotData();
        $scope.queryRecords.push($scope.tweet);
    });
```

Once we get the data, the getData function is called and the aggregation data is passed to it. The query word is also stored in the queryRecords list for future use. In order to plot a line graph, morris.js requires a data object containing the values for each series. An example of such a data object is given below.

```javascript
data: [
    { x: '2006', a: 100, b: 90 },
    { x: '2007', a: 75, b: 65 },
    { x: '2008', a: 50, b: 40 },
    { x: '2009', a: 75, b: 65 }
],
```

For every 'x', 'a' and 'b' will be plotted, so two lines will be drawn. Our app also maintains a data list like the one shown above; however, in our case the data objects have a variable number of keys. One key determines the 'x' value and the other keys determine the ordinates (number of tweets). All the data objects present in the data list need to be updated whenever a new search is done. The getData function does this for us.

```javascript
var value = $scope.tweet;
for (date in aggregations) {
    var present = false;
    for (var i = 0;
```
…
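The excerpt cuts off inside getData, but the idea it describes — every data object gains one key per query word, with 0 filled in where a date has no tweets for that query — can be sketched independently. The following is an illustrative sketch under that assumption, not the app's actual code (function and variable names are invented):

```javascript
// Illustrative sketch (not the app's exact getData): merge a new query's
// aggregations into a morris.js data list, one key per query word.
function mergeSeries(data, aggregations, queryWord, knownQueries) {
    // Add the new query's counts to existing rows, or create new rows.
    for (var date in aggregations) {
        var row = data.find(function (d) { return d.x === date; });
        if (!row) {
            row = { x: date };
            // New rows start at zero for every previously known query.
            knownQueries.forEach(function (q) { row[q] = 0; });
            data.push(row);
        }
        row[queryWord] = aggregations[date];
    }
    // Dates missing from this query's aggregations get a zero.
    data.forEach(function (row) {
        if (row[queryWord] === undefined) {
            row[queryWord] = 0;
        }
    });
    knownQueries.push(queryWord);
    return data;
}

var data = [];
var known = [];
mergeSeries(data, { "2017-07-03": 3, "2017-07-04": 9 }, "fossasia", known);
mergeSeries(data, { "2017-07-04": 5 }, "susi", known);
// data now has one row per date, each carrying a fossasia and a susi count
```

Removing a series then amounts to deleting that query's key from every row, which is why the dynamic add/remove behaviour falls out of this data layout naturally.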


Integration of SUSI AI in Twitter

We will be making a SUSI messenger bot on Twitter. The messenger bot will tweet back to your tweets and reply instantly when you chat with it. Feel free to tweet to the already created SUSI AI account (mentioning @SusiAI1 in the tweet) and follow it to have a personal chat. Make a new account which you want to use as the bot account; you can make one from the sign-up option at https://www.twitter.com.

Prerequisites

Create your accounts on:
1. Twitter
2. Github
3. Heroku
4. Node js

Set up your own messenger bot

1. Make a new app here, to get the access token and other properties for our application. These properties will help us communicate with Twitter. Click the "modify the app permissions" link, as shown here. Select the Read, Write and Access direct messages option. Don't forget to click the update settings button at the bottom.

2. Click the Generate My Access Token and Token Secret button.

3. Create a new heroku app here. This app will accept the requests from Twitter and the SUSI API.

4. Create a config variable by switching to the settings page of your app. The name of your first config variable should be HEROKU_URL and its value is the URL of the heroku app created by you. The other config variables that need to be created correspond, in the same order, to:
   i) Access token
   ii) Access token secret
   iii) Consumer key
   iv) Consumer secret
   We need to visit our app from here; the keys and access tokens tab will give us the values of these variables.

Let's start with the code part of the integration of SUSI AI into Twitter. We will be using Node js to achieve this integration.
First we need to require some packages. Then, using the Twit module, we authenticate our requests with the environment variables set up in step 4. Now let's create a user stream:

```javascript
var stream = T.stream('user');
```

We use the capabilities of this stream to catch events such as being tweeted at, being followed, or receiving a direct message:

```javascript
stream.on('tweet', functionToBeCalledWhenTweeted);
stream.on('follow', functionToBeCalledWhenFollowed);
stream.on('direct_message', functionToBeCalledWhenDirectMessaged);
```

So, when a person tweets to our account, we can catch it with the 'tweet' event and execute a set of instructions:

```javascript
stream.on('tweet', tweetEvent);

function tweetEvent(eventMsg) {
    var replyto = eventMsg.in_reply_to_screen_name;
    // to store the message tweeted, excluding the '@SusiAI1' substring
    var text = eventMsg.text.substring(9);
    // to store the name of the tweeter
    var from = eventMsg.user.screen_name;
    if (replyto === 'SusiAI1') {
        var queryUrl = 'http://api.asksusi.com/susi/chat.json?q=' + encodeURI(text);
        var message = '';
        request({ url: queryUrl, json: true }, function (err, response, data) {
            if (!err && response.statusCode === 200) {
                // fetching the answer from the data object returned
                message = data.answers[0].actions[0].expression;
            } else {
                message = 'Oops, Looks like Susi is taking a break';
                console.log(err);
            }
            console.log(message);
            // If the message length is more than the tweet limit
            if (message.length > 140) {
                tweetIt('@'
```
…
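The excerpt ends where the reply exceeds Twitter's 140-character tweet limit. One way to handle that case is to split the reply into tweet-sized chunks, each prefixed with the recipient's handle. A sketch under that assumption (the splitting helper is hypothetical; only tweetIt and the 140 limit appear in the post):

```javascript
// Hypothetical helper: split a long reply into tweet-sized chunks,
// leaving room for the "@username " prefix added to each tweet.
function splitForTweets(message, username, limit) {
    var prefix = '@' + username + ' ';
    var chunkSize = limit - prefix.length;
    var chunks = [];
    for (var i = 0; i < message.length; i += chunkSize) {
        chunks.push(prefix + message.substring(i, i + chunkSize));
    }
    return chunks;
}

var parts = splitForTweets(new Array(301).join('a'), 'someuser', 140);
// every part fits within the 140-character limit
```

Each chunk could then be passed to tweetIt in sequence, so no part of SUSI's answer is lost.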


Introducing Priority Kaizen Harvester for loklak server

In the previous blog post, I discussed the changes made in loklak's Kaizen harvester so it could be extended and other harvesting strategies could be introduced. Those changes made it possible to introduce a new harvesting strategy, the PriorityKaizen harvester, which uses a priority queue to store the queries that are to be processed. In this blog post, I will discuss the process through which this new harvesting strategy was introduced in loklak.

Background, motivation and approach

Before jumping into the changes, we first need to understand why we need this new harvesting strategy. Let us start by discussing the issue with the Kaizen harvester.

The producer-consumer imbalance in the Kaizen harvester

Kaizen uses a simple hash queue to store queries. When the queue is full, new queries are dropped. But the number of queries produced after searching for one query is much higher than the consumption rate, i.e. the queries are bound to overflow and newly arriving queries get dropped. (See loklak/loklak_server#1156)

Learnings from the attempt to add a blocking queue for queries

As a solution to this problem, I first tried to use a blocking queue to store the queries. In this implementation, producers get blocked before putting queries into a full queue and wait until there is space for more. This way, we would have a good balance between consumers and producers, as the consumers would wait until the producers free up space for them:

```java
public class BlockingKaizenHarvester extends KaizenHarvester {
    ...
    public BlockingKaizenHarvester() {
        super(new KaizenQueries() {
            ...
            private BlockingQueue<String> queries = new ArrayBlockingQueue<>(maxSize);

            @Override
            public boolean addQuery(String query) {
                if (this.queries.contains(query)) {
                    return false;
                }
                try {
                    this.queries.offer(query, this.blockingTimeout, TimeUnit.SECONDS);
                    return true;
                } catch (InterruptedException e) {
                    DAO.severe("BlockingKaizen Couldn't add query: " + query, e);
                    return false;
                }
            }

            @Override
            public String getQuery() {
                try {
                    return this.queries.take();
                } catch (InterruptedException e) {
                    DAO.severe("BlockingKaizen Couldn't get any query", e);
                    return null;
                }
            }
            ...
        });
    }
}
```

[SOURCE, loklak/loklak_server#1210]

But there is an issue here. The consumers are themselves producers, at an even higher rate: when a search is performed, queries are requested to be appended to the KaizenQueries instance for the object (which here implements a blocking queue). Now consider the case where the queue is full and a thread takes a query from the queue and scrapes data. When the scraping is finished, many new queries are requested to be inserted, and most of them get blocked (because the queue becomes full again after a single query is inserted). Therefore, using a blocking queue in KaizenQueries is not a good thing to do.

Other considerations

After the failure of introducing the blocking Kaizen harvester, we looked for other alternatives for storing queries. We came across multilevel queues, persistent disk queues and priority queues. Multilevel queues sounded like a good idea at first: we would have multiple queues for storing queries. But eventually, this would just boil down to how…

Continue Reading: Introducing Priority Kaizen Harvester for loklak server

Fetching URL for Embedded Twitter Videos in loklak server

The primary web service that loklak scrapes is Twitter. Being a news and social networking service, Twitter allows its users to post videos directly to Twitter, and these often convey more than text can. But for an automated scraper, getting the links is not a simple task. Let us see what problems we faced with videos and how we solved them in the loklak server project.

Previous setup and embedded videos

In the previous version of loklak server, the TwitterScraper searched for videos in 2 ways -

Youtube links
HTML5 video links

To fetch the video URL from an HTML5 video, the following snippet was used -

if ((p = input.indexOf("<source video-src")) >= 0 && input.indexOf("type=\"video/") > p) {
    String video_url = new prop(input, p, "video-src").value;
    videos.add(video_url);
    continue;
}

Here, input is the current line from the raw HTML that is being processed, and prop is a class defined in loklak that is useful in parsing HTML attributes. So in this way, the HTML5 videos were extracted.

The Problem - Embedded videos

Though the previous setup had no issues, it was useless, as Twitter embeds the videos in an iFrame and therefore they can’t be fetched using simple HTML5 tag extraction. For a Tweet with an embedded video, the HTML requested from the search page contains the video in the following format -

<iframe src="https://twitter.com/i/videos/tweet/881946694413422593?embed_source=clientlib&player_id=0&rpc_init=1" allowfullscreen="" id="player_tweet_881946694413422593" style="width: 100%; height: 100%; position: absolute; top: 0; left: 0;">

So we needed to come up with a better technique to get those videos.

Parsing video URL from iFrame

The <div> which contains the video is marked with the AdaptiveMedia-videoContainer class. So if a Tweet has an iFrame containing a video, it will also have the mentioned class. Also, the source of the iFrame is of the form https://twitter.com/i/videos/tweet/{Tweet-ID}. So now we can programmatically go to any Tweet’s video and parse it to get results.
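Since the iFrame source follows a fixed pattern, building it from a Tweet ID is a one-liner. This is a minimal sketch of that construction (the class and method names are made up for illustration, using the Tweet ID from the example above):

```java
public class IframeUrlSketch {
    // Build the video iFrame URL for a Tweet from its ID, following the
    // https://twitter.com/i/videos/tweet/{Tweet-ID} pattern described above.
    public static String videoIframeUrl(String tweetId) {
        return "https://twitter.com/i/videos/tweet/" + tweetId;
    }
}
```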
Extracting video URL from iFrame source

Now that we have the source of the iFrame, we can easily get the video source using the following flow -

public final static Pattern videoURL = Pattern.compile("video_url\\\":\\\"(.*?)\\\"");

private static String[] fetchTwitterIframeVideos(String iframeURL) {
    // Read from iframeURL line by line into BufferedReader br
    while ((line = br.readLine()) != null) {
        int index;
        if ((index = line.indexOf("data-config=")) >= 0) {
            String jsonEscHTML = (new prop(line, index, "data-config")).value;
            String jsonUnescHTML = HtmlEscape.unescapeHtml(jsonEscHTML);
            Matcher m = videoURL.matcher(jsonUnescHTML);
            if (!m.find()) {
                return new String[]{};
            }
            String url = m.group(1);
            url = url.replace("\\/", "/");  // Clean URL
            /*
             * Play with url and return results
             */
        }
    }
}

MP4 and M3U8 URLs

If we encounter an mp4 URL, we’re fine, as it is the direct link to the video. But if we encounter an m3u8 URL, we need to process it further before we can actually get to the videos. For Twitter, the hosted m3u8 videos contain links to further m3u8 videos of different resolutions. These m3u8 videos in turn contain links to various .ts files that contain the actual video in parts of 3 seconds length each, to support a better streaming experience on…
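The m3u8 layering described above (a master playlist pointing at per-resolution playlists, which in turn point at .ts segments) can be navigated with a very simple parse: in an m3u8 file, every non-empty line that is not a `#EXT` directive is a URL. This is a minimal sketch, not loklak’s actual code:

```java
import java.util.ArrayList;
import java.util.List;

public class M3u8Sketch {
    // Collect the URLs from an m3u8 playlist: every non-empty line that is
    // not a '#' directive is either a variant playlist or a .ts segment.
    public static List<String> extractUrls(String playlist) {
        List<String> urls = new ArrayList<>();
        for (String line : playlist.split("\n")) {
            line = line.trim();
            if (!line.isEmpty() && !line.startsWith("#")) {
                urls.add(line);
            }
        }
        return urls;
    }
}
```

Running this once on the master playlist yields the per-resolution playlist URLs; running it again on one of those yields the .ts segment URLs.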

Continue Reading: Fetching URL for Embedded Twitter Videos in loklak server