Adding Unit Test for Reducer in loklak search

Ngrx/store components are an integral part of loklak search. All the components depend on the data they receive from the reducers. A reducer acts like a client-side database: it stores the data received from the API response and is responsible for changing the state of the application. Reducers also supply data from the central Store to the Angular components. If the components do not receive correct data, the application can crash. Therefore, we need to test that the reducer stores the data correctly and changes the state of the application as expected.

The reducer also stores the current state properties, which are fetched from the APIs. We need to check that reducers store this data correctly and that the Angular components receive it when they request it.

In this blog post, I will explain how to build the different parts of a unit test for a reducer.

Reducer to test

This reducer is used to store the data from the suggest.json API of the loklak server. The data received from the server is split into three properties, which the components use to show auto-suggestions related to the current search query.

  • metadata: – This property stores the metadata from API suggestion response.
  • entities: – This property stores the array of suggestions corresponding to the particular query received from the server.
  • valid: – This is a boolean that records whether the suggestions are valid.

We also have two actions corresponding to this reducer. When dispatched, these actions change the state properties, which in turn supply data to the components in a structured form. The state properties also drive changes in the UI of the component according to the action dispatched.

  • SUGGEST_COMPLETE_SUCCESS: – This action is called when the data is received successfully from the server.
  • SUGGEST_COMPLETE_FAIL: – This action is called when retrieving data from the server fails.

export interface State {
  metadata: SuggestMetadata;
  entities: SuggestResults[];
  valid: boolean;
}

export const initialState: State = {
  metadata: null,
  entities: [],
  valid: true
};

export function reducer(state: State = initialState, action: suggestAction.Actions): State {
  switch (action.type) {
    case suggestAction.ActionTypes.SUGGEST_COMPLETE_SUCCESS: {
      const suggestResponse = action.payload;

      return {
        metadata: suggestResponse.suggest_metadata,
        entities: suggestResponse.queries,
        valid: true
      };
    }

    case suggestAction.ActionTypes.SUGGEST_COMPLETE_FAIL: {
      return Object.assign({}, state, {
        valid: false
      });
    }

    default: {
      return state;
    }
  }
}

Unit tests for reducers

  • Import all the actions, reducers and mocks

import * as fromSuggestionResponse from './suggest-response';
import * as suggestAction from '../actions/suggest';
import { SuggestResponse } from '../models/api-suggest';
import { MockSuggestResponse } from '../shared/mocks/suggestResponse.mock';

 

  • Next, we test that an undefined action does not cause a change in the state and that the reducer returns the initial state properties. We create an action with const action = {} as any; and call the reducer with const result = fromSuggestionResponse.reducer(undefined, action);. We then use an expect() assertion to check that the result equals initialState, i.e. that all the initial state properties are returned.

describe('SuggestReducer', () => {
  describe('undefined action', () => {
    it('should return the default state', () => {
      const action = {} as any;

      const result = fromSuggestionResponse.reducer(undefined, action);
      expect(result).toEqual(fromSuggestionResponse.initialState);
    });
  });

 

  • Now, we test the SUGGEST_COMPLETE_SUCCESS and SUGGEST_COMPLETE_FAIL actions and check that the reducer changes only the state properties assigned to each action. For the success case, we create the action from the mock response, build the expected new state in the expectedResult constant, call the reducer, and assert that the returned state equals expectedResult. For the failure case, we call the reducer with the fail action and assert that the valid property of the result is false.

  describe('SUGGEST_COMPLETE_SUCCESS', () => {
    it('should add suggest response to the state', () => {
      const ResponseAction = new suggestAction.SuggestCompleteSuccessAction(MockSuggestResponse);
      const expectedResult: fromSuggestionResponse.State = {
        metadata: MockSuggestResponse.suggest_metadata,
        entities: MockSuggestResponse.queries,
        valid: true
      };

      const result = fromSuggestionResponse.reducer(fromSuggestionResponse.initialState, ResponseAction);
      expect(result).toEqual(expectedResult);
    });
  });

  describe('SUGGEST_COMPLETE_FAIL', () => {
    it('should set valid to false', () => {
      const action = new suggestAction.SuggestCompleteFailAction();
      const result = fromSuggestionResponse.reducer(fromSuggestionResponse.initialState, action);

      expect(result.valid).toBe(false);
    });
  });
});


CSS Styling Tips Used for loklak Apps

Cascading Style Sheets (CSS) is one of the main tools for creating beautiful and dynamic websites, so we use CSS to style our apps on apps.loklak.org.

In this blog post, I am going to share a few rules and tips for using CSS when you style your app:

1. Always try something new – The loklak apps website is flexible for whoever creates an app. You are free to use any new CSS framework to create an app.

2. Strive for simplicity – As the app grows, we end up writing far more CSS rules and elements than we imagined. Some of the rules may also override each other without us noticing. It’s good practice to always check before adding a new style rule; maybe an existing one could apply.

3. Properly structured file –

  • Maintain uniform spacing.
  • Always use semantic or “familiar” class/id names.
  • Follow DRY (Don’t Repeat Yourself) Principle.

CSS file of Compare Twitter Profiles App:

#searchBar {
  width: 500px;
}

table {
  border-collapse: collapse;
  width: 70%;
}

th, td {
  padding: 8px;
  text-align: center;
  border-bottom: 1px solid #ddd;
}

 

The output screen of the app:


Do’s and Don’ts while using CSS:

  • Pages must continue to work when style sheets are disabled. Here this means that the apps written for apps.loklak.org should remain usable in every case, for instance when a user has an old browser, when there are bugs, or when styles conflict.
  • Do not use the !important declaration to override the user’s settings. Using !important is often considered bad practice because it has side effects that interfere with one of CSS’s core mechanisms: specificity. In many cases, using it indicates poor CSS architecture.
  • If you have multiple style sheets, make sure to use the same class names for the same concept in all of them.
  • Do not use more than two fonts. Using a lot of fonts simply because you can will result in a messy look.
  • A firm rule for home page design is “more is less”: the more buttons and options you put on the home page, the less capable users are of quickly finding the information they need.


How the Compare Twitter Profiles loklak App works

People tend to compare their profiles with others, and that is exactly what this app is for: comparing Twitter profiles. loklak provides many APIs that serve different purposes. The one I am using to implement this app is loklak’s User Details API. This API returns the details of a user when given the user name as the query. In this app, I implement a comparison between two Twitter profiles, which is shown in the form of tables on the output screen.

Usage of loklak’s User Profile API in the app:

In this app, the user enters the user names in the search fields as seen below:

The values entered into the search fields are used as the query in the User Profile API. The query is built in the code as follows:

var userQueryCommand = 'http://api.loklak.org/api/user.json?' +
                       'callback=JSON_CALLBACK&screen_name=' +
                       $scope.query;

var userQueryCommand1 = 'http://api.loklak.org/api/user.json?' +
                        'callback=JSON_CALLBACK&screen_name=' +
                        $scope.query1;

The query returns a JSON output from which we fetch the details we need. A simple query and its JSON output:

http://api.loklak.org/api/user.json?screen_name=fossasia

Sample json output:

{
  "search_metadata": {"client": "162.158.50.42"},
  "user": {
    "$P": "I",
    "utc_offset": -25200,
    "friends_count": 282,
    "profile_image_url_https": "https://pbs.twimg.com/profile_images/1141238022/fossasia-cubelogo_normal.jpg",
    "listed_count": 185,
    "profile_background_image_url": "http://pbs.twimg.com/profile_background_images/882420659/14d1d447527f8524c6aa0c568fb421d8.jpeg",
    "default_profile_image": false,
    "favourites_count": 1877,
    "description": "#FOSSASIA #OpenTechSummit 2017, March 17-19 in Singapore https://t.co/aKhIo2s1Ck #OpenTech community of developers & creators #Code #Hardware #OpenDesign",
    "created_at": "Sun Jun 20 16:13:15 +0000 2010",
    "is_translator": false,
    "profile_background_image_url_https": "https://pbs.twimg.com/profile_background_images/882420659/14d1d447527f8524c6aa0c568fb421d8.jpeg",
    "protected": false,
    "screen_name": "fossasia",
    "id_str": "157702526",
    "profile_link_color": "DD2E44",
    "is_translation_enabled": false,
    "translator_type": "none",
    "id": 157702526,
    "geo_enabled": true,
    "profile_background_color": "F50000",
    "lang": "en",
    "has_extended_profile": false,
    "profile_sidebar_border_color": "000000",
    "profile_location": null,
    "profile_text_color": "333333",
    "verified": false,
    "profile_image_url": "http://pbs.twimg.com/profile_images/1141238022/fossasia-cubelogo_normal.jpg",
    "time_zone": "Pacific Time (US & Canada)",
    "url": "http://t.co/eLxWZtqTHh",
    "contributors_enabled": false,
    "profile_background_tile": true,
    ...
  }
}

 

I get the data from the JSON output shown above and use different fields from it, such as screen_name and favourites_count.

Injecting data from loklak API response using Angular:

As loklak’s User Profile API returns JSON, I am using AngularJS to arrange the data according to the needs of the app.

I am using JSONP to retrieve the data from the API. JSONP, or “JSON with padding”, is a JSON extension in which the name of a callback function is passed as an argument of the call itself and the response is wrapped in a call to that function. This is how it is written in code:

$http.jsonp(String(userQueryCommand)).success(function (response) {
    $scope.userData = response.user;
 });

Here the response is stored in $scope, which is the application object. Using the $scope.userData variable, we access the data and display it on the screen using JavaScript, HTML and CSS.

<div id="contactCard" class="pull-right">
    <div class="panel panel-default">
        <div class="panel-heading clearfix">
            <h3 class="panel-title pull-left">User 1 Profile</h3>
        </div>
        <div class="list-group">
            <div class="list-group-item">
                <img ng-src="{{userData.profile_image_url}}" alt="" class="pull-left">
                <h4 class="list-group-item-heading">{{userData.name}}</h4>
            </div>
        </div>
    </div>
</div>

In this app, I am also adding keyboard handling and field validation, which prevent users from searching with an empty query, using this simple attribute on the input field.

ng-keyup="$event.keyCode == 13 && query1 != '' && query != '' ? Search() : null"

 



Introducing Priority Kaizen Harvester for loklak server

In the previous blog post, I discussed the changes made in loklak’s Kaizen harvester so it could be extended and other harvesting strategies could be introduced. Those changes made it possible to introduce a new harvesting strategy as PriorityKaizen harvester which uses a priority queue to store the queries that are to be processed. In this blog post, I will be discussing the process through which this new harvesting strategy was introduced in loklak.

Background, motivation and approach

Before jumping into the changes, we first need to understand why we need this new harvesting strategy. Let us start by discussing the issue with the Kaizen harvester.

The producer-consumer imbalance in Kaizen harvester

Kaizen uses a simple hash queue to store queries. When the queue is full, new queries are dropped. But the number of queries produced after searching for one query is much higher than the consumption rate, i.e. the queue is bound to overflow and new queries that arrive get dropped. (See loklak/loklak_server#1156)

Learnings from attempt to add blocking queue for queries

As a solution to this problem, I first tried to use a blocking queue to store the queries. In this implementation, a producer would get blocked before putting a query in the queue if it is full and would wait until there is space for more. This way, we would have a good balance between consumers and producers, as the producers would be waiting until consumers free up space for them –

public class BlockingKaizenHarvester extends KaizenHarvester {
   ...
   public BlockingKaizenHarvester() {
       super(new KaizenQueries() {
           ...
           private BlockingQueue<String> queries = new ArrayBlockingQueue<>(maxSize);

           @Override
           public boolean addQuery(String query) {
               if (this.queries.contains(query)) {
                   return false;
               }
               try {
                   // offer() returns false if no space became available before the timeout
                   return this.queries.offer(query, this.blockingTimeout, TimeUnit.SECONDS);
               } catch (InterruptedException e) {
                   DAO.severe("BlockingKaizen Couldn't add query: " + query, e);
                   return false;
               }
           }
           @Override
           public String getQuery() {
               try {
                   return this.queries.take();
               } catch (InterruptedException e) {
                   DAO.severe("BlockingKaizen Couldn't get any query", e);
                   return null;
               }
           }
           ...
       });
   }
}

[SOURCE, loklak/loklak_server#1210]

But there is an issue here. The consumers themselves are producers, at an even higher rate. When a search is performed, queries are requested to be appended to the KaizenQueries instance for the object (which here would implement a blocking queue). Now let us consider the case where the queue is full and a thread requests a query from the queue and scrapes data. When the scraping is finished, many new queries are requested to be inserted, and most of them get blocked (because the queue would be full again after one query is inserted).

Therefore, using a blocking queue in KaizenQueries is not a good thing to do.
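The blocking that causes this can be seen in isolation with a minimal standalone sketch. This is not loklak code: the class name `OfferTimeoutSketch`, the helper `tryAdd` and the capacity of 1 are made up for illustration. It relies only on the documented behaviour of `BlockingQueue.offer(e, timeout, unit)`, which returns false when no space becomes available within the timeout, so a producer facing a full queue with no active consumer waits and then gives up:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

final class OfferTimeoutSketch {

    // Tries to enqueue a query, waiting at most timeoutMillis for free space.
    static boolean tryAdd(BlockingQueue<String> queue, String query, long timeoutMillis) {
        try {
            return queue.offer(query, timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();  // Preserve the interrupt flag
            return false;
        }
    }

    public static void main(String[] args) {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(1);
        System.out.println(tryAdd(queue, "fossasia", 50));  // Space is available
        System.out.println(tryAdd(queue, "loklak", 50));    // Queue is full, nobody consumes
    }
}
```

With many scraper threads each trying to add several queries after a single `take()`, most of those calls would spend their whole timeout waiting, which is the behaviour described above.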

Other considerations

After the failure of introducing the Blocking Kaizen harvester, we looked for other alternatives for storing queries. We came across multilevel queues, persistent disk queues and priority queues.

Multilevel queues sounded like a good idea at first, where we would have multiple queues for storing queries. But eventually, this would just boil down to how much queue size we allow, and the queries would still get dropped.

Persistent disk queues would allow us to store a greater number of queries, but the major disadvantage was lookup time. It would be terribly slow to check if a query already exists in the disk queue when the queue is large. Also, since the number of queries would in practice always increase, the disk queue would also get out of hand at some point.

So by now, it was clear that never dropping queries is not an option. What we had to do was use the limited-size queue smartly, so that we do not drop queries that are important.

Solution: Priority Queue

So a good solution to our problem was a priority queue. We could assign a higher score to queries that come from more popular Tweets, so that they go higher in the queue and do not drop off until we have even higher priority queries in the queue.

Assigning score to a Tweet

Score for a Tweet was decided using the following formula –

α = 5 * (retweet count) + (favourite count)

score = α / (α + 10 * exp(-0.01 * α))

This equation generates a score between zero and one from the retweet and favourite counts of a Tweet. This normalisation ensures that we do not assign an insanely large score to Tweets with a high retweet and favourite count.

[Figure: graph showing the behaviour of the second equation]
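As a rough check of the formula, here is a minimal sketch that evaluates it directly. The class and method names (`TweetScoreSketch`, `alpha`, `score`) are hypothetical and not taken from loklak; only the two equations come from this post.

```java
final class TweetScoreSketch {

    // alpha = 5 * (retweet count) + (favourite count)
    static double alpha(long retweetCount, long favouriteCount) {
        return 5.0 * retweetCount + favouriteCount;
    }

    // score = alpha / (alpha + 10 * exp(-0.01 * alpha))
    // Mathematically below 1; in double arithmetic it can round to exactly 1.0
    // for very large alpha.
    static double score(long retweetCount, long favouriteCount) {
        double a = alpha(retweetCount, favouriteCount);
        return a / (a + 10.0 * Math.exp(-0.01 * a));
    }

    public static void main(String[] args) {
        System.out.println(score(0, 0));        // Tweet with no engagement
        System.out.println(score(2, 5));        // Lightly shared Tweet
        System.out.println(score(1000, 5000));  // Very popular Tweet
    }
}
```

A Tweet with no retweets or favourites gets a score of zero, and the exponential term shrinks quickly, so the score approaches one once α reaches a few hundred.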

Changes required in existing Kaizen harvester

To take a score into account, it became necessary to add an overload of the addQuery() method in KaizenQueries that also accepts a score as a parameter. Also, not all queries can have a score associated with them. For example, if we add a query that searches for Tweets older than the oldest in the current timeline, giving it a score wouldn’t be possible as it is not associated with a single Tweet. To tackle this, a default score of 0.5 was given to these queries –

public abstract class KaizenQueries {

   public boolean addQuery(String query) {
       return this.addQuery(query, 0.5);
   }

   public abstract boolean addQuery(String query, double score);
   ...
}

[SOURCE]

Defining appropriate KaizenQueries object

The KaizenQueries object for a priority queue had to define a wrapper class that would hold the query and its score together so that they could be inserted in a queue as a single object.

ScoreWrapper and comparator

The ScoreWrapper is a simple class that stores score and query object together –

private class ScoreWrapper {

   private double score;
   private String query;

   ScoreWrapper(String m, double score) {
       this.query = m;
       this.score = score;
   }

}

[SOURCE]

In order to define a way to sort the ScoreWrapper objects in the priority queue, we need to define a Comparator for it –

// Double.compare avoids truncating a fractional score difference to 0
private Comparator<ScoreWrapper> scoreComparator = (scoreWrapper, t1) -> Double.compare(scoreWrapper.score, t1.score);

[SOURCE]
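One detail worth keeping in mind when ordering by a score between zero and one: subtracting two such scores and casting the result to int always gives 0, so `Double.compare` is the safer tool. The sketch below is illustrative only. The class `Scored` and the comparator `HIGHEST_FIRST` are hypothetical names, and it assumes that the highest score should be polled first, which may differ from the ordering used in loklak.

```java
import java.util.Comparator;
import java.util.PriorityQueue;

final class ScoreOrderingSketch {

    // Stand-in for a query/score pair (hypothetical, for illustration).
    static final class Scored {
        final String query;
        final double score;

        Scored(String query, double score) {
            this.query = query;
            this.score = score;
        }
    }

    // Highest score first; Double.compare keeps fractional differences.
    static final Comparator<Scored> HIGHEST_FIRST =
            (a, b) -> Double.compare(b.score, a.score);

    public static void main(String[] args) {
        // The cast truncates the fractional difference toward zero.
        System.out.println((int) (0.9 - 0.2));

        PriorityQueue<Scored> queue = new PriorityQueue<>(HIGHEST_FIRST);
        queue.add(new Scored("low", 0.2));
        queue.add(new Scored("high", 0.9));
        queue.add(new Scored("default", 0.5));
        System.out.println(queue.poll().query);
    }
}
```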

Putting things together

Now that we have all the ingredients to declare our priority queue, we can also declare the strategy for getQuery and addQuery in the corresponding KaizenQueries object –

public class PriorityKaizenHarvester extends KaizenHarvester {

   private static class PriorityKaizenQueries extends KaizenQueries {
       ...
       private Queue<ScoreWrapper> queue;
       private int maxSize;

       public PriorityKaizenQueries(int size) {
           this.maxSize = size;
           queue = new PriorityQueue<>(size, scoreComparator);
       }

       @Override
       public boolean addQuery(String query, double score) {
           ScoreWrapper sw = new ScoreWrapper(query, score);
           if (this.queue.contains(sw)) {
               return false;
           }
           try {
               this.queue.add(sw);
               return true;
           } catch (IllegalStateException e) {
               return false;
           }
       }

       @Override
       public String getQuery() {
           ScoreWrapper sw = this.queue.poll();  // poll() returns null when the queue is empty
           return sw == null ? null : sw.query;
       }
       ...
   }
}

[SOURCE]
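To illustrate the overall idea of a size-limited queue that drops only the least important queries, here is a self-contained sketch. It is not the loklak implementation: the class `BoundedPriorityQueriesSketch` is hypothetical, it uses a `TreeSet` instead of a `PriorityQueue` so that both the lowest and the highest score are cheap to reach, and it tracks duplicates by query text in a separate set. Note that `java.util.PriorityQueue` is unbounded, so a size limit has to be enforced explicitly, as done here.

```java
import java.util.Comparator;
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

final class BoundedPriorityQueriesSketch {

    private static final class Entry {
        final String query;
        final double score;

        Entry(String query, double score) {
            this.query = query;
            this.score = score;
        }
    }

    private final int maxSize;
    private final Set<String> known = new HashSet<>();
    // Sorted by score; ties are broken by query text so distinct queries never collide.
    private final TreeSet<Entry> entries = new TreeSet<>(
            Comparator.comparingDouble((Entry e) -> e.score)
                    .thenComparing((Entry e) -> e.query));

    BoundedPriorityQueriesSketch(int maxSize) {
        if (maxSize < 1) {
            throw new IllegalArgumentException("maxSize must be positive");
        }
        this.maxSize = maxSize;
    }

    boolean addQuery(String query, double score) {
        if (known.contains(query)) {
            return false;  // Duplicate query
        }
        if (entries.size() >= maxSize) {
            if (entries.first().score >= score) {
                return false;  // Everything queued is at least as important
            }
            known.remove(entries.pollFirst().query);  // Evict the lowest score
        }
        entries.add(new Entry(query, score));
        known.add(query);
        return true;
    }

    String getQuery() {
        Entry best = entries.pollLast();  // Highest score, or null if empty
        if (best == null) {
            return null;
        }
        known.remove(best.query);
        return best.query;
    }
}
```

With a capacity of 2, adding queries scored 0.2 and 0.5 fills the queue; a later query scored 0.1 is rejected, while one scored 0.8 evicts the 0.2 entry and is returned first by `getQuery()`.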

Conclusion

In this blog post, I discussed the process in which PriorityKaizen harvester was introduced to loklak. This strategy is a flavour of Kaizen harvester which uses a priority queue to store queries that are to be processed. These changes were possible because of a previous patch which allowed extending of Kaizen harvester.

The changes were introduced in pull request loklak/loklak#1240 by @singhpratyush (me).


Fetching URL for Embedded Twitter Videos in loklak server

The primary web service that loklak scrapes is Twitter. Being a news and social networking service, Twitter allows its users to post videos directly, and videos can convey more than text can. But for an automated scraper, getting the video links is not a simple task.

Let us see what problems we faced with videos and how we solved them in the loklak server project.

Previous setup and embedded videos

In the previous version of loklak server, the TwitterScraper searched for videos in 2 ways –

  1. YouTube links
  2. HTML5 video links

To fetch the video URL from an HTML5 video, the following snippet was used –

if ((p = input.indexOf("<source video-src")) >= 0 && input.indexOf("type=\"video/") > p) {
   String video_url = new prop(input, p, "video-src").value;
   videos.add(video_url);  // Collect the extracted URL
   continue;
}

Here, input is the current line from raw HTML that is being processed and prop is a class defined in loklak that is useful in parsing HTML attributes. So in this way, the HTML5 videos were extracted.

The Problem – Embedded videos

Though the previous setup worked for plain HTML5 videos, it was of little use because Twitter embeds its videos in an iFrame, so they can’t be fetched using simple HTML5 tag extraction.

If we take the following Tweet for example,

the requested HTML from the search page contains the video in the following format –

<iframe src="https://twitter.com/i/videos/tweet/881946694413422593?embed_source=clientlib&player_id=0&rpc_init=1" allowfullscreen="" id="player_tweet_881946694413422593" style="width: 100%; height: 100%; position: absolute; top: 0; left: 0;"></iframe>

So we needed to come up with a better technique to get those videos.

Parsing video URL from iFrame

The <div> which contains video is marked with AdaptiveMedia-videoContainer class. So if a Tweet has an iFrame containing video, it will also have the mentioned class.

Also, the source of iFrame is of the form https://twitter.com/i/videos/tweet/{Tweet-ID}. So now we can programmatically go to any Tweet’s video and parse it to get results.

Extracting video URL from iFrame source

Now that we have the source of iFrame, we can easily get the video source using the following flow –

public final static Pattern videoURL = Pattern.compile("video_url\\\":\\\"(.*?)\\\"");

private static String[] fetchTwitterIframeVideos(String iframeURL) {
   // Read from iframeURL line by line into BufferedReader br
   while ((line = br.readLine()) != null ) {
       int index;
       if ((index = line.indexOf("data-config=")) >= 0) {
           String jsonEscHTML = (new prop(line, index, "data-config")).value;
           String jsonUnescHTML = HtmlEscape.unescapeHtml(jsonEscHTML);
           Matcher m = videoURL.matcher(jsonUnescHTML);
           if (!m.find()) {
               return new String[]{};
           }
           String url = m.group(1);
           url = url.replace("\\/", "/");  // Clean URL
           /*
            * Play with url and return results
            */
       }
   }
}
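The regular-expression step can be tried on its own with a small sketch. The class `VideoUrlSketch` and the sample input are made up for illustration (the real data-config content is not shown in this post), and the pattern here also anchors the quote before `video_url`, which is a small variation on the one above.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class VideoUrlSketch {

    // Captures the value of the "video_url" key in an unescaped JSON string.
    private static final Pattern VIDEO_URL = Pattern.compile("\"video_url\":\"(.*?)\"");

    static Optional<String> extract(String unescapedJson) {
        Matcher m = VIDEO_URL.matcher(unescapedJson);
        if (!m.find()) {
            return Optional.empty();  // No video URL in this fragment
        }
        // JSON may escape forward slashes as \/ ; turn them back into /
        return Optional.of(m.group(1).replace("\\/", "/"));
    }

    public static void main(String[] args) {
        // Hypothetical fragment with escaped slashes, for illustration only.
        String sample = "{\"video_url\":\"https:\\/\\/example.com\\/v\\/clip.mp4\"}";
        System.out.println(extract(sample).orElse("not found"));
    }
}
```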

MP4 and M3U8 URLs

If we encounter an mp4 URL, we’re fine as it is the direct link to the video. But if we encounter an m3u8 URL, we need to process it further before we can actually get to the videos.

For Twitter, the hosted m3u8 playlists contain links to further m3u8 playlists for different resolutions. These in turn contain links to various .ts files that hold the actual video in parts of 3 seconds each, to support a better streaming experience on the web.

To resolve videos in such a setup, we need to recursively parse m3u8 files and collect all the .ts videos.

private static String[] extractM3u8(String url) {
   return extractM3u8(url, "https://video.twimg.com/");
}

private static String[] extractM3u8(String url, String baseURL) {
   List<String> links = new ArrayList<>();
   // Read from url (resolved against baseURL) line by line into BufferedReader br
   while ((line = br.readLine()) != null) {
       if (line.startsWith("#")) {  // Skip comments in m3u8
           continue;
       }
       String currentURL = (new URL(new URL(baseURL), line)).toString();
       if (currentURL.endsWith(".m3u8")) {
           String[] more = extractM3u8(currentURL, baseURL);  // Recursively add all
           Collections.addAll(links, more);
       } else {
           links.add(currentURL);
       }
   }
   return links.toArray(new String[links.size()]);
}
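The recursion can be exercised without any network access. In the sketch below, the class `M3u8Sketch`, the method `collectSegments` and the `playlists` map are assumptions for illustration: the map stands in for the HTTP fetch and maps an absolute playlist URL to its text. Relative entries are resolved with `java.net.URI.resolve`.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class M3u8Sketch {

    // Collects all segment URLs reachable from playlistUrl.
    // 'playlists' replaces the real HTTP fetch (an assumption of this sketch).
    static List<String> collectSegments(String playlistUrl, Map<String, String> playlists) {
        List<String> links = new ArrayList<>();
        String body = playlists.getOrDefault(playlistUrl, "");
        for (String line : body.split("\n")) {
            line = line.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue;  // Skip blank lines and m3u8 tags/comments
            }
            String resolved = URI.create(playlistUrl).resolve(line).toString();
            if (resolved.endsWith(".m3u8")) {
                links.addAll(collectSegments(resolved, playlists));  // Nested playlist
            } else {
                links.add(resolved);  // Media segment such as a .ts file
            }
        }
        return links;
    }
}
```

The sketch assumes playlists do not reference each other in a cycle; a real implementation would probably want a depth limit or a set of visited URLs.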

And then in fetchTwitterIframeVideos, we can return all the .ts URLs for the video –

if (url.endsWith(".mp4")) {
   return new String[]{url};
} else if (url.endsWith(".m3u8")) {
   return extractM3u8(url);
}

Putting things together

Finally, the TwitterScraper can discover the video links by tweaking a little –

if (input.indexOf("AdaptiveMedia-videoContainer") > 0) {
   // Fetch Tweet ID
   String tweetURL = props.get("tweetstatusurl").value;
   int slashIndex = tweetURL.lastIndexOf('/');
   if (slashIndex < 0) {
       continue;
   }
   String tweetID = tweetURL.substring(slashIndex + 1);
   String iframeURL = "https://twitter.com/i/videos/tweet/" + tweetID;
   String[] videoURLs = fetchTwitterIframeVideos(iframeURL);
   Collections.addAll(videos, videoURLs);
}

Conclusion

This blog post explained the problem faced in extracting video URLs from Twitter and the process used to solve it. The discussed change enabled loklak to extract and serve video URLs for Tweets. It was introduced in PR loklak/loklak_server#1193 by me (@singhpratyush).

The service was further enhanced to collect single mp4 link for videos (see PR loklak/loklak_server#1206), which is discussed in another blog post.


Create Scraper in Javascript for Loklak Scraper JS

Loklak Scraper JS is the latest repository in the Loklak project. It is one of the more interesting projects because of the expected benefits of Javascript in web scraping. It runs on the Node Javascript engine and is used in the Loklak Wok project as a bundled package. It has the potential to be used in other repositories and enhance them.

Scraping in Python is easy (at least for Pythonistas): one just needs to import the Requests library and the BeautifulSoup library (with lxml as a better parser option), write a few lines using Requests to get the webpage and a few lines of bs4 to walk through the HTML and scrape data. This adds up to less than a hundred lines of code. Javascript code, on the other hand, isn’t as easily readable (at least to me) as Python. But it has an advantage: it can easily deal with Javascript in the pages we are scraping. This is one of the motives for which the Loklak Scraper JS repository was created and for which we contributed to it.

I recently coded a Javascript scraper in the loklak_scraper_js repository. While coding, I found its libraries similar to the ones I use in Python. Therefore, this blog post is for Pythonistas, so that they can start scraping in Javascript once they finish reading and also contribute to Loklak Scraper JS.

First, replace the Python interpreter, the Requests library and the BeautifulSoup library with the Node JS interpreter, the Request library and the Cheerio JS library.

1) Node JS Interpreter: The Node JS interpreter is used to run Javascript files. Unlike Python, which deals with modules, it deals with the project as a whole. The most compatible Node version for most of the libraries is 6.0.0, whereas the latest version available (as I checked) is 8.0.0.

TIP: use `--save` with npm while installing a library.

2) Request Library: This is used to load the webpage to be processed, similar to Requests in Python.

The Request-promise library, a wrapper around Request built on the Bluebird library, improves readability and makes code cleaner.

3) Cheerio Library: A Pythonista (a rookie one) can call it the twin of the BeautifulSoup library. But it is faster and is Javascript. Its selector implementation is nearly identical to jQuery’s.

Let us code a basic Javascript scraper. I will take TimeAndDate scraper from loklak_scraper_js as example here. It inputs place and outputs its local time.

Step#1: Fetching HTML from the webpage with the help of the Request library.

We pass the url to the request function to fetch the webpage, which is saved to the `html` variable. The scrapeTimeAndDate() function then scrapes data from html.

var request = require("request");

var url = "http://www.timeanddate.com/worldclock/results.html?query=London";
var html;

request(url, function(error, response, body) {
  if (error) {
    console.log("Error: " + error);
    process.exit(-1);
  }
  html = body;
  scrapeTimeAndDate();
});

 

Step#2: Scraping important data from html using Cheerio JS

The list of dates and times of locations is embedded in a table tag, so we will iterate through <td> and extract text.

  a) Load html into Cheerio as we do in BeautifulSoup

In Python

soup = BeautifulSoup(html,'html5lib')

 

In Cheerio JS

var cheerio = require("cheerio");
var $ = cheerio.load(html);

 

  b) This line finds all tr tags inside the table tag.

var htmlTime = $("table").find('tr');

 

  c) Iterate through the rows using the each() function and read their td tags. This function acts like a loop (in Python), iterating through the list of elements from which data will be extracted.

htmlTime.each(function (index, element) {
  // in python, we will use loop, `for element in elements:`
  var tag = $(element).find("td");    // in python, `tag = soup.find_all('td')`
  if (tag.text() != "") {
    // ...
    // EXTRACT DATA
    // ...
  } else {
    // go to next td tag
    tag = tag.next();
  }
});

 

  d) To extract data

Cheerio JS loads html and uses the DOM model to traverse it. The DOM model treats html as a tree. So, go to the tag and scrape the data you want.

// extract location (text) enclosed in tag
location = tag.text();
// go to next tag
tag = tag.next();
// extract time (text) enclosed in tag
time = tag.text();
// save in dictionary like in python
loc_list["location"] = location;
loc_list["time"] = time;

 

Some other useful functions:

1) $(selector, [context], [root])

Returns the object(s) matching the selector (any tag, class or id), searched within context, inside root

2) $("table").attr(name, value)

Sets attribute `name` to `value` on the matched tag; called with only `name`, it returns the attribute’s value

3) obj.html()

To get the html enclosed in the tag

For more, see the Cheerio documentation.

Step#3: Execute scraper using command

node <scrapername>.js

 

Hoping that this blog is able to show how to scrape in Javascript by finding similarities with Python.


Best Practices when writing Tests for loklak Server

Why do we write unit tests? We write them to ensure that a developer’s implementation doesn’t change the behaviour of other parts of the project. If there is a change in behaviour, the unit tests fail. This keeps developers at ease during integration of the software and lowers the chances of unexpected bugs.

After setting up the tests in Loklak Server, we were able to check whether or not a test had failed. But the test failures didn’t describe the error or the exact test case at which they failed. It was YoutubeScraperTest that brought some of the best practices into the project, and we modified the other tests accordingly.

The following are some of the best practices, in 5 points, that we should follow while writing unit tests:

  1. Assert the assertions

There are many assert methods we can use, like assertNull, assertEquals etc. But we should use the one that describes the error best, so that the developer’s effort is reduced while debugging.

Following these assertion preferences helps in getting to the exact error when a test fails, making the code easier to debug.

Some examples:

  • Using assertThat() over assertTrue

assertThat() gives more descriptive errors than assertTrue(). For example:

When assertTrue() is used:

java.lang.AssertionError: Expected: is <true> but: was <false> at org.loklak.harvester.TwitterScraperTest.testSimpleSearch(TwitterScraperTest.java:142) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at org.hamcr.......... 

 

When assertThat() is used:

java.lang.AssertionError:
Expected: is <true>
     but: was <false>
at org.loklak.harvester.TwitterScraperTest.testSimpleSearch(TwitterScraperTest.java:142)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at org.hamcr...........

 

NOTE: In many cases assertThat() is preferred over the other assert methods (read this), but in some cases another method gives more descriptive output (as in the next example).

  • Using assertEquals() over assertThat()

For assertThat()

java.lang.AssertionError:
Expected: is "ar photo #test #car https://pic.twitter.com/vd1itvy8Mx"
     but: was "car photo #test #car https://pic.twitter.com/vd1itvy8Mx"
at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
at org.junit.Assert.assertThat(Ass........

 

For assertEquals()

org.junit.ComparisonFailure: expected:<[c]ar photo #test #car ...> but was:<[]ar photo #test #car ...>
at org.junit.Assert.assertEquals(Assert.java:115)
at org.junit.Assert.assertEquals(Assert.java:144)
at org.loklak.harvester.Twitter.........

 

We can clearly see that the second example describes the error better than the first one: it brackets the exact character at which the two strings differ. (An SO link)

  2. One Test per Behaviour

Each test should be independent of the others, with no mutual dependencies. It should test only one specific behaviour of the module under test.

Have a look at this snippet. The test checks the method that creates the Twitter URL by comparing the URL it returns with the expected URL.

@Test
public void testPrepareSearchURL() {
    String url;
    String[] query = {
        "fossasia", "from:loklak_test",
        "spacex since:2017-04-03 until:2017-04-05"
    };
    String[] filter = {"video", "image", "video,image", "abc,video"};
    String[] out_url = {
        "https://twitter.com/search?f=tweets&vertical=default&q=fossasia&src=typd",
        "https://twitter.com/search?f=tweets&vertical=default&q=fossasia&src=typd",
    };

    // checking simple urls
    for (int i = 0; i < query.length; i++) {
        url = TwitterScraper.prepareSearchURL(query[i], "");

        // compare the created url (actual value) with the expected url
        assertThat(url, is(out_url[i]));
    }
}

 

This unit test checks a single behaviour: whether the method under test creates the Twitter link that matches the query.

  3. Selecting test cases for the test

We should remember that testing is costly in terms of processing, since tests take time to execute. That is why we need to keep the test cases precise and limited. In loklak server, most of the tests connect to the respective websites, and this step is very costly. In the implementation, we should therefore use the smallest number of test cases that still covers all the corner cases.

  4. Test names

Descriptive test names that are short but hint at their task are very helpful. A comment describing what the test does is a plus. The following example is from YoutubeScraperTest. I added this point to my ‘best practices queue’ after reviewing the code (when this module was under review).

/**
 * When try parse video from input stream should check that video parsed.
 * @throws IOException if some problem with open stream for reading data.
 */
@Test
public void whenTryParseVideoFromInputStreamShouldCheckThatJSONObjectGood() throws IOException {
    //Some tests related to method
}

 

  5. Accessing methods

This last point should be kept in mind. In loklak server, some tests use the Reflection API to access private and protected methods. This is the best example of the Reflection API.

In general, changing access specifiers just for the sake of a test is not acceptable, so we should resolve this issue with the help of:

  •  Setters and Getters (if available, use it or else create them)
  •  Else use Reflection

If getter methods are not available, the Reflection API is the last resort for accessing the private and protected members of a class. Here is a simple example of how a private method can be accessed using reflection:

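// requires: import java.lang.reflect.Method;
// A stands for any class that has a private method changeValue(int)
void getPrivateMethod() throws Exception {
    A ret = new A();
    Class<?> clazz = ret.getClass();
    // look up the private method by its name and parameter type
    Method method = clazz.getDeclaredMethod("changeValue", Integer.TYPE);
    // lift the access check so that the private method can be invoked
    method.setAccessible(true);
    // the first argument is the target instance; pass null if the method is static
    System.out.println(method.invoke(ret, 2));
}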

 

I should end here. Try applying these practices, go through the links and get in sync with these ‘Best Practices’ 🙂


Fetching URL for Complete Twitter Videos in loklak server

In the previous blog post, I discussed how to fetch the URLs for Twitter videos in parts (.ts extension). But getting a video in parts is not beneficial, as loklak users have to carry out the following tasks in order to make sense of it:

  • Placing the videos in the correct order (the videos are divided into 3-second sections).
  • Having the proper libraries and a video player to play the .ts extension.

This would require fairly complex loklak clients, and hence the requirement was to have the complete video in a single link with a popular extension. In this blog post, I’ll be discussing how I managed to get links to complete Twitter videos.

Guests and Twitter Videos

Most of the content on Twitter is publicly accessible and we don’t need an account to access it. And this public content includes videos too. So, there should be some way in which Twitter would be handling guest users and serving them the videos. We needed to replicate the same flow in order to get links to those videos.

Problem with Twitter video and static HTML

In Twitter, videos are not served with the static HTML of a page. They are generally rendered using a front-end JavaScript framework. Let us take the mobile.twitter.com website as an example.

Let us consider the video from a tweet of @HiHonourIndia

We can see that the page is rendered using ReactJS and we also have the direct link for the video –

“So what’s the problem then? We can just request the web page and parse HTML to get video link, right?”

Wrong. As I mentioned earlier, the pages are rendered using React and when we initially request it, it looks something like this –

The HTML contains no link to video whatsoever, and keeping in mind that we would be getting the previously mentioned HTML, the scraper wouldn’t be getting any video link either.

We, therefore, need to mimic the flow that the web app follows internally to get the video links and play them.

Mimicking the flow of Twitter Mobile to get video links

After tracking the XHR requests made by the Twitter Mobile web app, one can come up with the following flow to get video URLs.

Mobile URL for a Tweet

Getting mobile URL for a tweet is very simple –

String mobileUrl = "https://mobile.twitter.com" + tweetUrl;

Here, tweet URL is of the type /user/tweetID.

Guest Token and Bearer JS URL

The Bearer JS is a file that contains the Bearer Token which, along with a Guest Token, is used to authenticate with the Twitter API to get details about a conversation. The guest token and the bearer script URL can be extracted from the static mobile page –

Pattern bearerJsUrlRegex = Pattern.compile("showFailureMessage\\(\\'(.*?main.*?)\\'\\);");
Pattern guestTokenRegex = Pattern.compile("document\\.cookie \\= decodeURIComponent\\(\\\"gt\\=([0-9]+);");
// Both values stay null until their pattern matches a line of the page
String bearerJsUrl = null;
String guestToken = null;
ClientConnection conn = new ClientConnection(mobileUrl);
BufferedReader br = new BufferedReader(new InputStreamReader(conn.inputStream, StandardCharsets.UTF_8));
String line;
while ((line = br.readLine()) != null) {
   if (bearerJsUrl != null && guestToken != null) {
       // Both the entities are found
       break;
   }
   if (line.length() == 0) {
       continue;
   }
   Matcher m = bearerJsUrlRegex.matcher(line);
   if (m.find()) {
       bearerJsUrl = m.group(1);
       continue;
   }
   m = guestTokenRegex.matcher(line);
   if (m.find()) {
       guestToken = m.group(1);
   }
}

[SOURCE]
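The two patterns are ordinary regular expressions, so what they capture can be checked outside the server. The following is a minimal sketch in TypeScript (the language of the loklak search client) that mirrors the loop above on made-up sample lines. The function extractTokens, the PageTokens type and the sample page content are hypothetical; the real mobile page may look different.

```typescript
// Same patterns as in the Java snippet, written as JavaScript regex literals.
const bearerJsUrlRegex = /showFailureMessage\('(.*?main.*?)'\);/;
const guestTokenRegex = /document\.cookie = decodeURIComponent\("gt=([0-9]+);/;

// Hypothetical container for the two values pulled from the page.
interface PageTokens {
  bearerJsUrl: string | null;
  guestToken: string | null;
}

// Scans the page line by line and stops as soon as both values are found.
function extractTokens(lines: string[]): PageTokens {
  const found: PageTokens = { bearerJsUrl: null, guestToken: null };
  for (const line of lines) {
    if (found.bearerJsUrl !== null && found.guestToken !== null) {
      break;
    }
    if (line.length === 0) {
      continue;
    }
    const bearer = bearerJsUrlRegex.exec(line);
    if (bearer !== null) {
      found.bearerJsUrl = bearer[1] ?? null;
      continue;
    }
    const guest = guestTokenRegex.exec(line);
    if (guest !== null) {
      found.guestToken = guest[1] ?? null;
    }
  }
  return found;
}

// Made-up sample lines, for illustration only.
const samplePage: string[] = [
  "showFailureMessage('https://example.invalid/scripts/main.abc123.js');",
  'document.cookie = decodeURIComponent("gt=1234567890; Max-Age=10800;");',
];
console.log(extractTokens(samplePage));
```

As in the Java loop, a line that matches the bearer pattern is not checked for the guest token, so the sketch assumes the two values appear on different lines.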

Getting Bearer Token from Bearer JS URL

The following simple method can be used to fetch the Bearer Token from URL –

private static final Pattern bearerTokenRegex = Pattern.compile("BEARER_TOKEN:\\\"(.*?)\\\"");
private static String getBearerTokenFromJs(String jsUrl) throws IOException {
   ClientConnection conn = new ClientConnection(jsUrl);
   BufferedReader br = new BufferedReader(new InputStreamReader(conn.inputStream, StandardCharsets.UTF_8));
   String line = br.readLine();
   Matcher m = bearerTokenRegex.matcher(line);
   if (m.find()) {
       return m.group(1);
   }
   throw new IOException("Couldn't get BEARER_TOKEN");
}

[SOURCE]

Using the Guest Token and Bearer Token to get Video Links

The following method demonstrates the process of getting video links once we have all the required information –

private static String[] getConversationVideos(String tweetId, String bearerToken, String guestToken) throws IOException {
   String conversationApiUrl = "https://api.twitter.com/2/timeline/conversation/" + tweetId + ".json";
   CloseableHttpClient httpClient = getCustomClosableHttpClient(true);
   HttpGet req = new HttpGet(conversationApiUrl);
   req.setHeader("User-Agent", "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.98 Safari/537.36");
   req.setHeader("Authorization", "Bearer " + bearerToken);
   req.setHeader("x-guest-token", guestToken);
   HttpEntity entity = httpClient.execute(req).getEntity();
   String html = getHTML(entity);
   consumeQuietly(entity);
   try {
       JSONArray arr = (new JSONObject(html)).getJSONObject("globalObjects").getJSONObject("tweets")
               .getJSONObject(tweetId).getJSONObject("extended_entities").getJSONArray("media");
       JSONObject obj2 = (JSONObject) arr.get(0);
       JSONArray videos = obj2.getJSONObject("video_info").getJSONArray("variants");
       ArrayList<String> urls = new ArrayList<>();
       for (int i = 0; i < videos.length(); i++) {
           String url = ((JSONObject) videos.get(i)).getString("url");
           urls.add(url);
       }
       return urls.toArray(new String[urls.size()]);
   } catch (JSONException e) {
       // This is not an issue. Sometimes, there are videos in long conversations but other ones get media class
       //  div, so this fetching process is triggered.
   }
   return new String[]{};
}

[SOURCE]

Checking if a Tweet contains video

To recognise whether a tweet contains a video, we can add the following lines to TwitterScraper.java

if (input.indexOf("AdaptiveMedia-videoContainer") > 0) {
   // Do necessary things
}

[SOURCE]

Limitations

Though this method successfully extracts the links to complete Twitter videos, it makes the scraping process very slow. This is because, for every tweet that contains a video, three HTTP requests are made in order to finalise the tweet. Keeping in mind that there are up to 20 tweets per search from Twitter, we get instances where more than 10 of them contain videos (more than 30 HTTP requests). Also, there is a lot of JSON and regex processing involved, which adds a little to the slowdown.

Conclusion

This post explained how loklak server was improved to fetch links to complete Twitter videos, and the exact flow of requests needed to achieve this. The changes were proposed in pull request loklak/loklak_server#1206.


Improving Performance of the Loklak Search with Lazy Loading Images

Loklak Search initially faced the problem of a huge load time because of a high number of parallel HTTP requests. The feeds page of the search engine made close to 50 parallel HTTP requests on every new search. One possible way to reduce load time on long pages is to load images lazily. Since loklak is a multi-platform web application, a majority of its visitors might use it from high-latency devices (i.e. mobiles, tablets), so lazy loading is required for a smooth user experience (as website speed equals user experience).

In this blog, I explain how I took advantage of a directive to implement lazy loading of images and how the performance of the application improved.

What is lazy loading?

As this project website states, “Lazy loading is just the opposite of ‘pre-loading images’. The basic idea behind lazy loading is to keep a number of parallel request low. The amount of request is kept low until the user scrolls down then the images will load.” This idea is used by Google images to reduce the number of parallel requests.

The image below shows the number of parallel image requests sent to the server without lazy loading:

Using viewport directive in loklak to lazily load images

Lazy loading can be implemented in Angular using a viewport directive. We need to set up the directive and use it in the feed component to lazily load the users’ profile pictures and the images.

Using viewport directive in Components

  • In this directive, we have two observables:
  1. Scroll, which keeps a check on the scroll position of the page
    this.scroll =
    Observable.fromEvent(window, 'scroll').subscribe((event) =>
    {
    this.check();
    });

  2. Resize, which keeps a check on the resize event when the browser window changes size
    this.resize =
    Observable.fromEvent(window, 'resize').subscribe((event) =>
    {
    this.check();
    });

     

Now, whenever a resize or scroll event occurs, it calls the method check(), which calculates the dimensional parameters of the element and emits an event whose boolean value is true once the element has entered the viewport. check() takes a reference to the element. Next, it calculates the area occupied by this element, i.e. elSize. Then it looks up the position of the element within the viewport with getBoundingClientRect(). The parameter partial can be set to true if we need to check whether the image appears in the viewport partially. Moreover, the parameter direction specifies the direction from which the element will be entering. Finally, conditional statements check whether the element lies within the dimensions of the viewport, and the event is emitted.

check(partial: boolean = true, direction: string = 'both') {
  const el = this._el.nativeElement;
  const elSize = (el.offsetWidth * el.offsetHeight);
  const rec = el.getBoundingClientRect();
  const vp = {
    width: window.innerWidth,
    height: window.innerHeight
  };
  const tViz = rec.top >= 0 && rec.top < vp.height;
  const bViz = rec.bottom > 0 && rec.bottom <= vp.height;
  const lViz = rec.left >= 0 && rec.left < vp.width;
  const rViz = rec.right > 0 && rec.right <= vp.width;
  const vVisible = partial ? tViz || bViz : tViz && bViz;
  const hVisible = partial ? lViz || rViz : lViz && rViz;
  let event = {
    target: el,
    value: false
  };
  if (direction === 'both') {
    event['value'] = (elSize && vVisible && hVisible) ? true : false;
  }
  else if (direction === 'vertical') {
    event['value'] = (elSize && vVisible) ? true : false;
  }
  else if (direction === 'horizontal') {
    event['value'] = (elSize && hVisible) ? true : false;
  }
  this.inViewport.emit(event);
}
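The visibility rule inside check() does not depend on Angular or the DOM, so it can be tried out on plain numbers. The following is a minimal sketch of the same comparisons as a pure function. The names isVisible, Rect and Viewport are hypothetical and are not part of the directive; the rectangle stands in for the result of getBoundingClientRect().

```typescript
// Hypothetical stand-ins for the values the directive reads from the DOM.
interface Rect { top: number; bottom: number; left: number; right: number; }
interface Viewport { width: number; height: number; }
type Direction = 'both' | 'vertical' | 'horizontal';

// Same edge comparisons as check(), returning the boolean instead of emitting it.
function isVisible(rec: Rect, vp: Viewport, elSize: number,
                   partial: boolean = true, direction: Direction = 'both'): boolean {
  const tViz = rec.top >= 0 && rec.top < vp.height;
  const bViz = rec.bottom > 0 && rec.bottom <= vp.height;
  const lViz = rec.left >= 0 && rec.left < vp.width;
  const rViz = rec.right > 0 && rec.right <= vp.width;
  const vVisible = partial ? tViz || bViz : tViz && bViz;
  const hVisible = partial ? lViz || rViz : lViz && rViz;
  if (elSize <= 0) {
    return false; // an element without area never counts as visible
  }
  if (direction === 'vertical') {
    return vVisible;
  }
  if (direction === 'horizontal') {
    return hVisible;
  }
  return vVisible && hVisible;
}

// A card whose top edge is on screen but whose bottom edge is below the fold.
const phone: Viewport = { width: 800, height: 600 };
const halfShownCard: Rect = { top: 500, bottom: 700, left: 0, right: 300 };
console.log(isVisible(halfShownCard, phone, 60000));        // true (partial match)
console.log(isVisible(halfShownCard, phone, 60000, false)); // false (not fully inside)
```

One consequence of comparing edges only: an element taller than the viewport, with its top above and its bottom below the screen, has no edge inside the viewport and is reported as not visible.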

 

  • Next, we need to import the viewport directive and declare it in the module using this structure:

import { InViewportDirective } from '../shared//in-viewport.directive';

@NgModule({
declarations: [
InViewportDirective
]
})

 

  • Create a method in the respective component’s class that would keep a check on the boolean value returned by the directive

public inview(event) {
  if (event.value === true) {
    this.inviewport = event.value;
  }
}

 

In this step we use the viewport directive as an attribute directive, and the $event value it emits is passed as a parameter to the inview method of the component’s class. Basically, if the feed card comes into the viewport (even partially), the event is emitted and the image element is displayed on the screen. As the image element is displayed, a call is made to fetch the image. The img element is controlled with an *ngIf statement that checks whether inviewport is true or false.

<span class="card" in-viewport (inViewport)="inview($event)">
<img src="{{feedItem.user.profile_image_url_https}}" *ngIf="inviewport"/>
</span>

 

The images will load lazily, only when the element is in the viewport of the device. Consequently, the number of parallel image requests sent will decrease and application performance will increase. In fact, this directive can be used for lazy loading of all media elements in your application, not just images.

The image below shows the reduced number of image requests sent during the initialization of the feed result when the viewport directive is used:


Implementing Auto-Suggestions in loklak search

Auto-suggestions can add a friendly touch to the search engine. Loklak provides the suggest.json API to give suggestions based on the previous search terms entered by users. Moreover, suggest results need to be reactive and quick enough to be displayed as soon as the user types a new character.

The main demand of the auto-suggestion feature was to make it quick enough to feel reactive.

I will explain how I implemented the auto-suggestion feature, the issues I faced and how I solved them.

Ngrx Effects

The cycle for implementing auto-suggest goes like this:

The most important component in this cycle is the effect, as it is the event listener: it recognises the action immediately after it is dispatched and makes a call to the loklak suggest.json API. We will look at what the effect should look like for a reactive implementation of auto-suggestion. Making the effect run efficiently leaves the API response time as the rate-determining step, instead of any other component.

@Injectable()
export class SuggestEffects {

  @Effect()
  suggest$: Observable<Action>
    = this.actions$
      .ofType(suggestAction.ActionTypes.SUGGEST_QUERY)
      .debounceTime(300)
      .map((action: suggestAction.SuggestAction) => action.payload)
      .switchMap(query => {
        const nextSuggest$ = this.actions$.ofType(suggestAction.ActionTypes.SUGGEST_QUERY).skip(1);

        return this.suggestService.fetchQuery(query)
          .takeUntil(nextSuggest$)
          .map(response => {
            return new suggestAction.SuggestCompleteSuccessAction(response);
          })
          .catch(() => of(new suggestAction.SuggestCompleteFailAction()));
      });

  constructor(
    private actions$: Actions,
    private suggestService: SuggestService,
  ) { }

}

 

This effect listens for the SUGGEST_QUERY action. Once it recognises the action, it makes a call to the suggestion service, which receives data from the server. When the data is received, it maps the response and passes it to SuggestCompleteSuccessAction, which changes the corresponding state properties. The debounce time is kept low (300 ms): the effect waits until no new SUGGEST_QUERY has arrived for 300 ms before calling the suggest.json API, so a burst of keystrokes results in a single request. This keeps the whole suggest cycle reactive and helps in early detection of the action.
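The effect of debounceTime(300) on a stream of keystrokes can be illustrated without RxJS. The following is a minimal sketch that applies the rule to a recorded list of queries: a query goes through only if no newer query follows it within the debounce window. The names debounceQueries and Keystroke and the timestamps are hypothetical, and treating a gap of exactly 300 ms as long enough is an assumption of the sketch.

```typescript
// Hypothetical record of one dispatched SUGGEST_QUERY: when it happened and its text.
interface Keystroke { time: number; query: string; } // time in milliseconds

// Returns the queries that would reach the API call after debouncing.
function debounceQueries(events: Keystroke[], dueTime: number): string[] {
  const emitted: string[] = [];
  events.forEach((current, i) => {
    const next: Keystroke | undefined = i + 1 < events.length ? events[i + 1] : undefined;
    // A query survives only if nothing newer arrives within dueTime.
    if (next === undefined || next.time - current.time >= dueTime) {
      emitted.push(current.query);
    }
  });
  return emitted;
}

// The user types "lok", pauses, then continues to "lokla".
const typed: Keystroke[] = [
  { time: 0, query: 'l' },
  { time: 100, query: 'lo' },
  { time: 250, query: 'lok' },
  { time: 900, query: 'lokl' },
  { time: 1000, query: 'lokla' },
];
console.log(debounceQueries(typed, 300)); // [ 'lok', 'lokla' ]
```

Five keystrokes lead to two API calls here. In the real effect, switchMap and takeUntil additionally drop a response that is still pending when a newer query comes in.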

Angular Components

In this section, I will explain the changes I made to the feed header component for autocomplete to run effectively.

@Component({
  selector: 'feed-header-component',
  templateUrl: './feed-header.component.html',
  styleUrls: ['./feed-header.component.scss'],
  changeDetection: ChangeDetectionStrategy.OnPush
})
export class FeedHeaderComponent {
  public suggestQuery$: Observable<Query>;
  public isSuggestLoading$: Observable<boolean>;
  public suggestResponse$: Observable<SuggestResults[]>;
  public searchInputControl = new FormControl();
  // subscriptions are collected here so that they can be released later
  private __subscriptions__: Subscription[] = [];

  constructor(
    private store: Store<fromRoot.State>,
  ) { }

  ngOnInit() {
    this.getDataFromStore();
    this.setupSearchField();
  }

  private getDataFromStore(): void {
    this.suggestQuery$ = this.store.select(fromRoot.getSuggestQuery);
    this.isSuggestLoading$ = this.store.select(fromRoot.getSuggestLoading);
    this.suggestResponse$ = this.store.select(fromRoot.getSuggestResponseEntities);
  }

  private setupSearchField(): void {
    this.__subscriptions__.push(
      this.searchInputControl
        .valueChanges
        .subscribe(query => {
          // dispatch the value that was just typed into the search box
          this.store.dispatch(new suggestAction.SuggestAction(query));
        })
    );
  }
}

 

We have created a FormControl, searchInputControl, which keeps a check on the input value of the search box and dispatches a SuggestAction as soon as the value changes. Next, we select the observables of the suggestion entities from the store, as in the function getDataFromStore(). Once the data is received from the store, we can proceed with displaying the suggestion data in the template.

Template

HTML has a datalist tag which can be used to display auto-suggestions in the template, but datalist is less compatible with the Angular application and makes HTML rendering slow when the next suggestions are based on a different query history.

Therefore, we can either use Angular Material’s md-autocomplete or create an auto-suggestion box from scratch (which would add the advantage of customising the CSS/TS code). md-autocomplete has the advantage of being compatible with the Angular application, so in loklak search we prefer it.

<input mdInput required search autocomplete="off" type="text" id="search" [formControl]="searchInputControl" [mdAutocomplete]="searchSuggestBox"
(keyup.enter)="relocateEvent.emit(query)" [(ngModel)]="query"/>
<md-autocomplete #searchSuggestBox="mdAutocomplete">
<md-option *ngFor="let suggestion of (suggestResponse$ | async).slice(0, 4)" [value]="(suggestQuery$ | async)">{{suggestion.query}}</md-option>
</md-autocomplete>

 

In the input element, we added the attribute [mdAutocomplete]="searchSuggestBox", which links the input box to the md-autocomplete declared with #searchSuggestBox="mdAutocomplete". This attaches the auto-suggestion box to the input.
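The template limits the list to four entries with (suggestResponse$ | async).slice(0, 4). The same rule could be kept in a small helper that also tolerates the null value an async pipe yields before the first emission. This is only a sketch; topSuggestions and the Suggestion shape are hypothetical names and are not part of loklak search.

```typescript
// Hypothetical shape: only the field that the template displays is modelled.
interface Suggestion { query: string; }

// Returns at most `limit` suggestions; null or undefined input gives an empty list.
function topSuggestions(results: Suggestion[] | null | undefined, limit: number = 4): Suggestion[] {
  if (!results) {
    return [];
  }
  return results.slice(0, limit);
}

// Made-up suggestions, for illustration only.
const received: Suggestion[] = [
  { query: 'fossasia' }, { query: 'fossasia summit' }, { query: 'foss' },
  { query: 'fossil' }, { query: 'fosdem' },
];
console.log(topSuggestions(received).map(s => s.query));
```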

Now, the chain is created with user input being the source of action and display of auto-suggestion data being the reaction.
