Implementing Tweet Search Suggestions in Loklak Wok Android

Loklak Wok Android not only is a peer harvester for Loklak Server but it also provides users to search tweets using Loklak’s API endpoints. To provide a better search tweet search experience to the users, the app provides search suggestions using suggest API endpoint. The blog describes how “Search Suggestions” is implemented.

Third Party Libraries used to Implement Suggestion Feature

  • Retrofit2: Used for sending network request
  • Gson: Used for serialization, JSON to POJOs (Plain old java objects).
  • RxJava and RxAndroid: Used to implement a clean asynchronous workflow.
  • Retrolambda: Provides support for lambdas in Android.

These libraries can be installed by adding the following dependencies in app/build.gradle

android {
   .
   // removes rxjava file repetations
   packagingOptions {
      exclude 'META-INF/rxjava.properties'
   }
}

dependencies {
   // gson and retrofit2
    compile 'com.google.code.gson:gson:2.8.1'
    compile 'com.squareup.retrofit2:retrofit:2.3.0'
    compile 'com.squareup.retrofit2:converter-gson:2.3.0'
    compile 'com.squareup.retrofit2:adapter-rxjava2:2.3.0'

   // rxjava and rxandroid
    compile 'io.reactivex.rxjava2:rxjava:2.0.5'
    compile 'io.reactivex.rxjava2:rxandroid:2.0.1'
    compile 'com.jakewharton.rxbinding2:rxbinding:2.0.0'
}

 

To add retrolambda

// in project's build.gradle
dependencies {
    
    classpath 'me.tatarka:gradle-retrolambda:3.2.0'
}

// in app level build.gradle at the top
apply plugin: 'me.tatarka.retrolambda'

 

Fetching Suggestions

Retrofit2 sends a GET request to search API endpoint, the JSON response returned is serialized to Java Objects using the models defined in models.suggest package. The models can be easily generated using JSONSchema2Pojo. The benefit of using Gson is that, the hard work of parsing JSON is easily handled by it. The static method createRestClient creates the retrofit instance to be used for network calls

private static void createRestClient() {
   sRetrofit = new Retrofit.Builder()
           .baseUrl(BASE_URL) // base url : https://api.loklak.org/api/
           .addConverterFactory(GsonConverterFactory.create(gson))
           .addCallAdapterFactory(RxJava2CallAdapterFactory.create())
           .build();
}

 

The suggest endpoint is defined in LoklakApi interface

public interface LoklakApi {

   @GET("/api/suggest.json")
   Observable<SuggestData> getSuggestions(@Query("q") String query);

   @GET("/api/suggest.json")
   Observable<SuggestData> getSuggestions(@Query("q") String query, @Query("count") int count);

   .
}

 

Now, the suggestions are obtained using fetchSuggestion method. First, it creates the rest client to send network requests using createApi method (which internally calls creteRestClient implemented above). The suggestion query is obtained from the EditText. Then the RxJava Observable is subscribed in a separate thread which is specially meant for doing IO operations and finally the obtained data is observed i.e. views are inflated in the MainUI thread.

private void fetchSuggestion() {
   LoklakApi loklakApi = RestClient.createApi(LoklakApi.class); // rest client created
   String query = tweetSearchEditText.getText().toString(); // suggestion query from EditText
   Observable<SuggestData> suggestionObservable = loklakApi.getSuggestions(query); // observable created
   Disposable disposable = suggestionObservable
           .subscribeOn(Schedulers.io()) // subscribed on IO thread
           .observeOn(AndroidSchedulers.mainThread()) // observed on MainUI thread
           .subscribe(this::onSuccessfulRequest, this::onFailedRequest); // views are manipulated accordingly
   mCompositeDisposable.add(disposable);
}

 

If the network request is successful onSuccessfulRequest method is called which updates the data in the RecyclerView.

private void onSuccessfulRequest(SuggestData suggestData) {
   if (suggestData != null) {
       mSuggestAdapter.setQueries(suggestData.getQueries()); // data updated.
   }
   setAfterRefreshingState();
}

 

If the network request fails then onFailedRequest is called which displays a toast saying “Cannot fetch suggestions, Try Again!”. If requests are sent simultaneously and they fail, the previous message i.e. the previous toast is removed.

private void onFailedRequest(Throwable throwable) {
   Log.e(LOG_TAG, throwable.toString());
   if (mToast != null) { // checks if a previous toast is present
       mToast.cancel(); // removes the previous toast.
   }
   setAfterRefreshingState();
   // value of networkRequestError: "Cannot fetch suggestions, Try Again!"
   mToast = Toast.makeText(getActivity(), networkRequestError, Toast.LENGTH_SHORT); // toast is crated
   mToast.show(); // toast is displayed
}

 

Lively Updating suggestions

One way to update suggestions as the user types in, is to send a GET request with a query parameter to suggest API endpoint and check if a previous request is incomplete cancel it. This includes a lot of IO work and seems unnecessary because we would be sending request even if the user hasn’t completed typing what he/she wants to search. One probable solution is to use a Timer.

private TextWatcher searchTextWatcher = new TextWatcher() {  
    @Override
    public void afterTextChanged(Editable arg0) {
        // user typed: start the timer
        timer = new Timer();
        timer.schedule(new TimerTask() {
            @Override
            public void run() {
                String query = editText.getText()
                // send network request to get suggestions
            }
        }, 600); // 600ms delay before the timer executes the „run" method from TimerTask
    }

    @Override
    public void beforeTextChanged(CharSequence s, int start, int count, int after) {}

    @Override
    public void onTextChanged(CharSequence s, int start, int before, int count) {
        // user is typing: reset already started timer (if existing)
        if (timer != null) {
            timer.cancel();
        }
    }
};

 

This one looks good and eventually gets the job done. But, don’t you think for a simple stuff like this we are writing too much of code that is scary at the first sight.

Let’s try a second approach by using RxBinding for Android views and RxJava operators. RxBinding simply converts every event on the view to an Observable which can be subscribed and worked upon.

In place of TextWatcher here we use RxTextView.textChanges method that takes an EditText as a parameter. textChanges creates Observable of character sequences for text changes on view, here EditText. Then debounce operator of RxJava is used which drops observables till the timeout expires, a clean alternative for Timer. The second approach is implemented in updateSuggestions.

private void updateSuggestions() {
   Disposable disposable = RxTextView.textChanges(tweetSearchEditText) // generating observables for text changes
           .debounce(400, TimeUnit.MILLISECONDS) // droping observables i.e. text changes within 4 ms
           .subscribe(charSequence -> {
               if (charSequence.length() > 0) {
                   fetchSuggestion(); // suggestions obtained
               }
           });
   mCompositeDisposable.add(disposable);
}

 

CompositeDisposable is a bucket that contains Disposable objects, which are returned each time an Observable is subscribed to. So, all the disposable objects are collected in CompositeDisposable and unsubscribed when onStop of the fragment is called to avoid memory leaks.

A SwipeRefreshLayout is used, so that user can retry if the request fails or refresh to get new suggestions. When refreshing is not complete a circular ProgressDialog is shown and the RecyclerView showing old suggestions is made invisible, executed by setBeforeRefreshingState method

private void setBeforeRefreshingState() {
   refreshSuggestions.setRefreshing(true);
   tweetSearchSuggestions.setVisibility(View.INVISIBLE);
}

 

Similarly, once refreshing is done ProgessDialog is stopped and the visibility of RecyclerView which now contains the updated suggestions is changed to VISIBLE. This is executed in setAfterRefreshingState method

private void setAfterRefreshingState() {
   refreshSuggestions.setRefreshing(false);
   tweetSearchSuggestions.setVisibility(View.VISIBLE);
}

 

References:

Continue ReadingImplementing Tweet Search Suggestions in Loklak Wok Android

Using NodeJS modules of Loklak Scraper in Android

Loklak Scraper JS implements scrapers of social media websites so that they can be used in other platforms, like Android or in a native Java project. This way there will be only a single source of scraper, as a result it will be easier to update the scrapers in response to the change in websites. This blog explains how Loklak Wok Android, a peer for Loklak Server on Android platform uses the Twitter JS scraper to scrape tweets.

LiquidCore is a library available for android that can be used to run standard NodeJS modules. But Twitter scraper can’t be used directly, due to the following problems:

  • 3rd party NodeJS libraries are used to implement the scraper, like cheerio and request-promise-native and LiquidCore doesn’t support 3rd party libraries.
  • The scrapers are written in ES6, as of now LiquidCore uses NodeJS 6.10.2, which doesn’t support ES6 completely.

So, if 3rd party NodeJS libraries can be included in our scraper code and ES6 can be converted to ES5, LiquidCore can easily execute Twitter scraper.

3rd party NodeJS libraries can be bundled into Twitter scraper using Webpack and ES6 can be transpiled to ES5 using Babel.

The required dependencies can be installed using:

$npm install --save-dev webpack
$npm install --save-dev babel-core babel-loader babel-preset-es2015

Bundling and Transpiling

Webpack does bundling based on the configurations provided in webpack.config.js, present in root directory of project.

var fs = require('fs');

function listScrapers() {
   var src = "./scrapers/"
   var files = {};
   fs.readdirSync(src).forEach(function(data) {
       var entryName = data.substr(0, data.indexOf("."));
       files[entryName] = src+data;
   });
   return files;
}

module.exports = {
 entry: listScrapers(),
 target: "node",
 module: {
     loaders: [
         {
             loader: "babel-loader",
             test: /\.js?$/,
             query: {
                 presets: ["es2015"],
             }
         },
     ]
 },
 output: {
   path: __dirname + '/build',
   filename: '[name].js',
   libraryTarget: 'var',
   library: '[name]',
 }
};

 

Now let’s break the config file, the function listScrapers returns a JSONObject with key as name of scraper and value as relative location of scraper, ex:

{
   twitter: "./scrapers/twitter.js",
   github: "./scrapers/github.js"
   // same goes for other scrapers
}

The parameters in module.exports as described in the documentation of webpack for multiple inputs and to use the generated output externally:

  • entry: Since a bundle file is required for each scraper we provide the  the JSONObject returned by listScrapers function. The multiple entry points provided generate multiple bundled files.
  • target: As the bundled files are to be used in NodeJS platform,  “node” is set here.
  • module: Using webpack the code can be directly transpiled while bundling, the end users don’t need to run separate commands for transpiling. module contains babel configurations for transpiling.
  • output: options here customize the compilation of webpack
    • path: Location where bundled files are kept after compilation, “__dirname” means the current directory i.e. root directory of the project.
    • filename: Name of bundled file, “[name]“ here refers to the key of JSONObject provided in entry i.e. key of JSONObect returned from listScrapers. Example for Twitter scraper, the filename of bundled file will be “twitter.js”.
    • libraryTarget: by default the functions or methods inside bundled files can’t be used externally – can’t be imported. By providing the “var” the functions in bundled module can be accessed.
    • library: the name of the library.

Now, time to do the compilation work:

$ ./node_modules/.bin/webpack

The bundled files can be found in build directory. But, the generated bundled files are large files – around 77,000 lines. Large files are not encouraged for production purposes. So, a “-p” flag is used to generate bundled files for production – around 400 lines.

$ ./node_modules/.bin/webpack -p

Using LiquidCore to execute bundled files

The generated bundled file can be copied to the raw directory in res (resources directory in Android). Now, events are emitted from Activity/Fragment and in response to those events the scraping function is invoked in the bundled JS file, present in raw directory, the vice-versa is also possible.

So, we handle some events in our JS file and send some events to the android Activity/Fragment. The event handling and event creating code in JS file:

var query = "";
LiquidCore.on("queryEvent", function(msg) {
  query = msg.query;
});

LiquidCore.on("fetchTweets", function() {
  var twitterScraper = new twitter();
  twitterScraper.getTweets(query, function(data) {
    LiquidCore.emit("getTweets", {"query": query, "statuses": data});
  });
});

LiquidCore.emit('start');

 

First a “start” event is emitted from JS file, which is consumed in TweetHarvestingFragment by getScrapedTweet method using startEventListener.

EventListener startEventListener = (service, event, payload) -> {
   JSONObject jsonObject = new JSONObject();
   try {
       jsonObject.put("query", query);
       service.emit(LC_QUERY_EVENT, jsonObject); // value of LC_QUERY_EMIT is  "queryEvent"
   } catch (JSONException e) {
       Log.e(LOG_TAG, e.toString());
   }
   service.emit(LC_FETCH_TWEETS_EVENT); //value of  LC_FETCH_TWEETS_EVENT is  "fetchTweets"
};

 

The startEventListener then emits “queryEvent” with a JSONObject that contains the query to search tweets for scraping. This event is consumed in JS file by:

var query = "";
LiquidCore.on("queryEvent", function(msg) {
  query = msg.query;
});

 

After “queryEvent”, “fetchTweets” event is emitted from fragment, which is handled in JS file by:

LiquidCore.on("fetchTweets", function() {
  var twitterScraper = new twitter(); // scraping object is created
  twitterScraper.getTweets(query, function(data) { // function that scrapes twitter
    LiquidCore.emit("getTweets", {"query": query, "statuses": data});
  });
});

 

Once the scraped data is obtained, it is sent back to fragment by emitting “getTweets” event from JS file, “{“query”: query, “statuses”: data}” contains scraped data. This event is consumed in android by getTweetsEventListener.

EventListener getTweetsEventListener = (service, event, payload) -> { // payload contains scraped data
   Push push = mGson.fromJson(payload.toString(), Push.class);
   emitter.onNext(push);
};

 

LiquidCore creates a NodeJS instance to execute the bundled JS file. The NodeJS instance is called MicroService in LiquidCore terminology. For all this event handling to work, the NodeJS instance is created inside the method with a ServiceStartListner where all EventListener are added.

MicroService.ServiceStartListener serviceStartListener = (service -> {
   service.addEventListener(LC_START_EVENT, startEventListener);
   service.addEventListener(LC_GET_TWEETS_EVENT, getTweetsEventListener);
});
URI uri = URI.create("android.resource://org.loklak.android.wok/raw/twitter"); // Note .js is not used
MicroService microService = new MicroService(getActivity(), uri, serviceStartListener);
microService.start();

Resources

Continue ReadingUsing NodeJS modules of Loklak Scraper in Android

Resource Injection Using ButterKnife in Loklak Wok Android

Loklak Wok Android being a sophisticated Android app uses a lot of views, and of those most are manipulated at runtime. In Android to play with a View or ViewGroup defined in XML at runtime requires developers to add the following line:

(TypeOfView) parentView.findViewById(R.id.id_of_view);

 

This leads to lengthy code. And very often, more than one Views respond to a particular event. For example, hiding Views if a network request fails and showing a message to the user to “Try Again!”. Let’s say you have to hide 4 Views, are you going to do the following:

view1.setVisibility(View.GONE);
view2.setVisibility(View.GONE);
view3.setVisibility(View.GONE);
view4.setVisibility(View.GONE);
textView.setVisibility(View.VISIBLE); // has "Try Again!" message.

// more 5 lines of code when hiding textView and displaying 4 other Views

 

Surely not! And the old fashioned way to get a string value defined as a resource in string.xml

String appName = getActivity().getResources().getString(R.id.app_name);

 

Surely, all this works good, but being a developer while working on a sophisticated app you would like to focus on the logic of the app, rather than scratching your head to debug whether you properly did a findViewById or not, did you typecast it to the proper View, or where did you miss to change the visibility of a view in response to an event.

Well, all of this can be easily handled by using a library which provides you the dependency, here resources. All you need to do is just declare your resources, and that’s it, the library provides the resources to you, yes you don’t need to initialize it using findViewById. So let’s dive in and see how ButterKnife is used in Loklak Wok Android to handle these issues.

Adding ButterKnife to Android Project

In the app/build.gradle:

dependencies {
    compile 'com.jakewharton:butterknife:8.6.0'
    annotationProcessor 'com.jakewharton:butterknife-compiler:8.6.0'
    ...
}

Dealing with Views in Fragments

When views are declared, BindView annotation is used with its parameter as the ID of the view, for example, views in TweetHarvestingFragment :

@BindView(R.id.toolbar)
Toolbar toolbar;
@BindView(R.id.harvested_tweets_count)
TextView harvestedTweetsCountTextView;
@BindView(R.id.harvested_tweets_container)
RecyclerView recyclerView;
@BindView(R.id.network_error)
TextView networkErrorTextView;

NOTE: Views declared can’t be private.

Once Views are declared, then it needs to be injected, it is done using ButterKnife.bind(Object target, View Source). Here in TweetHarvestingFragment the target will be the fragment itself and source i.e. the parent view will be rootView (obtained by inflating the layout file of fragment). All this needs to be done in onCreateView method

View rootView = inflater.inflate(R.layout.fragment_tweet_harvesting, container, false);
ButterKnife.bind(this, rootView);

That’s it, we are done!

The same paradigm can be used to bind views to a ViewHolder of a RecyclerView, as implemented in HarvestTweetViewHolder:

@BindView(R.id.user_fullname)
TextView userFullname;
@BindView(R.id.username)
TextView username;
@BindView(R.id.tweet_date)
TextView tweetDate;
@BindView(R.id.harvested_tweet_text)
TextView harvestedTweetTextView;

public HarvestedTweetViewHolder(View itemView) {
   super(itemView);
   ButterKnife.bind(this, itemView);
}

 

Injecting resources like strings, dimensions, colors, drawables etc. is even easier, only the related annotation and ID needs to be provided. Example the string app_name is used in TweetHarvestingFragment to display the app name i.e. “Loklak Wok” in toolbar

@BindString(R.string.app_name)
String appName;

// directly used inside onCreateView to set the title in toolbar
toolbar.setTitlet(appName);

 

Using ButterKnife onClickListeners can be implemented in separate method, a clean way to define click events rather than polluting onCreate(in Activity) or onCreateView(in Fragment), as implemented in SuggestFragment

@OnClick(R.id.clear_image_button)
public void onClickedClearImageButton(View view) {
   tweetSearchEditText.setText("");
}

 

Multiple Views responding to a single Event

Using @BindViews annotation a list of multiple views can be created, and then a common ButterKnife.Action can be defined to act on the list of views. In TweetHarvestingFragment visibility of some views are changed if a network request fails or succeeds using this feature of ButterKnife:

// declaring List of Views
@BindViews({
       R.id.harvested_tweets_count,
       R.id.harvested__tweet_count_message,
       R.id.harvested_tweets_container})
List<View> networkViews;

// defining action for views
private final ButterKnife.Action<View> VISIBLE = (view, index) -> view.setVisibility(View.VISIBLE);
private final ButterKnife.Action<View> GONE = (view, index) -> view.setVisibility(View.GONE);

// applying action on the views
ButterKnife.apply(networkViews, VISIBLE);
ButterKnife.apply(networkViews, GONE);

Resources

Continue ReadingResource Injection Using ButterKnife in Loklak Wok Android

Editing files and piped data with “sed” in Loklak server

What is sed ?

“sed” is used in “Loklak Server” one of the most popular projects of FOSSASIA. “sed” is acronym for “Stream Editor” used for filtering and transforming text, as mentioned in the manual page of sed. Stream can be a file or input from a pipeline or even standard input. Regular expressions are used to filter the text and transformation are carried out using sed commands, either inline or from a file. So, most of the time writing a single line does the work of text substitution, removal or to obtaining a value from a text file.

Basic Syntax of “sed”

$sed [options]... {inline commands or file having sed commands} [input_file]...

Loklak Server uses a config.properties file – contains key-value pairs – which is the basis of the server as it contains configuration values, used by the server during runtime for various operations.

Let’s go through a simple sed example that prints line containing word “https” at the beginning in the config.properties file.

$sed -n '/^https/p' config.properties

Here “-n” option suppresses automatically printing of pattern space (pattern space is where each line is put that is to be processed by sed). Without “-n” option sed will print the whole file.

Now, the regular expression part,  “/^https” matches all the lines that has “https” at the start of line and “/p” is print command to print the output in console. Finally we provide the filename i.e. config.properties. If filename is not provided then sed waits for input from standard input.

Use cases of “sed” in Loklak Server

  • Displaying proper port number in message while starting or installing Loklak Server

The default port of loklak server is port number 9000, but it can be started in any non-occupied port by using “-p” flag with bin/start.sh and bin/installation.sh like

$ bin/installation.sh -p 8888

starts installation of Loklak Server in port 8888. To display the proper localhost address so that user can open it in a browser the port number in shortlink.urlstub parameter in config.properties needs to be changed. This is carried out by the function change_shortlink_urlstub in bin/utility.sh. The function is defined as

Now let’s try to understand what the sed command is doing.

“-i” option is used for in-place editing of the specified file i.e. config.properties in conf directory.

s” is substitute command of sed. The regular expression can be divided into two parts, between “/”:

  1. \(shortlink\.urlstub=http:.*:\)\(.*\) this is used to find the match in a line.
  2. \1′”$1″ is used to substitute the matched string in part 1.

The regular expressions can be split into groups so that operations can be performed on each group separately. A group is enclosed between “\(“ and “\)”. In our 1st part of regular expressions, there are two groups.

Dissecting the first group i.e. \(shortlink\.urlstub=http:.*:\):

  • shortlink\.urlstub=http:” will match the expression “shortlink.urlstub=http:”, here “\” is used as an escape sequence as “.” in regex represents any character.
  • “.*:”, “.” represents any character and “*” represents 0 or more characters of the previous character. So, it will basically match any expression. But, it ends with a “:”, which means any expression that ends with a “:”. Thus it matches the expression “//localhost:”.

So, the first group collectively matches the expression “shortlink.urlstub=http://localhost:”.

As described above second group i.e. \(.*\) will match any expression, here matches “9000”.

Now coming to the 2nd part of regular expression i.e. \1′”$1″:

  • “\1” represents the match of the first group in 1st part i.e. “shortlink.urlstub=http://localhost:”.
  • “$1” is the value of the first parameter provided to our function “change_shortlink_urlstub” which is an unused port where we want to run Loklak Server.

So 2nd part picks up the match from the first group and concatenates with the first parameter of the function. Assuming the first parameter to the function is “8888”, the expression for 2nd part becomes “shortlink.urlstub=http://localhost:8888” which replaces “shortlink.urlstub=http://localhost:9000”.

So a correct localhost address is displayed in the console while starting or installing Loklak Server.

  • Extracting value of a key from config.properties

“grep” and “sed” are used to extract the values of key from config.properties in bash scripts e.g. extracting default port, value of “port.http” in bin/start.sh, value of “shortlink.urlstub” in bin/start.sh and bin/installation.sh.

$ grep -iw 'port.http' conf/config.properties | sed 's/^[^=]*=//'

Here grep is used to filter a line and pass the filtered line to sed by piping the output. “i” flag is for ignoring case sensitivity and “w” flag is used for matching of word only. The output of

$ grep -iw 'port.http' conf/config.properties
port.http=9000

The aim is to get “9000”, the value of “port.http”. The approach used here is to substitute “port.http=” in the output of grep command above with a string of zero characters, that way only “9000” remains.

Let’s deconstruct the “sed” part, s/^[^=]*=//:

s” command of sed is used for substitution.  Here 1st part is “^[^=]*=” and 2nd part is nothing, as no characters are enclosed within “//”.

  • “^” means to match at the start.
  • [^characters] represents not to consider a set of characters. For matching a set of characters “^” is not used inside square brackets, [characters]. Here [^=] means not to include equal to – “=” symbol – and “*” after that makes it to match characters that are not “=”.

So “^[^=]*=” matches a sequence of characters that doesn’t start with “=” and followed by “=”. Thus the matched expression in this case is “http.port=” (“h” is not “=” and it ends with “=”), which is substituted by a string with zero characters which leaves “9000”.  Finally, “9000” is assigned to the required variable is bash script.

Continue ReadingEditing files and piped data with “sed” in Loklak server

Scraping in JavaScript using Cheerio in Loklak

FOSSASIA recently started a new project loklak_scraper_js. The objective of the project is to develop a single library for web-scraping that can be used easily in most of the platforms, as maintaining the same logic of scraping in different programming languages and project is a headache and waste of time. An obvious solution to this was writing scrapers in JavaScript, reason JS is lightweight, fast, and its functions and classes can be easily used in many programming languages e.g. Nashorn in Java.

Cheerio is a library that is used to parse HTML. Let’s look at the youtube scraper.

Parsing HTML

Steps involved in web-scraping:

  1. HTML source of the webpage is obtained.
  2. HTML source is parsed and
  3. The parsed HTML is traversed to extract the required data.

For 2nd and 3rd step we use cheerio.

Obtaining the HTML source of a webpage is a piece of cake, and is done by function getHtml, sync-request library is used to send the “GET” request.

Parsing of HTML can be done using the load method by passing the obtained HTML source of the webpage, as in getSearchMatchVideos function.

var $ = cheerio.load(htmlSourceOfWebpage);

 

Since, the API of cheerio is similar to that of jquery, as a convention the variable to reference cheerio object which has parsed HTML is named “$”.

Sometimes, the requirement may be to extract data from a particular HTML tag (the tag contains a large number of nested children tags) rather than the whole HTML that is parsed. In that case, again load method can be used, as used in getVideoDetails function to obtain only the head tag.

var head = cheerio.load($("head").html());

html” method provides the html content of the selected tag i.e. <head> tag. If a parameter is passed to the html method then the content of selected tag (here <head>) will be replaced by the html of new parameter.

Extracting data from parsed HTML

Some of the contents that we see in the webpage are dynamic, they are not static HTML. When a “GET” request is sent the static HTML of webpage is obtained. When Inspect element is done it can be seen that the class attribute has different value in the webpage we are using than the static HTML we obtain from “GET” request using getHtml function. For example, inspecting the link of one of suggested videos, see the different values of class attribute :

 

  • In website (for better view):

  • In static HTML, obtained from “GET” request using getHtml function (for better view):

So, it is recommended to do a check first, whether attributes have same values or not, and then proceed accordingly.

Now, let’s dive into the actual scraping stuff.

As most of the required data are available inside head tag in meta tag. extractMetaAttribute function extracts the value of content attribute based on another provided attribute and its value.

function extractMetaAttribute(cheerioObject, metaAttribute, metaAttributeValue) {
	var selector = 'meta[' + metaAttribute + '="' + metaAttributeValue + '"]';
	return cheerioFunction(selector).attr("content");
}

cheerioObject” here will be the “head” object created above.

For example, our final JSONObject contains a og_url key-value pair, to get that we need to obtain the following html element.

<meta property="og:url" content="https://www.youtube.com/watch?v=KVGRN7Z7T1A">

 

This can be obtained by:

  1. Writing a selector for property attribute of meta. The selector would be ‘meta[property=”og:url”]’.
  2. The selector is passed to cheerioObject.
  3. Then attr method is used to obtain the value of content attribute.
  4. Finally, we set the obtained value of content attribute as the value of JSONObject’s key.

Similarly og:site_name, og:url and other values can be extracted, which in the final JSONObject would be the value of keys og_site_name, og_url and similarly. Since, a lot of data needs to be extracted this way, the extractMetaAttribute function generalizes it, where metaAttribute is “property” and metaAttributeValue is “og:url” in the above example.

If one parameter is provided in attr method, then it is used as a getter method, the value of that attribute is returned. If two parameters are provided then first parameter is the name of attribute and second parameter is the value of attribute, in this case it is used as a setter method.

Now, what if the provided selector matches more than one html element and we need to extract data or perform some operations on all of them. The answer is using each method on the cheerio Object, it iterates over the matched elements and executes the passed function – as a parameter – on them. The passed function has two parameters, the index of matched element and the matched element itself. To break out of the loop early, false is returned.

One of the use case of each method in youtube scraper is to extract related “tags” of the video.

Selector for this would be ‘meta[property=”og:video:tag”]’ and as it is inside a head tag, we can use the already created head tag. Applying the each method, it becomes:

head('meta[property="og:video:tag"]').each(function(i, element) {
    // the logic goes here
});

 

Here for the first iteration the value of “i” will be “0” and “element” will be

<meta property="og:video:tag" content="Iggy">

 

and so on. We need to obtain the value of content attribute, so we can use attr method as used above. Finally all the values are pushed to an array. Hence, the final code snippet with logic.

var ary = [];
head('meta[property="og:video:tag"]').each(function(i, element) {
    ary.push(head(element).attr("content"));
});

 

The same functionality is implemented in extractMetaProperties method.

function extractMetaProperties(cheerioObj, metaProperty) {
	var properties = [];
	var selector = 'meta[property="' + metaProperty + '"]';
	cheerioObj(selector).each(function(i, element) {
		properties.push(cheerioObj(element).attr("content"));
	});
	return properties;}
Continue ReadingScraping in JavaScript using Cheerio in Loklak

Implementing Loklak APIs in Java using Reflections

Loklak server provides a large API to play with the data scraped by it. Methods in java can be implemented to use these API endpoints. A common approach of implementing the methods for using API endpoints is to create the request URL by taking the values passed to the method, and then send GET/POST request. Creating the request URL in every method can be tiresome and in the long run maintaining the library if implemented this way will require a lot of effort. For example, assume a method is to be implemented for suggest API endpoint, which has many parameters, for creating request URL a lot of conditionals needs to be written – whether a parameter is provided or not.

Well, the methods to call API endpoints can be implemented with lesser and easy to maintain code using Reflection in Java. The post ahead elaborates the problem, the approach to solve the problem and finally solution which is implemented in loklak_jlib_api.

Let’s say, the status API endpoint needs to be implemented, a simple approach can be:

public class LoklakAPI {
    
    public static String status(String baseUrl) {
        String requestUrl = baseUrl   "/api/status.json";
        
        // GET request using requestUrl 
    }

    public static void main(String[] argv) {
        JSONObject result = status("https://api.loklak.org");
    }
}

This one is easy, isn’t it, as status API endpoint requires no parameters. But just imagine if a method implements an API endpoint that has a lot of parameters, and most of them are optional parameters. As a developer, you would like to provide methods that cover all the parameters of the API endpoint. For example, how a method would look like if it implements suggest API endpoint, the old SuggestClient implementation in loklak_jlib_api does that:

public static ResultList<QueryEntry> suggest(
           final String hostServerUrl,
           final String query,
           final String source,
           final int count,
           final String order,
           final String orderBy,
           final int timezoneOffset,
           final String since,
           final String until,
           final String selectBy,
           final int random) throws JSONException, IOException {
       ResultList<QueryEntry>  resultList = new ResultList<>();
       String suggestApiUrl = hostServerUrl
               + SUGGEST_API
               + URLEncoder.encode(query.replace(' ', '+'), ENCODING)
               + PARAM_TIMEZONE_OFFSET + timezoneOffset
               + PARAM_COUNT + count
               + PARAM_SOURCE + (source == null ? PARAM_SOURCE_VALUE : source)
               + (order == null ? "" : (PARAM_ORDER + order))
               + (orderBy == null ? "" : (PARAM_ORDER_BY + orderBy))
               + (since == null ? "" : (PARAM_SINCE + since))
               + (until == null ? "" : (PARAM_UNTIL + until))
               + (selectBy == null ? "" : (PARAM_SELECT_BY + selectBy))
               + (random < 0 ? "" : (PARAM_RANDOM + random))
               + PARAM_MINIFIED + PARAM_MINIFIED_VALUE;
 
                // GET request using suggestApiUrl
   }
}

A lot of conditionals!!! The targeted users may also get irritated if they need to provide all the parameters every time even if they don’t need them. The obvious solution to that is overloading the methods. But,  then again for each overloaded method, the same repetitive conditionals need to be written, a form of code duplication!! And what if you have to implement some 30 API endpoints and in future maintaining them, a single thought of it is scary and a nightmare for the developers.

Approach to the problem

To reduce the code size, one obvious thought that comes is to implement static methods that send GET/POST requests provided request URL and the required data. These methods are implemented in loklak_jlib_api by JsonIO and NetworkIO.

You might have noticed the method name i.e. status, suggest is also the part of the request URL. So, the common pattern that we get is:

base_url + "/api/" + methodName + ".json" + "?" + firstParameterName + "=" + firstParameterValue + "&" + secondParameterName + "=" + secondParameterValue ...

Reflection API to rescue

Using Reflection API inspection of classes, interfaces, methods, and variables can be done, yes you guessed it right if we can inspect we can get the method and parameter names. It is also possible to implement interfaces and create objects at runtime.

So, an interface is created and API endpoint methods are defined in it. Using reflection API the interface is implemented lazily at runtime. LoklakAPI interface defines all the API endpoint methods. Some of the defined methods are:

public interface LoklakAPI {
 
    @Target(ElementType.METHOD)
    @Retention(RetentionPolicy.RUNTIME)
    @interface GET {}
 
    @Target(ElementType.METHOD)
    @Retention(RetentionPolicy.RUNTIME)
    @interface POST {}
 
    @GET
    JSONObject search(String q);
 
    @GET
    JSONObject search(String q, int count);
 
    @GET
    JSONObject search(String q, int timezoneOffset, String since, String until);
 
    @GET
    JSONObject search(String q, int timezoneOffset, String since, String until, int count);
 
    @GET
    JSONObject peers();
 
    @GET
    JSONObject hello();
 
    @GET
    JSONObject status();
 
    @POST
    JSONObject push(JSONObject data);
}

Here GET and POST annotations are used to mark the methods which use GET and POST request respectively, so that appropriate method for GET and POST static method can be used JsonIOloadJson for GET request and pushJson for POST request.

A private static inner class ApiInvocationHandler is created in APIGenerator, which implements InvocationHandler – used for implementing interfaces at runtime.

private static class ApiInvocationHandler implements InvocationHandler {
 
   private String mBaseUrl;
 
   public ApiInvocationHandler(String baseUrl) {
       this.mBaseUrl = baseUrl;
   }
 
   @Override
   public Object invoke(Object o, Method method, Object[] values) throws Throwable {
       Parameter[] params = method.getParameters();
       Object[] paramValues = values;
 
       /*
       format of annotation name:
       @packageWhereAnnotationIsDeclared$AnnotationName()
       Example: @org.loklak.client.LoklakAPI$GET()
       */
       Annotation annotation = method.getAnnotations()[0];
       String annotationName = annotation.toString().toLowerCase();
 
       String apiUrl = createGetRequestUrl(mBaseUrl, method.getName(),
               params, paramValues);
       if (annotationName.contains("get")) { // GET REQUEST
           return loadJson(apiUrl);
       } else { // POST REQUEST
           JSONObject jsonObjectToPush = (JSONObject) paramValues[0];
           String postRequestUrl = createPostRequestUrl(mBaseUrl, method.getName());
           return pushJson(postRequestUrl, params[0].getName(), jsonObjectToPush);
       }
   }
}

The base URL is provided while creating the object, in the constructor. The invoke method is called whenever a defined method in the interface is called in actual code. This way an interface is implemented at runtime. The parameters of invoke method:

  • Object o – the object on which to call the method.
  • Method method – method which is called, parameters and annotations of the method are obtained using getParameters and getAnnotations methods respectively which are required to create our request url.
  • Object[] values – an array of parameter values which are passed to the method when calling it.

createGetRequestUrl is used to create the request URL for sending GET requests.

private static String createGetRequestUrl(
       String baseUrl, String methodName, Parameter[] params, Object[] paramValues) {
   String apiEndpointUrl = baseUrl.replaceAll("/+$", "") + "/api/" + methodName + ".json";
   StringBuilder url = new StringBuilder(apiEndpointUrl);
   if (params.length > 0) {
       String queryParamAndVal = "?" + params[0].getName() + "=" + paramValues[0];
       url.append(queryParamAndVal);
       for (int i = 1; i < params.length; i++) {
           String paramAndVal = "&" + params[i].getName()
                   + "=" + String.valueOf(paramValues[i]);
           url.append(paramAndVal);
       }
   }
   return url.toString();
}

Similarly, createPostRequestUrl is used to create request URL for sending POST requests.

private static String createPostRequestUrl(String baseUrl, String methodName) {
   baseUrl = baseUrl.replaceAll("/+$", "");
   return baseUrl + "/api/" + methodName + ".json";
}

Finally, Proxy.newProxyInstance is used to implement the interface at runtime. An instance of ApiInvocationHandler class is passed to the newProxyInstance method. This is all done by static method createApiMethods.

public static <T> T createApiMethods(Class<T> service, final String baseUrl) {
   ApiInvocationHandler apiInvocationHandler = new ApiInvocationHandler(baseUrl);
   return (T) Proxy.newProxyInstance(
           service.getClassLoader(), new Class<?>[]{service}, apiInvocationHandler);
}

Where:

  • Class<T> service – is the interface where API endpoint methods are defined, here LoklakAPI.class.
  • String baseUrl – web address of the server where Loklak server is hosted.

NOTE: For all this to work the library must be build using “-parameters” flag, else parameter names can’t be obtained. Since loklak_jlib_api uses maven, “-parameters” is provided in pom.xml file.

A small example:

String baseUrl = "https://api.loklak.org";
LoklakAPI loklakAPI = APIGenerator.createApiMethods(LoklakAPI.class, baseUrl);
 
JSONObject searchResult = loklakAPI.search("FOSSAsia");
Continue ReadingImplementing Loklak APIs in Java using Reflections