Vibhor Verma – blog.fossasia.org

Indexing for multiscrapers in Loklak Server

Post author:Vibhor Verma
Post published:September 4, 2017
Post category:FOSSASIA loklak

I recently added a multiscraper system that can collect data from web scrapers such as YoutubeScraper, QuoraScraper and GithubScraper. Since scraping is a costly task, it is important to improve its efficiency. One approach is to index the data in a cache. TwitterScraper already uses multiple sources to optimize its efficiency.

This system uses the Post message-holder object to store data and PostTimeline (a specialized iterator) to iterate over the data objects. Because these data structures differ from those used by TwitterScraper, indexing the data in ElasticSearch needs a different approach (currently under review).

These are the changes I made while implementing data indexing in the project.

1) Writing of data is invoked only through the PostTimeline iterator

In TwitterScraper, the data is written in the message holder TwitterTweet, so all tweets are written to the index as they are created. Here, writing of the posts is initiated only once the data has been scraped. Scraping is a heavy process, so this approach keeps resource usage lower under average traffic on the server.

protected Post putData(Post typeArray, String key, Timeline2 postList) {
   if(!"cache".equals(this.source)) {
       postList.writeToIndex();
   }
   return this.putData(typeArray, key, postList.toArray());
}

2) One object for holding a message

During the implementation, I kept the same message holder Post and post iterator PostTimeline all the way from scraping to indexing. This keeps the structure uniform. The earlier approach involved different types of message wrappers along the way. The new approach cuts out the extra looping and conversion between data structures.

3) Index a list, not a message

In TwitterScraper, individual messages are enqueued in bulk to be indexed. In this approach, I enqueue complete lists instead. This delays indexing until the scraper has finished processing the HTML.

Creating the queue of postlists:

// Add post-lists to queue to be indexed
queueClients.incrementAndGet();
try {
    postQueue.put(postList);
} catch (InterruptedException e) {
    DAO.severe(e);
}
queueClients.decrementAndGet();

Indexing of the posts in postlists:

// Start indexing of data in post-lists
for (Timeline2 postList: postBulk) {
    if (postList.size() < 1) continue;
    if(postList.dump) {
        // Dumping of data in a file
        writeMessageBulkDump(postList);
    }
    // Indexing of data to ElasticSearch
    writeMessageBulkNoDump(postList);
}
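
The two snippets above can be tried out in isolation with a small stand-in. The following is only a sketch under stated assumptions: PostListQueueSketch, enqueue and indexPending are hypothetical names, posts are plain strings, and the "index" is an in-memory list rather than ElasticSearch. It shows the idea of queueing whole lists and indexing them in one pass later.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical stand-in for the post-list queue; these are not Loklak's classes.
class PostListQueueSketch {
    // Unbounded queue that holds complete post-lists, not single posts.
    private final BlockingQueue<List<String>> postQueue = new LinkedBlockingQueue<>();

    // Producer side: a scraper hands over one finished list.
    void enqueue(List<String> postList) {
        postQueue.offer(postList); // always succeeds on an unbounded queue
    }

    // Consumer side: take every waiting list and "index" the non-empty ones.
    // Returns the number of posts written in this pass.
    int indexPending(List<String> index) {
        List<List<String>> postBulk = new ArrayList<>();
        postQueue.drainTo(postBulk);
        int written = 0;
        for (List<String> postList : postBulk) {
            if (postList.isEmpty()) continue; // same guard as size() < 1
            index.addAll(postList);
            written += postList.size();
        }
        return written;
    }
}
```

Because a list is enqueued only after its scraper has finished, the indexing side never sees a half-built list.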

4) Categorizing the input parameters

When searching the index, I divided the scraper’s query parameters into 3 categories. Each input parameter is added to one of these categories (implemented as maps), and the data fetched depends on them. All three categories feed one BoolQueryBuilder, which is declared first:

// Declaring the QueryBuilder
BoolQueryBuilder query = new BoolQueryBuilder();

a) Get the parameter: get the results that match the input fields in the map getMap.

// Result must have these fields. Acts as AND operator
if(getMap != null) {
    for(Map.Entry<String, String> field : getMap.entrySet()) {
        query.must(QueryBuilders.termQuery(
                field.getKey(), field.getValue()));
    }
}

b) Don’t get the parameter: exclude the results that match the input fields in the map notGetMap.

// Result must not have these fields.
if(notGetMap != null) {
    for(Map.Entry<String, String> field : notGetMap.entrySet()) {
        query.mustNot(QueryBuilders.termQuery(
                field.getKey(), field.getValue()));
    }
}

c) Get if possible: prefer the results that match the input fields in the map mayAlsoGetMap, if they are present in the index.

// Result may preferably also get these fields. Acts as OR operator
if(mayAlsoGetMap != null) {
    for(Map.Entry<String, String> field : mayAlsoGetMap.entrySet()) {
        query.should(QueryBuilders.termQuery(
                field.getKey(), field.getValue()));
    }
}
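
To make the three categories concrete without a running ElasticSearch node, here is a plain-Java sketch of the same matching rules applied to in-memory documents. It is an illustration, not the ElasticSearch API: BoolMatchSketch, accepts and preference are hypothetical names, documents are simple string maps, and the "should" category is simplified to a preference count instead of a relevance score.

```java
import java.util.Map;

// Plain-Java illustration of the three categories; not the ElasticSearch API.
class BoolMatchSketch {
    // Counts how many entries of 'fields' the document matches exactly.
    static int countMatches(Map<String, String> doc, Map<String, String> fields) {
        int matches = 0;
        for (Map.Entry<String, String> field : fields.entrySet()) {
            if (field.getValue().equals(doc.get(field.getKey()))) matches++;
        }
        return matches;
    }

    // must: every getMap entry matches; mustNot: no notGetMap entry matches.
    static boolean accepts(Map<String, String> doc,
                           Map<String, String> getMap,
                           Map<String, String> notGetMap) {
        return countMatches(doc, getMap) == getMap.size()
                && countMatches(doc, notGetMap) == 0;
    }

    // should: matches are optional here and only raise a document's preference.
    static int preference(Map<String, String> doc, Map<String, String> mayAlsoGetMap) {
        return countMatches(doc, mayAlsoGetMap);
    }
}
```

A document that passes accepts would be returned, and a higher preference value would place it earlier in the results.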

With these changes, the scrapers shift from indexing individual messages to indexing lists of messages. This keeps the load on RAM low, but the aggregation of the latest scraped data may be affected, so a workaround will be needed during scraping itself.

References

Match query with “operator”:”and” via the Java API: https://discuss.elastic.co/t/match-query-with-operator-and-via-the-java-api/67863/2
How to use BoolQueryBuilder: https://stackoverflow.com/questions/40923945/how-to-add-bool-query-inside-a-should-must-method-in-java-api

Setting Loklak Server with SSL

Post author:Vibhor Verma
Post published:September 2, 2017
Post category:FOSSASIA loklak Open Event

Loklak Server is based on an embedded Jetty server, which can work with or without SSL encryption. Recently there was a need to set up Loklak Server with SSL. That need was met by CloudFlare, but there are 2 other ways to set up Loklak Server with SSL:

1) Default Jetty Implementation

There is a pre-existing implementation based on the Jetty libraries. The http mode can be set in the configuration file. Loklak Server can work in 4 modes: http mode, https mode, only https mode and redirect to https mode. Loklak Server listens on port 9000 in http mode and on port 9443 in https mode.

An SSL certificate is also needed, and it has to be added in the configuration file.

2) Getting SSL certificate with Kube-Lego on Kubernetes Deployment

I learned about Kube-Lego from @niranjan94. It is used in Open-Event-Orga-Server. The approach is to use:

a) Nginx as ingress controller

To set up the Nginx ingress controller, a yml file is needed that downloads and configures the server.

The configurations for data requests and responses are:

proxy-connect-timeout: "15"
proxy-read-timeout: "600"
proxy-send-timeout: "600"
hsts-include-subdomains: "false"
body-size: "64m"
server-name-hash-bucket-size: "256"
server-tokens: "false"

Nginx is configured to work on both the http and https ports in service.yml:

ports:
- port: 80
  name: http
- port: 443
  name: https

b) Kube-Lego for fetching SSL certificates from Let’s Encrypt

Kube-Lego was set up with the default values in its yml. It uses the host name, email address and secret name of the deployment to validate the URL and fetch an SSL certificate from Let’s Encrypt.

c) Setup configurations related to TLS and no-TLS connection

These configuration files specify the path and service ports for the Nginx server, through which requests are forwarded to the backend Loklak Server. For both no-TLS and TLS requests, the requests are forwarded directly to port 80 of the Loklak Server container.

rules:
- host: staging.loklak.org
  http:
    paths:
    - path: /
      backend:
        serviceName: server
        servicePort: 80

For TLS requests, the secret name is also specified. Kube-Lego reads the host name and secret name from here for the certificate:

tls:
- hosts:
  - staging.loklak.org
  secretName: loklak-api-tls

d) Loklak Server, ElasticSearch and Mosquitto at backend

These containers run at the backend. ElasticSearch and Mosquitto are accessible only to Loklak Server, and Loklak Server can be accessed through the Nginx server. Loklak Server is configured to run in http mode and is exposed at port 80.

ports:
- port: 80
  protocol: TCP
  targetPort: 80

To deploy Loklak Server, all of these are deployed in separate pods, and they interact through service ports. To deploy, we use a deploy script:

# For elasticsearch, accessible only to api-server
kubectl create -R -f ${path-to-config-file}/elasticsearch/

# For mqtt, accessible only to api-server
kubectl create -R -f ${path-to-config-file}/mosquitto/

# Start KubeLego deployment for TLS certificates
kubectl create -R -f ${path-to-config-file}/lego/
kubectl create -R -f ${path-to-config-file}/nginx/

# Create web namespace, this acts as bridge to Loklak Server
kubectl create -R -f ${path-to-config-file}/web/

# Create API server deployment and expose the services
kubectl create -R -f ${path-to-config-file}/api-server/

# Get the IP address of the deployment to be used
kubectl get services --namespace=nginx-ingress

References

kube-lego with GCE ingress controller: https://github.com/jetstack/kube-lego/tree/master/examples/gce
What’s the difference between SSL, TLS, and HTTPS: https://security.stackexchange.com/questions/5126/whats-the-difference-between-ssl-tls-and-https
Standalone HTTPS with Jetty: https://wiki.opennms.org/wiki/Standalone_HTTPS_with_Jetty

Fetching Metadata in Loklak Server

Post author:Vibhor Verma
Post published:September 1, 2017
Post category:FOSSASIA loklak

In Loklak Server the multiscrapers are working fine, but there was a need to set up a metadata framework to be embedded with the data. The metadata reports the parameters passed, the number of hits made on the webpage to fetch results and the number of results returned.

There is no metadata framework for TwitterScraper. Metadata is collected, but there are 2 issues:

1) The metadata fields are fed in directly while outputting data.

2) Every scraper has different metadata fields, or none.

To improve this for the multiscraper system, I embedded metadata by configuring it in the BaseScraper class and in the PostTimeline iterator. If the metadata is collected in BaseScraper itself, developers working on scrapers no longer need to be concerned with it and can concentrate on improving the scrapers.

These are the changes I made in the code:

1) Input Get-Parameters

For scrapers, one of the metadata fields is the input parameters. I added them directly to the metadata block.

protected Post getMetadata() {
    Post metadata = new Post(true);
    metadata.put("hits", this.hits);
    metadata.put("count", this.count);
    metadata.put("scraper", this.scraperName);
    metadata.put("input_parameters", this.extra);
    return metadata;
}

2) Hits and Counts

Hits refers to the number of times Loklak Server hit the target website, whereas count refers to the number of posts scraped by the scraper. Fetching this data was easy.

For count, I added a method putData in BaseScraper. It should be used to create the list of posts instead of creating the list directly. Here I added a counter that counts the posts.

protected Post putData(Post typeArray, String key, JSONArray postList) {
    this.count = this.count + postList.length();
    typeArray.put(key, postList);
    return typeArray;
}

For hits, I just counted the number of times a URL was fed into ClientConnection.

public Post getDataFromConnection(String url, String type) throws IOException {
    // This adds to hits count even if connection fails
    this.hits++;
    ClientConnection connection = new ClientConnection(url);
.
.
.

3) For multiscrapers in Search Endpoint

This was a bit tricky. To create a metadata block covering all the scrapers, I had to fetch the metadata block of each scraper, process them and then output the combined block with the results. I added this to the PostTimeline iterator and implemented it as a loop that runs when a scraper outputs data.

public void collectMetadata(JSONObject metadata) {
    // INITIALIZE PARAMETERS
    int hits = 0;
    int count = 0;
    Set<String> scrapers = new HashSet<String>();

    // GET LIST OF KEYS IN SCRAPER
    List<String> listKeys = new ArrayList<String>(this.posts.keySet());
    int n = listKeys.size();

    for (int i = 0; i < n; i++) {
        // FETCH METADATA POST FROM SCRAPED DATA
        Post postMetadata = (Post) this.posts.get(listKeys.get(i)).get("metadata");
        hits = hits + Integer.parseInt(String.valueOf(postMetadata.get("hits")));
        count = count + Integer.parseInt(String.valueOf(postMetadata.get("count")));
        // Convert to String so the set stays typed
        scrapers.add(String.valueOf(postMetadata.get("scraper")));
    }

    // SET OUTPUT
    metadata.put("hits", hits);
    metadata.put("count", count);
    metadata.put("scraper_count", scrapers.size());
    metadata.put("scrapers", scrapers);
}

References

Crawlers and Metadata Extraction (Stuff that needs to be solved): https://vimeo.com/53109189
Why Metadata? https://www.villanovau.com/resources/bi/metadata-importance-in-data-driven-world/#.WZmbMKvhXeQ

Backend Scraping in Loklak Server

Post author:Vibhor Verma
Post published:August 20, 2017
Post category:FOSSASIA loklak

Loklak Server is a peer-to-peer distributed scraping system. It scrapes data from websites and also maintains other sources, such as peers, storage and a backend server, from which to fetch data. Maintaining different sources has the benefit of avoiding costly requests to the websites, as well as the scraping and cleaning of data.

Loklak Server can maintain a secondary Loklak Server (a backend server) tuned for storing large amounts of data. This lets the primary Loklak Server fetch data from the backend in return for pushing all scraped data to it.

Recently there was a bug in backend search: a new feature for filtering tweets had been added to scraping and indexing, but not to backend search. To fix this issue, I traced through the backend search codebase and fixed it.

Let us discuss how backend search works:

1) When a query is made from the search endpoint with:

a) source=all

When source is set to all, TwitterScraper and messages from the local search server are preferred first. If the messages scraped are not enough, or no output has been returned for a specific amount of time, backend search is initiated.

b) source=backend

SearchServlet fetches data directly from the backend server.

2) Fetching data from Backend Server

The input parameters fetched from the client are fed into the DAO.searchBackend method. The list of backend servers is fetched from the config file. Using these input parameters and backend servers, the required data is fetched and output to the client.

In the DAO.searchOnOtherPeers method, the request is sent to multiple servers, which are arranged in order of better response rates. This method invokes the SearchServlet.search method to send the request to those servers.

List<String> remote = getBackendPeers();
if (remote.size() > 0) {
    // condition deactivated because we need always at least one peer
    Timeline tt = searchOnOtherPeers(remote, q, filterList, order, count, timezoneOffset, where, SearchServlet.backend_hash, timeout);
    if (tt != null) tt.writeToIndex();
    return tt;
}

3) Creation of request url and sending requests

The request URL is created from the input parameters passed to the SearchServlet.search method, and the request is sent to the respective servers to fetch the required messages.

    // URL creation
    urlstring = protocolhostportstub + "/api/search.json?q="
           + URLEncoder.encode(query.replace(' ', '+'), "UTF-8") + "&timezoneOffset="
           + timezoneOffset + "&maximumRecords=" + count + "&source="
           + (source == null ? "all" : source) + "&minified=true&shortlink=false&timeout="
           + timeout;
    if(!"".equals(filterString = String.join(", ", filterList))) {
       urlstring = urlstring + "&filter=" + filterString;
    }
    // Download data
    byte[] jsonb = ClientConnection.downloadPeer(urlstring);
    if (jsonb == null || jsonb.length == 0) throw new IOException("empty content from " + protocolhostportstub);
    String jsons = UTF8.String(jsonb);
    JSONObject json = new JSONObject(jsons);
    if (json == null || json.length() == 0) return tl;
    // Final data fetched to be returned
    JSONArray statuses = json.getJSONArray("statuses");
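
The URL creation step can be isolated into a small helper that encodes every parameter value. This is a simplified sketch, not the SearchServlet code: PeerSearchUrlSketch and build are hypothetical names, only some of the parameters from the snippet above are included, the filters are joined with a plain comma, and it is an assumption that the receiving peer accepts the filter value in encoded form.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.List;

// Hypothetical helper; a simplified variant of the URL creation shown above.
class PeerSearchUrlSketch {
    static String build(String peer, String query, String source,
                        int count, List<String> filterList) {
        try {
            StringBuilder url = new StringBuilder(peer)
                    .append("/api/search.json?q=")
                    .append(URLEncoder.encode(query, "UTF-8"))
                    .append("&maximumRecords=").append(count)
                    .append("&source=").append(source == null ? "all" : source);
            // Only add the filter parameter when there is something to filter on
            if (!filterList.isEmpty()) {
                url.append("&filter=")
                   .append(URLEncoder.encode(String.join(",", filterList), "UTF-8"));
            }
            return url.toString();
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on the JVM, so this should not happen
            throw new IllegalStateException(e);
        }
    }
}
```

Encoding each value separately keeps spaces and commas in queries or filters from breaking the URL.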

References

Social peer-to-peer processes: https://en.wikipedia.org/wiki/Social_peer-to-peer_processes
Parallel Random-Access Machine: http://pages.cs.wisc.edu/~tvrdik/2/html/Section2.html
Distributed Algorithm (Cole–Vishkin algorithm): http://homepage.divms.uiowa.edu/~ghosh/color.pdf

Configuring Youtube Scraper with Search Endpoint in Loklak Server

Post author:Vibhor Verma
Post published:August 20, 2017
Post category:FOSSASIA GSoC loklak Tutorial

Youtube Scraper is one of the interesting web scrapers of Loklak Server, with a unique implementation of its data scraping and data key creation (using RDF). It couldn’t be accessed, as it didn’t have a URL endpoint. I configured it to be usable both through a separate endpoint (/api/youtubescraper) and through the search endpoint (/api/search.json).

Usage:

YoutubeScraper Endpoint: /api/youtubescraper

Example: http://api.loklak.org/api/youtubescraper?query=https://www.youtube.com/watch?v=xZ-m55K3FhQ&scraper=youtube

SearchServlet Endpoint: /api/search.json

Example: http://api.loklak.org/api/search.json?query=https://www.youtube.com/watch?v=xZ-m55K3FhQ&scraper=youtube

The configurations added in Loklak Server are:

1) Endpoint

We can access YoutubeScraper using the /api/youtubescraper endpoint. As with other scrapers, I used the BaseScraper class as the superclass for this functionality.

2) PrepareSearchUrl

The prepareSearchUrl method creates the Youtube URL that is used to scrape the Youtube webpage. YoutubeScraper takes a URL as input, but a Youtube link could also be a shortened link. That is why the video id is stored as the query. This approach optimizes the scraper and makes it possible to add more scrapers to it.

Currently YoutubeScraper scrapes the video webpages of Youtube, but scrapers for the search webpage and channel webpages can also be added.

URIBuilder url = null;
String midUrl = "search/";
    try {
       switch(type) {
           case "search":
               midUrl = "search/";
               url = new URIBuilder(this.baseUrl + midUrl);
               url.addParameter("search_query", this.query);
               break;
           case "video":
               midUrl = "watch/";
               url = new URIBuilder(this.baseUrl + midUrl);
               url.addParameter("v", this.query);
               break;
           case "user":
               midUrl = "channel/";
               url = new URIBuilder(this.baseUrl + midUrl + this.query);
               break;
           default:
               url = new URIBuilder("");
               break;
       }
    } catch (URISyntaxException e) {
       DAO.log("Invalid Url: baseUrl = " + this.baseUrl + ", mid-URL = " + midUrl + ", query = " + this.query + ", type = " + type);
       return "";
    }

3) Get-Data-From-Connection

The getDataFromConnection method is used to fetch a BufferedReader object and pass it to the scrape method. In YoutubeScraper, this method has been overridden to avoid the default implementation, which uses type=all.

@Override
public Post getDataFromConnection() throws IOException {
    String url = this.prepareSearchUrl(this.type);
    return getDataFromConnection(url, this.type);
}

4) Set scraper parameters input as get-parameters

The Map of get-parameters fetched by the scraper provides type and query. For a URL, the video id is separated from the URL and then used as the query.

this.query = this.getExtraValue("query");
this.query = this.query.substring(this.query.length() - 11);
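
Taking the last 11 characters works when the URL ends with the video id, but not when extra parameters follow it. As a possible alternative, here is a sketch that pulls the id out with a regular expression. YoutubeIdSketch and videoId are hypothetical names, and the 11-character id made of letters, digits, "-" and "_" is an observed convention rather than a documented guarantee.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; not part of YoutubeScraper.
class YoutubeIdSketch {
    // Matches the id after "v=" in a watch URL or after the host of a youtu.be link.
    private static final Pattern VIDEO_ID =
            Pattern.compile("(?:[?&]v=|youtu\\.be/)([A-Za-z0-9_-]{11})");

    // Returns the video id, or an empty string when none is found.
    static String videoId(String url) {
        Matcher m = VIDEO_ID.matcher(url);
        return m.find() ? m.group(1) : "";
    }
}
```

For the example URL used earlier, videoId returns xZ-m55K3FhQ, and it returns the same id for a youtu.be short link that carries a trailing parameter.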

5) Scrape Method

The scrape method runs the different scraper methods (in YoutubeScraper there is only one), collects their results using PostTimeline and wraps them in a Post object for output. This simple function can improve the flexibility of the scraper to scrape different pages concurrently.

Post out = new Post(true);
Timeline2 postList = new Timeline2(this.order);
postList.addPost(this.parseVideo(br, type, url));
out.put("videos", postList.toArray());

References

What is an RDF triple explained on Stackoverflow: https://stackoverflow.com/questions/273218/whats-a-rdf-triple
Tutorial on Scraping with Regular Expressions: http://stanford.edu/~mgorkove/cgi-bin/rpython_tutorials/Scraping_PDFsText_Files_in_Python_Using_Regular_Expressions.php
Youtube Video-Id Format: https://webapps.stackexchange.com/questions/54443/format-for-id-of-youtube-video

Scraping Concurrently with Loklak Server

Post author:Vibhor Verma
Post published:August 7, 2017
Post category:FOSSASIA GSoC loklak

At present, SearchScraper in Loklak Server uses numerous threads to scrape the Twitter website. The data fetched is cleaned and more data is extracted from it. But scraping only Twitter leaves the server underused.

Concurrent scraping of other websites such as Quora, Youtube and Github can be added to diversify the application. This way, the single endpoint search.json can serve multiple services.

As this feature is still being refined, we will discuss only the basic structure of the system with the new changes. I tried to implement a more abstract way of scraping by:

1) Fetching the input data in SearchServlet

Instead of selecting individual input get-parameters and passing them on, the complete Map object is now passed, which makes it possible to add more functionality based on the input get-parameters. The dataArray object (a JSONArray) is fetched from the DAO.scrapeLoklak method and is embedded in the output under the key results.

    // start a scraper
    inputMap.put("query", query);
    DAO.log(request.getServletPath() + " scraping with query: "
           + query + " scraper: " + scraper);
    dataArray = DAO.scrapeLoklak(inputMap, true, true);

2) Scraping the selected Scrapers concurrently

In DAO.java, the useful get-parameters of inputMap are fetched and cleaned. They are used to choose the scrapers to run, using the getScraperObjects() method.

Timeline2.Order order= getOrder(inputMap.get("order"));
Timeline2 dataSet = new Timeline2(order);
List<String> scraperList = Arrays.asList(inputMap.get("scraper").trim().split("\\s*,\\s*"));

Threads are created to fetch data from the different scrapers according to the size of the list of scraper objects. The input map is passed as an argument to the scrapers, so that each can read the get-parameters relevant to it and output data accordingly.

List<BaseScraper> scraperObjList = getScraperObjects(scraperList, inputMap);
ExecutorService scraperRunner = Executors.newFixedThreadPool(scraperObjList.size());

try{
    for (BaseScraper scraper : scraperObjList)
    {
        scraperRunner.execute(() -> {
            dataSet.mergePost(scraper.getData());
        });

    }

} finally {
    scraperRunner.shutdown();

    try {
        scraperRunner.awaitTermination(24L, TimeUnit.HOURS);
    } catch (InterruptedException e) { }
}
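
The same pattern can be tried in isolation with plain callables. The following is a sketch under stated assumptions rather than the DAO code: ConcurrentScrapeSketch and scrapeAll are hypothetical names, each scraper is modelled as a Callable returning a list of strings, and invokeAll is used in place of execute so that results are merged on the calling thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical stand-in: each "scraper" is just a Callable that returns posts.
class ConcurrentScrapeSketch {
    static List<String> scrapeAll(List<Callable<List<String>>> scrapers) {
        List<String> merged = new ArrayList<>();
        // newFixedThreadPool(0) throws IllegalArgumentException, so guard it
        if (scrapers.isEmpty()) return merged;
        ExecutorService runner = Executors.newFixedThreadPool(scrapers.size());
        try {
            // invokeAll blocks until every task has finished
            for (Future<List<String>> result : runner.invokeAll(scrapers)) {
                try {
                    merged.addAll(result.get());
                } catch (ExecutionException e) {
                    // One failing scraper should not discard the others
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // keep the interrupt flag set
        } finally {
            runner.shutdown();
        }
        return merged;
    }
}
```

Merging on the calling thread avoids concurrent writes to the shared result, and the guard for an empty list covers the case where no known scraper name was passed.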

3) Fetching the selected Scraper Objects in DAO.java

Here a variable of the abstract class BaseScraper (the superclass of all search scrapers) is used to create the list of scrapers to run. Every scraper’s constructor is given the input map so that it scrapes accordingly.

List<BaseScraper> scraperObjList = new ArrayList<BaseScraper>();
BaseScraper scraperObj = null;

if (scraperList.contains("github") || scraperList.contains("all")) {
    scraperObj = new GithubProfileScraper(inputMap);
    scraperObjList.add(scraperObj);
}
.
.
.

References:

Best practices of Multithreading in Java: https://stackoverflow.com/questions/17018507/java-multithreading-best-practice
ExecutorService vs Casual Thread Spawner: https://stackoverflow.com/questions/26938210/executorservice-vs-casual-thread-spawner
Basic Data Structures used in Java: https://www.eduonix.com/blog/java-programming-2/learn-to-implement-data-structures-in-java/

Data Indexing in Loklak Server

Post author:Vibhor Verma
Post published:July 29, 2017
Post category:FOSSASIA loklak

Loklak Server is a data-scraping system that indexes all the scraped data in order to optimize retrieval. The data fetched by different users is stored as a cache, so that recurring queries can be served directly from it. When users search for the same queries, the load on Loklak Server is reduced by outputting indexed data, which optimizes operations.

Application

Loklak Server depends on ElasticSearch for indexing the cached data (as JSON). When users repeat a query, the indexed data is output instead of scraping the same data again, which reduces the load on Loklak Server.

When is data indexing done?

The indexing of data is done when:

1) Data is scraped:

When data is scraped, it is indexed concurrently with the cleaning of data in the TwitterTweet data object. For this task, the addScheduler static method of IncomingMessageBuffer is used, which acts as an abstraction between the scraping of data and the storing and indexing of data.

The following is the implementation from TwitterScraper. Here writeToIndex is the boolean flag that controls whether the data is indexed.

if (this.writeToIndex) IncomingMessageBuffer.addScheduler(this, this.user, true);

2) Data is fetched from backend:

When data is fetched from the backend, it is indexed in the Timeline iterator, which calls the above method to index data concurrently.

The following is the definition of the writeToIndex() method from Timeline.java. When writeToIndex() is called, the fetched data is indexed.

public void writeToIndex() {
    IncomingMessageBuffer.addScheduler(this, true);
}

How?

When the addScheduler static method of IncomingMessageBuffer is called, a thread is started that indexes all data. Indexing continues as long as the messagequeue data structure holds messages.

The DAO method writeMessageBulk is called to write the data. The data is then written to the following streams:

1) Dump: The fetched data is dumped into a file in the Import directory. It can also be fetched from other peers.

2) Index: The fetched data is checked against the index, and any data that isn’t indexed yet is indexed.

public static Set<String> writeMessageBulk(Collection<MessageWrapper> mws) {
    List<MessageWrapper> noDump = new ArrayList<>();
    List<MessageWrapper> dump = new ArrayList<>();
    for (MessageWrapper mw: mws) {
        if (mw.t == null) continue;
        if (mw.dump) dump.add(mw);
        else noDump.add(mw);
    }

    Set<String> createdIDs = new HashSet<>();
    createdIDs.addAll(writeMessageBulkNoDump(noDump));
    // writeMessageBulkDump also does a writeMessageBulkNoDump internally
    createdIDs.addAll(writeMessageBulkDump(dump));
    return createdIDs;
}

The above code snippet is from DAO.java. The call writeMessageBulkNoDump(noDump) indexes the data to ElasticSearch.

For dumping the data, writeMessageBulkDump(dump) is called.

Resources:

Iterable: https://docs.oracle.com/javase/8/docs/api/java/lang/Iterable.html
Use of Iterable: https://stackoverflow.com/questions/1059127/what-is-the-iterable-interface-used-for
ElasticSearch Webinar: https://www.elastic.co/webinars/getting-started-elasticsearch?elektra=home&storm=sub1
Ways to iterate through loop: https://crunchify.com/how-to-iterate-through-java-list-4-way-to-iterate-through-loop/

Some Other Services in Loklak Server

Post author:Vibhor Verma
Post published:July 29, 2017
Post category:FOSSASIA loklak

Loklak Server isn’t just a scraper system. It provides numerous other services, such as link unshortening (the reverse of link shortening) and video fetching, as well as administrative tasks such as fetching the status of a Loklak deployment (for analysis in Loklak development). Some of these are implemented internally and the rest can be used through http endpoints. There are also some services that aren’t complete and are still in development.

Let’s go through some of them to know a bit about them and how they can be used.

1) VideoUrlService

This service extracts the video from a website that has a streaming video and outputs the video file link. It is still in development but is functional. At present it can fetch Twitter video links and output them in different video qualities.

Endpoint: /api/videoUrlService.json

Implementation Example:

curl "http://api.loklak.org/api/videoUrlService.json?id=https://twitter.com/EXOGlobal/status/886182766970257409&id=https://twitter.com/KMbappe/status/885963850708865025"

2) Link Unshortening Service

This service unshortens links. Some websites use shortened URLs to track Internet users. To prevent this, the link unshortening service unshortens the link and returns the final, untrackable link to the user.

Currently this service is used in TwitterScraper to unshorten the fetched URLs. It also has methods to get the redirect link and to get the final URL from a link that has been shortened multiple times.

Implementation Example from TwitterScraper.java:

Matcher m = timeline_link_pattern.matcher(text);

if (m.find()) {
    String expanded = RedirectUnshortener.unShorten(m.group(2));
    text = m.replaceFirst(" " + expanded);
    continue;
}

It can also be exposed as a service and used directly. New features, such as fetching the featured image from links, can be added to this service. These ideas are under discussion, and enthusiastic contributions are most welcome.

3) StatusService

This service outputs all data related to the configuration of a Loklak Server deployment. To access it, the API endpoint status.json is used.

It outputs the following data:

a) The number of messages it scrapes per second, minute, hour, day, etc.

b) The configuration of the server, such as RAM, assigned memory, used memory, number of CPU cores, CPU load, etc.

c) Other configuration related to the application, such as the size and specifications of the ElasticSearch shards, the client request header, the number of running threads, etc.

Endpoint: /api/status.json

Implementation Example:

curl http://api.loklak.org/api/status.json

Resources:

Code URL Shortener: https://stackoverflow.com/questions/742013/how-to-code-a-url-shortener
URL Shortening-Hashing in Practice: https://blog.codinghorror.com/url-shortening-hashes-in-practice/
ElasticSearch: https://www.elastic.co/webinars/getting-started-elasticsearch?elektra=home&storm=sub1
M3U8 format: https://www.lifewire.com/m3u8-file-2621956
Fetch Video using PHP: https://stackoverflow.com/questions/10896233/how-can-i-retrieve-youtube-video-details-from-video-url-using-php

Simplifying Scrapers using BaseScraper

Post author:Vibhor Verma
Post published:July 18, 2017
Post category:FOSSASIA GSoC loklak

Loklak Server’s main function is to scrape data from websites and other sources and output it in different formats such as JSON, XML and RSS. There are many scrapers in the project that scrape data and output it, but they are implemented with different designs and libraries, which makes them differ from each other and difficult to change.

Because of the variation in the scrapers’ design, it is difficult to modify them and to fix the same issue in each of them. This signals a fault in the design. To solve the problem, inheritance can be applied. I therefore created the BaseScraper abstract class, so that scrapers concentrate on fetching data from HTML while all supporting tasks, such as creating a connection from a URL, are defined in BaseScraper.

The concept is pretty easy to implement, but for a perfect implementation, there is a need to go through the complete list of tasks a scraper does.

These are the tasks, with descriptions of how they are implemented using BaseScraper:

Endpoint that triggers the scraper

Every search scraper inherits the class AbstractAPIHandler. This is used to fetch get-parameters from the endpoint, according to which data is scraped. The arguments of the serviceImpl method are used to generate the output, which is returned as a JSONObject.

For this task, the method serviceImpl has been defined in BaseScraper and method getData is implemented to return the output. This method is the driver method of the scraper.

public JSONObject serviceImpl(Query call, HttpServletResponse response, Authorization rights, JSONObjectWithDefault permissions) throws APIException {
    this.setExtra(call);
    return this.getData().toJSON(false, "metadata", "posts");
}

Constructor

The constructor of a scraper defines the base URL of the website to be scraped, the name of the scraper and the data structure that holds all get parameters passed to the scraper. The get parameters are fetched from the Query object as a Map.

Since every scraper has its own base URL, scraper name and get parameters, the constructor is implemented in the respective scrapers. QuoraProfileScraper is an example that defines these variables.

Get all input variables

To get all input variables, BaseScraper defines setters and getters that fetch them as a Map from the Query object. There is also an abstract method setParam(). It is defined in the respective scrapers to fetch the parameters the scraper needs and set them to the scraper’s class variables.

// Setter for get parameters from call object
protected void setExtra(Query call) {
    this.extra = call.getMap();
    this.query = call.get("query", "");
    this.setParam();
}

// Getter for a get parameter, looked up by its key
public String getExtraValue(String key) {
    String value = "";
    if(this.extra.get(key) != null) {
        value = this.extra.get(key).trim();
    }
    return value;
}

// Definition in QuoraProfileScraper
protected void setParam() {
    if(!"".equals(this.getExtraValue("type"))) {
        this.typeList = Arrays.asList(this.getExtraValue("type").trim().split("\\s*,\\s*"));
    } else {
        this.typeList = new ArrayList<String>();
        this.typeList.add("all");
        this.setExtraValue("type", String.join(",", this.typeList));
    }
}
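
To see what the split on "\\s*,\\s*" does in isolation, here is a small self-contained sketch. The class TypeParamDemo and the method parseTypes are hypothetical names used only for this illustration; the fallback to "all" mirrors the snippet above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class TypeParamDemo {
    // Split a comma-separated "type" value, tolerating spaces around commas.
    // Falls back to a single "all" entry when the value is missing or empty.
    static List<String> parseTypes(String raw) {
        String value = (raw == null) ? "" : raw.trim();
        if (value.isEmpty()) {
            List<String> fallback = new ArrayList<>();
            fallback.add("all");
            return fallback;
        }
        return Arrays.asList(value.split("\\s*,\\s*"));
    }

    public static void main(String[] args) {
        System.out.println(parseTypes(" question , answer,user ")); // [question, answer, user]
        System.out.println(parseTypes(""));                         // [all]
    }
}
```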

URL creation for web scraper

The URL creation shall be implemented in a separate method, as in TwitterScraper. The following is a rough implementation adapted from one of my pull requests:

protected String prepareSearchUrl(String type) {
    URIBuilder url = null;
    String midUrl = "search/";

    try {
        switch(type) {
            case "question":
                url = new URIBuilder(this.baseUrl + midUrl);
                url.addParameter("q", this.query);
                url.addParameter("type", "question");
        .
        .
    }
    .
    .
    return url.toString();
}
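
The snippet above relies on URIBuilder to append and encode the parameters. For readers without that library at hand, the same idea can be sketched with the Java standard library only. SearchUrlSketch, the example.org base URL and the fixed parameter set are assumptions made for this illustration, not the code from the pull request.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

class SearchUrlSketch {
    // Percent-encode one query component (spaces become '+').
    static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 support is mandatory on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // Build baseUrl + "search/" followed by the encoded query string.
    static String prepareSearchUrl(String baseUrl, String query, String type) {
        Map<String, String> params = new LinkedHashMap<>(); // keeps insertion order
        params.put("q", query);
        params.put("type", type);

        StringJoiner queryString = new StringJoiner("&");
        for (Map.Entry<String, String> e : params.entrySet()) {
            queryString.add(encode(e.getKey()) + "=" + encode(e.getValue()));
        }
        return baseUrl + "search/?" + queryString;
    }

    public static void main(String[] args) {
        System.out.println(prepareSearchUrl("https://example.org/", "fossasia loklak", "question"));
    }
}
```

Keeping the URL construction in one method means the encoding rules are applied in a single place for every search type.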

Get BufferedReader object from InputStream

The getDataFromConnection method fetches the BufferedReader object from ClientConnection. The scrape method uses this object to read the web page line by line and fetch data. See here.

ClientConnection connection = new ClientConnection(url);
BufferedReader br = getHtml(connection);
.
.
.
public BufferedReader getHtml(ClientConnection connection) {

    if (connection.inputStream == null) {
        return null;
    }

    BufferedReader br = new BufferedReader(new InputStreamReader(connection.inputStream, StandardCharsets.UTF_8));
    return br;
}
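
How a scrape method might consume such a reader can be sketched as below. The class LineReaderSketch and the method linesContaining are hypothetical, and an in-memory stream stands in for the connection's input stream, so this shows only the line-by-line reading pattern.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class LineReaderSketch {
    // Wrap a raw stream in a UTF-8 BufferedReader; null if there is no stream.
    static BufferedReader toReader(InputStream in) {
        if (in == null) {
            return null;
        }
        return new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
    }

    // Read the page line by line and keep the lines that contain the marker.
    static List<String> linesContaining(InputStream in, String marker) {
        List<String> hits = new ArrayList<>();
        BufferedReader reader = toReader(in);
        if (reader == null) {
            return hits;
        }
        try (BufferedReader br = reader) {
            String line;
            while ((line = br.readLine()) != null) {
                if (line.contains(marker)) {
                    hits.add(line.trim());
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return hits;
    }

    public static void main(String[] args) {
        String html = "<html>\n  <a href=\"/x\">x</a>\n<p>no link</p>\n<a href=\"/y\">y</a>";
        InputStream in = new ByteArrayInputStream(html.getBytes(StandardCharsets.UTF_8));
        System.out.println(linesContaining(in, "<a "));
    }
}
```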

Scraping of data from HTML

The method for scraping data is declared abstract in BaseScraper and defined in each scraper. This can be a perfect example of implementation for BaseScraper (see the code here) and scraper (here).

Output of data

The output of the scrape method is stored in Post data objects that are implemented for the respective scraper. These Post objects are added to a Timeline iterator, which outputs the data as a JSONArray. Finally, the objects are output enclosed in a Post object wrapper.

This data could be output directly as a Post object, but adding it to the iterator allows the Post objects to be sorted and indexed to ElasticSearch.

Resources

Loklak Server: https://github.com/loklak/loklak_server
ElasticSearch: https://www.elastic.co/webinars/getting-started-elasticsearch?elektra=home&storm=sub1

Iterating the Loklak Server data

Post author:Vibhor Verma
Post published:July 18, 2017
Post category:FOSSASIA GSoC loklak
Post comments:0 Comments

Loklak Server is amazing for what it does, but how it does its tasks is even more impressive. Most developers know what iterators are for and how to use them, but this project has a customized iterator that iterates Twitter data objects. This iterator is Timeline.java.

Timeline implements the Iterable interface (not Iterator, as one might expect). This lets Timeline be used like an iterator, and the class adds methods to modify, use or create the data objects. At present, it only iterates Twitter data objects. I am working on modifying it to iterate data objects from all web scrapers.

The following is a simple example of how an iterator is used.

import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Initializing the list
List<String> stringsList = Arrays.asList("foo", "bar", "baz");

// Using iterator to display contents of stringsList
System.out.print("Contents of stringsList: ");

Iterator<String> iter = stringsList.iterator();
while(iter.hasNext()) {
    System.out.print(iter.next() + " ");
}

This iterator can only step through the data the way an array index does. (Then why do we need it?) It does the task of iterating objects perfectly, but we can add more functionality to an iterator.
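
As a rough illustration of adding functionality, a class can implement Iterable and decide for itself how its elements are traversed. ReverseLog below is a hypothetical container, not part of Loklak; it returns its entries newest first while still working in a for-each loop.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical container: usable in for-each loops, but with its own order.
class ReverseLog implements Iterable<String> {
    private final List<String> entries = new ArrayList<>();

    void add(String entry) {
        entries.add(entry);
    }

    // Iterate from the most recently added entry back to the oldest one.
    @Override
    public Iterator<String> iterator() {
        return new Iterator<String>() {
            private int next = entries.size() - 1;

            @Override
            public boolean hasNext() {
                return next >= 0;
            }

            @Override
            public String next() {
                if (next < 0) {
                    throw new NoSuchElementException();
                }
                return entries.get(next--);
            }
        };
    }

    public static void main(String[] args) {
        ReverseLog log = new ReverseLog();
        log.add("a");
        log.add("b");
        log.add("c");
        for (String s : log) {
            System.out.print(s + " "); // prints: c b a
        }
    }
}
```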

The Timeline iterator iterates MessageEntry objects, i.e. the superclass of TwitterTweet objects. According to the Javadocs, “Timeline is a structure which holds tweet for the purpose of presentation, There is no tweet retrieval method here, just an iterator which returns the tweets in reverse appearing order.”

Following are some of the tasks it does:

As an iterator:

The basic use of Timeline is to iterate the MessageEntry objects. It not only iterates the data objects, but also fetches them (See here).

// Declare Timeline object according to order the data object has been created
Timeline tline = new Timeline(Timeline.parseOrder("created_at"));

// Adding data objects to the timeline
tline.add(me1);
tline.add(me2);
.
.
.
// Outputting all data objects as an array of JSON objects
JSONArray postArray = new JSONArray();
for (MessageEntry post : tline) {
    postArray.put(post.toJSON());
}

The order of iterating the data objects

Timeline can arrange and iterate the data objects according to the creation date of the Twitter post, the number of retweets or the number of favourites. For this, the Timeline class declares an enum Order, which is initialized when the Timeline object is created.

    Timeline tline = new Timeline(Timeline.parseOrder("created_at"));
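
One way such an order enum could work is sketched below. Everything here is an assumption for illustration: the class OrderSketch, the minimal Post type, and the key string "retweet_count" are not taken from Timeline.java; only the "created_at" key appears in the snippet above.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class OrderSketch {
    // Hypothetical minimal post with two sortable fields.
    static class Post {
        final String id;
        final long createdAt;
        final int retweets;

        Post(String id, long createdAt, int retweets) {
            this.id = id;
            this.createdAt = createdAt;
            this.retweets = retweets;
        }
    }

    // Each constant pairs a parameter key with a "largest first" comparator.
    enum Order {
        CREATED_AT("created_at", Comparator.comparingLong((Post p) -> p.createdAt).reversed()),
        RETWEET_COUNT("retweet_count", Comparator.comparingInt((Post p) -> p.retweets).reversed());

        final String key;
        final Comparator<Post> comparator;

        Order(String key, Comparator<Post> comparator) {
            this.key = key;
            this.comparator = comparator;
        }

        // Map a key string to an order, defaulting to creation date.
        static Order parse(String key) {
            for (Order o : values()) {
                if (o.key.equals(key)) {
                    return o;
                }
            }
            return CREATED_AT;
        }
    }

    // Return the post ids sorted by the given order, without touching the input list.
    static List<String> sortedIds(List<Post> posts, Order order) {
        List<Post> copy = new ArrayList<>(posts);
        copy.sort(order.comparator);
        List<String> ids = new ArrayList<>();
        for (Post p : copy) {
            ids.add(p.id);
        }
        return ids;
    }
}
```

Choosing the order once, at construction time, keeps the iteration code itself free of sorting decisions.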

Pagination of data objects

There is a cursor object and some methods, including getters and setters, to support pagination of the data objects. It is currently used only internally, but can also be used to return a section of the result.
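
The general idea of returning a section of a result with a cursor can be sketched as follows. This is not how Timeline implements it: PageSketch, the integer offset used as a cursor and the -1 end marker are all assumptions chosen to keep the example small.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class PageSketch {
    // Return up to `size` items starting at offset `cursor`.
    static <T> List<T> page(List<T> items, int cursor, int size) {
        if (cursor < 0 || size <= 0 || cursor >= items.size()) {
            return Collections.emptyList();
        }
        int end = Math.min(items.size(), cursor + size);
        return new ArrayList<>(items.subList(cursor, end));
    }

    // Offset of the following page, or -1 when no items remain.
    static int nextCursor(List<?> items, int cursor, int size) {
        if (cursor < 0 || size <= 0) {
            return -1;
        }
        int next = cursor + size;
        return next < items.size() ? next : -1;
    }

    public static void main(String[] args) {
        List<String> items = Arrays.asList("a", "b", "c", "d", "e");
        int cursor = 0;
        while (cursor != -1) {
            System.out.println(page(items, cursor, 2));
            cursor = nextCursor(items, cursor, 2);
        }
    }
}
```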

writeToIndex method

This method can be used to write all data fetched by the Timeline iterator to ElasticSearch for indexing and to a dump that can be used for testing. Thus, indexing of data can be done concurrently while it is iterated. It is implemented here.

Other methods

It also has methods to output all data as JSON, a customized method to add data to the Timeline keeping the user object and data separate, etc. There is more in this iterable class that is worth exploring.

Resources:

Loklak Server: https://github.com/loklak/loklak_server
Iterable: https://docs.oracle.com/javase/8/docs/api/java/lang/Iterable.html
Use of Iterable: https://stackoverflow.com/questions/1059127/what-is-the-iterable-interface-used-for
ElasticSearch: https://www.elastic.co/webinars/getting-started-elasticsearch?elektra=home&storm=sub1
Ways to iterate through loop: https://crunchify.com/how-to-iterate-through-java-list-4-way-to-iterate-through-loop/
