loklak server | blog.fossasia.org

Adding Push endpoint to send data from Loklak Search to Loklak Server

Post author:simsausaurabh
Post published:June 23, 2018
Post category:FOSSASIA GSoC loklak
Post comments:0 Comments

To provide enriched and sufficient amount of data to Loklak, Loklak Server should have multiple sources of data. The api/push.json endpoint of loklak server is used in Loklak to post the search result object to server. It will increase the amount and quality of data on server once the Twitter api is supported by Loklak (Work is in progress to add support for twitter api in loklak).

Creating Push Service

The idea is to create a separate service for making a Post request to server. First step would be to create a new ‘PushService’ under ‘services/’ using:

ng g service services/push

Creating model for Push Api Response

Before starting to write code for push service, create a new model for the type of response data obtained from Post request to ‘api/push.json’. For this, create a new file push.ts under ‘models/’ with the code given below and export the respective push interface method in index file.

export interface PushApiResponse {
   status: string;
   records: number;
   mps: number;
   message: string;
}

Writing Post request in Push Service

Next step would be to create a Post request to api/push.json using HttpClient module. Import necessary dependencies and create an object of HttpClient module in constructor and write a PostData() method which would take the data to be send, makes a Post request and returns the Observable of PushApiResponse created above.

import { Injectable } from ‘@angular/core’;
import {
   HttpClient,
   HttpHeaders,
   HttpParams
} from ‘@angular/common/http’;
import { Observable } from ‘rxjs’;
import {
	ApiResponse,
	PushApiResponse
} from ‘../models’;

@Injectable({
   providedIn: ‘root’
})
export class PushService {

   constructor( private http: HttpClient ) { }
   public postData(data: ApiResponse):
   		Observable<PushApiResponse> {

	const httpUrl = ‘https://api.loklak.org/
		api/push.json’;
	const headers = new HttpHeaders({
		‘Content-Type’: ‘application/
			x-www-form-urlencoded’,
		‘Accept’: ‘application/json’,
		‘cache-control’: ‘no-cache’
	});
	const {search_metadata, statuses} = data;
	
	// Converting the object to JSON string.
	const dataToSend = JSON.stringify({
		search_metadata: search_metadata,
		statuses});
	
	// Setting the data to send in
	// HttpParams() with key as ‘data’
	const body = new HttpParams()
		.set(‘data’, dataToSend);
	
	// Making a Post request to api/push.json
	// endpoint. Response Object is converted
	// to PushApiResponse type.
	return this.http.post<PushApiResponse>(
		httpUrl, body, {headers:
		headers
	});
   }
}

Note: Data (dataToSend) send to backend should be exactly in same format as obtained from server.

Pushing data into service dynamically

Now the main part is to provide the data to be send into the service. To make it dynamic, import the Push Service in ‘api-search.effects.ts’ file under effects and create the object of Push Service in its constructor.

import { PushService } from ‘../services’;
constructor(
   …
   private pushService: PushService
) { }

Now, call the pushService object inside ‘relocateAfterSearchSuccess$’ effect method and pass the search response data (payload value of search success action) inside Push Service’s postData() method.

@Effect()
relocateAfterSearchSuccess$: Observable<Action>
   = this.actions$
       .pipe(
           ofType(
               apiAction.ActionTypes
			   	.SEARCH_COMPLETE_SUCCESS,
               apiAction.ActionTypes
			   	.SEARCH_COMPLETE_FAIL
           ),
           withLatestFrom(this.store$),
           map(([action, state]) => {
               this.pushService
			   .postData(action[‘payload’]);
           …
       );

Testing Successful Push to Backend

To test the success of Post request, subscribe to the response data and print the response data on console. You should see something like:

Where each of these is a response of one successful Post request.

Resources

Angular: Services
Angular: Sending data to the server

Indexing for multiscrapers in Loklak Server

Post author:Vibhor Verma
Post published:September 4, 2017
Post category:FOSSASIA loklak
Post comments:0 Comments

I recently added multiscraper system which can scrape data from web-scrapers like YoutubeScraper, QuoraScraper, GithubScraper, etc. As scraping is a costly task, it is important to improve it’s efficiency. One of the approach is to index data in cache. TwitterScraper uses multiple sources to optimize the efficiency.

This system uses Post message holder object to store data and PostTimeline (a specialized iterator) to iterate the data objects. This difference in data structures from TwitterScraper leads to the need of different approach to implement indexing of data to ElasticSearch (currently in review process).

These are the following changes I made while implementing ‘indexing of data’ in the project.

1) Writing of data is invoked only using PostTimeline iterator

In TwitterScraper, the data is written in message holder TwitterTweet. So all the tweets are written to index as they are created. Here, when the data is scraped, Writing of the posts is initiated. Scraping of data is considered a heavy process. This approach keeps lower resource usage in average traffic on the server.

protected Post putData(Post typeArray, String key, Timeline2 postList) {
   if(!"cache".equals(this.source)) {
       postList.writeToIndex();
   }
   return this.putData(typeArray, key, postList.toArray());
}

2) One object for holding a message

During the implementation, I kept the same message holder Post and post-iterator PostTimeline from scraping to indexing of data. This helps to keep the structure uniform. Earlier approach involves different types of message wrappers in the way. This approach cuts the processes for looping and transitioning of data structures.

3) Index a list, not a message

In TwitterScraper, as the messages are enqueued in the bulk to be indexed. But in this approach, I have enqueued the complete lists. This approach delays the indexing till the scraper is done with processing the html.

Creating the queue of postlists:

// Add post-lists to queue to be indexed
queueClients.incrementAndGet();
try {
    postQueue.put(postList);
} catch (InterruptedException e) {
DAO.severe(e);
}
queueClients.decrementAndGet();

Indexing of the posts in postlists:

// Start indexing of data in post-lists
for (Timeline2 postList: postBulk) {
    if (postList.size() < 1) continue;
    if(postList.dump) {
        // Dumping of data in a file
        writeMessageBulkDump(postList);
    }
    // Indexing of data to ElasticSearch
    writeMessageBulkNoDump(postList);
}

4) Categorizing the input parameters

While searching the index, I have divided the query parameters from scraper into 3 categories. The input parameters are added to those categories (implemented using map data structure) and thus data fetched are according to them. These categories are:

// Declaring the QueryBuilder
BoolQueryBuilder query = new BoolQueryBuilder();

a) Get the parameter– Get the results for the input fields in map getMap.

// Result must have these fields. Acts as AND operator
if(getMap != null) {
    for(Map.Entry<String, String> field : getMap.entrySet()) {
        query.must(QueryBuilders.termQuery(
field.getKey(), field.getValue()));
    }
}

b) Don’t get the parameter- Don’t get the results for the input fields in map notGetMap.

// Result must not have these fields.
if(notGetMap != null) {
    for(Map.Entry<String, String> field : notGetMap.entrySet()) {
        query.mustNot(QueryBuilders.termQuery(
                field.getKey(), field.getValue()));
    }
}

c) Get if possible- Get the results with the input fields if they are present in the index.

// Result may preferably also get these fields. Acts as OR operator
if(mayAlsoGetMap != null) {
    for(Map.Entry<String, String> field : mayAlsoGetMap.entrySet()) {
        query.should(QueryBuilders.termQuery(
                field.getKey(), field.getValue()));

    }
}

By applying these changes, the scrapers are shifted from a message indexing to list of messages indexing. This way we are keeping load on RAM low, but the aggregation of latest scraped data may be affected. So there will be a need to workaround to solve this issue while scraping itself.

References

Match query with “operator”:”and” via the Java API: https://discuss.elastic.co/t/match-query-with-operator-and-via-the-java-api/67863/2
How to use BoolQueryBuilder: https://stackoverflow.com/questions/40923945/how-to-add-bool-query-inside-a-should-must-method-in-java-api

Setting Loklak Server with SSL

Post author:Vibhor Verma
Post published:September 2, 2017
Post category:FOSSASIA loklak Open Event
Post comments:0 Comments

Loklak Server is based on embedded Jetty Server which can work both with or without SSL encryption. Lately, there was need to setup Loklak Server with SSL. Though the need was satisfied by CloudFlare. Alternatively, there are 2 ways to set up Loklak Server with SSL. They are:-

1) Default Jetty Implementation

There is pre-existing implementation of Jetty libraries. The http mode can be set in configuration file. There are 4 modes on which Loklak Server can work: http mode, https mode, only https mode and redirect to https mode. Loklak Server listens to port 9000 when in http mode and to port 9443 when in https mode.

There is also a need of SSL certificate which is to be added in configuration file.

2) Getting SSL certificate with Kube-Lego on Kubernetes Deployment

I got to know about Kube-Lego by @niranjan94. It is implemented in Open-Event-Orga-Server. The approach is to use:

a) Nginx as ingress controller

For setting up Nginx ingress controller, a yml file is needed which downloads and configures the server.

The configurations for data requests and response are:

proxy-connect-timeout: "15"
 proxy-read-timeout: "600"
 proxy-send-imeout: "600"
 hsts-include-subdomains: "false"
 body-size: "64m"
 server-name-hash-bucket-size: "256"
 server-tokens: "false"

Nginx is configured to work on both http and https ports in service.yml

ports:
- port: 80
  name: http
- port: 443
  name: https

b) Kube-Lego for fetching SSL certificates from Let’s Encrypt

Kube-Lego was set up with default values in yml. It uses the host-name, email address and secretname of the deployment to validate url and fetch SSL certificate from Let’s Encrypt.

c) Setup configurations related to TLS and no-TLS connection

These configuration files mentions the path and service ports for Nginx Server through which requests are forwarded to backend Loklak Server. Here for no-TLS and TLS requests, the requests are directly forwarded to localhost at port 80 of Loklak Server container.

rules:
- host: staging.loklak.org
  http:
  paths:
  - path: /
    backend:
    serviceName: server
    servicePort: 80

For TLS requests, the secret name is also mentioned. Kube-Lego fetches host-name and secret-name from here for the certificate

tls:
- hosts:
- staging.loklak.org
  secretName: loklak-api-tls

d) Loklak Server, ElasticSearch and Mosquitto at backend

These containers work at backend. ElasticSearch and Mosquitto are only accessible to Loklak Server. Loklak Server can be accessed through Nginx server. Loklak Server is configured to work at http mode and is exposed at port 80.

ports:
- port: 80
  protocol: TCP
  targetPort: 80

To deploy the Loklak Server, all these are deployed in separate pods and they interact through service ports. To deploy, we use deploy script:

# For elasticsearch, accessible only to api-server
kubectl create -R -f ${path-to-config-file}/elasticsearch/

# For mqtt, accessible only to api-server
kubectl create -R -f ${path-to-config-file}/mosquitto/

# Start KubeLego deployment for TLS certificates
kubectl create -R -f ${path-to-config-file}/lego/
kubectl create -R -f ${path-to-config-file}/nginx/

# Create web namespace, this acts as bridge to Loklak Server
kubectl create -R -f ${path-to-config-file}/web/

# Create API server deployment and expose the services
kubectl create -R -f ${path-to-config-file}/api-server/

# Get the IP address of the deployment to be used
kubectl get services --namespace=nginx-ingress

References

kube-lego with GCE ingress controller: https://github.com/jetstack/kube-lego/tree/master/examples/gce
What’s the difference between SSL, TLS, and HTTPS: https://security.stackexchange.com/questions/5126/whats-the-difference-between-ssl-tls-and-https
Standalone HTTPS with Jetty: https://wiki.opennms.org/wiki/Standalone_HTTPS_with_Jetty

Fetching Metadata in Loklak Server

Post author:Vibhor Verma
Post published:September 1, 2017
Post category:FOSSASIA loklak
Post comments:0 Comments

In Loklak Server multiscrapers are working fine but there was a need to setup metadata framework to be embedded with the data. Metadata outputs the parameters passed, number of hits on the webpage to fetch results and number of results outputted.

There is no metadata framework for TwitterScraper. Metadata is collected but there are 2 issues:

1) the metadata fields are directly feeded while outputing data.

2) Every Scraper had different metadata fields or none.

To improve this for multiscraper system, I embedded metadata by configuring in the BaseScraper class and in PostTimeline iterator. If the metadata is directly collected in BaseScraper itself, then it will become non-of-developer-concern while working on scrapers and he can concentrate on improving scrapers.

These are the following changes I made in code:

1) Input Get-Parameters

For scrapers, one of the metadata field was input parameters. I directly added them in metadata block.

protected Post getMetadata() {
    Post metadata = new Post(true);
    metadata.put("hits", this.hits);
    metadata.put("count", this.count);
    metadata.put("scraper", this.scraperName);
    metadata.put("input_parameters", this.extra);
    return metadata;
}

2) Hits and Counts

Hits refer to number of times Loklak Server made a hit to the target website where as Counts refer to number of posts scraped by the scraper. To fetching these data was easy.

For count, I added a method putData in BaseScraper. It shall be used to create list of posts instead of directly creating the list. Here I have added counter which counts the posts.

protected Post putData(Post typeArray, String key, JSONArray postList) {
    this.count = this.count + postList.length();
    typeArray.put(key, postList);
    return typeArray;
}

For hits, I just counted the number of times the URL was fed into ClientConnection method.

public Post getDataFromConnection(String url, String type) throws IOException {
// This adds to hits count even if connection fails
    this.hits++;
    ClientConnection connection = new ClientConnection(url);
.
.
.

3) For multiscrapers in Search Endpoint

This was a bit tricky task. For creating metadata block for all the scrapers, I had to fetch metadata block of all the scrapers, process them and then output with the results. I added this to PostTimeline iterator and implemented in a loop when a scraper outputs data.

public void collectMetadata(JSONObject metadata) {
    // INITIALIZE PARAMETERS
    int hits = 0;
    int count = 0;
    Set scrapers = new HashSet<String>();

    // GET LIST OF KEYS IN SCRAPER
    List<String> listKeys = new ArrayList<String>(this.posts.keySet());
    int n = listKeys.size();

    for (int i = 0; i < n; i++) {
        // FETCH METADATA POST FROM SCRAPED DATA
        Post postMetadata = (Post) this.posts.get(listKeys.get(i)).get("metadata");
        hits = hits + Integer.parseInt(String.valueOf(postMetadata.get("hits")));
        count = count + Integer.parseInt(String.valueOf(postMetadata.get("count")));
        scrapers.add(postMetadata.get("scraper"));
    }

    // SET OUTPUT
    metadata.put("hits", hits);
    metadata.put("count", count);
    metadata.put("scraper_count", scrapers.size());
    metadata.put("scrapers", scrapers);
}

References

Crawlers and Metadata Extraction (Stuff that needs to be solved): https://vimeo.com/53109189
Why Metadata? https://www.villanovau.com/resources/bi/metadata-importance-in-data-driven-world/#.WZmbMKvhXeQ

Introducing Stream Servlet in loklak Server

Post author:Pratyush
Post published:August 26, 2017
Post category:FOSSASIA GSoC loklak Open Event Tutorial
Post comments:0 Comments

A major part of my GSoC proposal was adding stream API to loklak server. In a previous blog post, I discussed the addition of Mosquitto as a message broker for MQTT streaming. After testing this service for a few days and some minor improvements, I was in a position to expose the stream to outside users using a simple API.

In this blog post, I will be discussing the addition of /api/stream.json endpoint to loklak server.

HTTP Server-Sent Events

Server-sent events (SSE) is a technology where a browser receives automatic updates from a server via HTTP connection. The Server-Sent Events EventSource API is standardized as part of HTML5 by the W3C.

– Wikipedia

This API is supported by all major browsers except Microsoft Edge. For loklak, the plan was to use this event system to send messages, as they arrive, to the connected users. Apart from browser support, EventSource API can also be used with many other technologies too.

Jetty Eventsource Plugin

For Java, we can use Jetty’s EventSource plugin to send events to clients. It is similar to other Jetty servlets when it comes to processing the arguments, handling requests, etc. But it provides a simple interface to send events as they occur to connected users.

Adding Dependency

To use this plugin, we can add the following line to Gradle dependencies –

compile group: 'org.eclipse.jetty', name: 'jetty-eventsource-servlet', version: '1.0.0'

Creating Push Service

Creating model for Push Api Response

Writing Post request in Push Service

Pushing data into service dynamically

Testing Successful Push to Backend

Resources

1) Writing of data is invoked only using PostTimeline iterator

2) One object for holding a message

3) Index a list, not a message

4) Categorizing the input parameters

References

1) Default Jetty Implementation

2) Getting SSL certificate with Kube-Lego on Kubernetes Deployment

a) Nginx as ingress controller

References

1) Input Get-Parameters

2) Hits and Counts

3) For multiscrapers in Search Endpoint

References

HTTP Server-Sent Events

Jetty Eventsource Plugin

Adding Dependency

The Event Source

Cross Site Requests

Adding MQTT Subscriber

Connecting MQTT Stream to SSE Stream

Closing Stream on Disconnecting from User

Conclusion

Resources

References

Usage:

The configurations added in Loklak Server are:-

1) Endpoint

2) PrepareSearchUrl

3) Get-Data-From-Connection

4) Set scraper parameters input as get-parameters

5) Scrape Method

References

Initial Image

Moving to Apline

Reducing Content Size

Optimizing Number of Layers

Conclusion

Resources

Volatile Disk in Kubernetes

Persistent Disk

Rolling Updates and Persistent Disk

Persistent Volume Claims

Verifying persistence of Dumps

Conclusion

Resources

Installation and Dependency for Mosquitto

The MQTTPublisher Class

Starting Publisher with Server

Adding Mosquitto to Kubernetes

Changes in Dockerfile

Mosquitto Deployment

Exposing Mosquitto to the Cluster

Conclusion

Resources