Making loklak Server’s Kaizen Harvester Extendable

Harvesting strategies in loklak are something that the end users can’t see, but they play a vital role in deciding the way in which messages are collected with loklak. One of the strategies in loklak is defined by the Kaizen Harvester, which generates queries from collected messages.

The original strategy used a simple hash queue which drops new queries once it is full. This behaviour is not desirable, as important queries that come up late while harvesting get lost. To overcome this without losing important search queries, we needed to come up with new harvesting strategies that would provide a better approach for harvesting. In this blog post, I am discussing the changes made in the Kaizen harvester so it can be extended to create different flavors of harvesters.

What can be different in extended harvesters?

To make the Kaizen harvester extendable, we first needed to decide which parts of the original Kaizen harvester could be changed to make the strategy different (and probably better).

Since the way it stores queries to be processed is one of the most crucial parts of the Kaizen harvester, it was one of the most obvious things to change. Another thing that should be configurable across strategies is the decision of whether to harvest the queries from the query list or to grab fresh suggestions.

Query storage with KaizenQueries

To allow different methods of storing the queries, KaizenQueries class was introduced in loklak. It was configured to provide basic methods that would be required for a query storing technique to work. A query storing technique can be any data structure that we can use to store search queries for Kaizen harvester.

public abstract class KaizenQueries {

     public abstract boolean addQuery(String query);

     public abstract String getQuery();

     public abstract int getSize();

     public abstract int getMaxSize();

     public boolean isEmpty() {
         return this.getSize() == 0;
     }
}

[SOURCE]

Also, a default implementation of KaizenQueries was introduced for use in the original Kaizen harvester. It provides the same behaviour as the original queue that was used in the harvester, behind the new interface.
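
As a rough sketch (names and details here are illustrative, not necessarily the exact code in loklak), such a default implementation could simply wrap an insertion-ordered set with a fixed capacity –

public static KaizenQueries getDefaultKaizenQueries(int maxSize) {
    return new KaizenQueries() {
        // LinkedHashSet keeps insertion order and silently ignores duplicate queries
        private final Set<String> queries = new LinkedHashSet<>();

        @Override
        public boolean addQuery(String query) {
            if (query == null || this.queries.size() >= maxSize) {
                return false;  // behave like the original queue: drop queries once full
            }
            return this.queries.add(query);
        }

        @Override
        public String getQuery() {
            Iterator<String> it = this.queries.iterator();
            if (!it.hasNext()) {
                return null;
            }
            String query = it.next();
            it.remove();  // remove the query we are about to process
            return query;
        }

        @Override
        public int getSize() {
            return this.queries.size();
        }

        @Override
        public int getMaxSize() {
            return maxSize;
        }
    };
}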

Another constructor was introduced in the Kaizen harvester which allows setting the KaizenQueries instance for derived classes. It exposes KaizenQueries inside the Kaizen harvester so that any inherited strategy can use it –

private KaizenQueries queries = null;

public KaizenHarvester(KaizenQueries queries) {
    ...
    this.queries = queries;
    ...
}

public void someMethod() {
    ...
    this.queries.addQuery(someQuery);
    ...
}

[SOURCE]

With this in place, getting or adding queries became simple: we just need to use the getQuery() and addQuery() methods without worrying about the internal implementation.

Configurable decision for harvesting

As mentioned earlier, the decision taken for harvesting should also be configurable. For this, a protected method was implemented and used in harvest() method –

protected boolean shallHarvest() {
    float targetProb = random.nextFloat();
    float prob = 0.5F;
    if (this.queries.getMaxSize() > 0) {
        prob = queries.getSize() / (float)queries.getMaxSize();
    }
    return !this.queries.isEmpty() && targetProb < prob;
}

@Override
public int harvest() {
    if (this.shallHarvest()) {
        return harvestMessages();
    }

    grabSuggestions();

    return 0;
}

[SOURCE]

The shallHarvest() method can be overridden by derived classes to allow any type of harvesting decision that they want. For example, we can configure it in such a way that it harvests if there are any queries in the queue, or to use a different probability distribution instead of linear (maybe gaussian around 1).
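
For instance, a hypothetical subclass (purely an illustration, not part of the actual change) could bias the decision towards harvesting regardless of how full the queue is –

public class EagerKaizenHarvester extends KaizenHarvester {

    private final Random random = new Random();

    public EagerKaizenHarvester(KaizenQueries queries) {
        super(queries);
    }

    @Override
    protected boolean shallHarvest() {
        // harvest roughly 80% of the time, independent of the queue fill ratio
        return this.random.nextFloat() < 0.8f;
    }
}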

Conclusion

This blog post explained the changes made in loklak’s Kaizen harvester which allow other harvesting strategies to be built on top of it. It discussed the two components that were changed and how they make it easy to inherit from the original Kaizen harvester. These changes were proposed in PR loklak/loklak_server#1203 by @singhpratyush (me).


Adding React based World Mood Tracker to loklak Apps

loklak apps is a website that hosts various apps built using the loklak API. It uses static pages and angular.js to make API calls and show the results to users. As a part of my GSoC project, I had to introduce the World Mood Tracker app using loklak’s mood API. But since I had planned to work with React, I had to deviate from the typical app development track in loklak apps and integrate a React app into apps.loklak.org.

In this blog post, I will be discussing how I introduced a React based app to apps.loklak.org and solved the problem of country-wise visualisation of mood related data on a World map.

Setting up development environment inside apps.loklak.org

After following the steps to create a new app in apps.loklak.org, I needed to add proper tools and libraries for smooth development of the World Mood Tracker app. In this section, I’ll be explaining the basic configuration that made it possible for a React app to be functional in the angular environment.

Pre-requisites

The most obvious prerequisite for the project was Node.js. I used node v8.0.0 during development of the app. Instead of npm, I decided to go with yarn because of offline caching and Internet speed issues in India.

Webpack and Babel

To begin with, I initiated yarn in the app directory inside the project and added basic dependencies –

$ yarn init
$ yarn add webpack webpack-dev-server path
$ yarn add babel-loader babel-core babel-preset-es2015 babel-preset-react --dev

 

Next, I configured webpack to set an entry point and output path for the node project in webpack.config.js

const path = require('path');

module.exports = {
    entry: './js/index.js',
    output: {
        path: path.resolve('.'),
        filename: 'index_bundle.js'
    },
    ...
};

This tells webpack to use ./js/index.js as the entry point while bundling. Similarly, I configured babel for the es2015 and React presets –

{
  "presets":[
    "es2015", "react"
  ]
}

 

After this, I could define the module loaders in webpack.config.js. The loaders check for /\.js$/ and /\.jsx$/ files and assign them to babel-loader (with an exclusion of node_modules).
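
The corresponding part of webpack.config.js would look roughly like this (an illustrative sketch of the configuration described above; the exact options used in the app may differ) –

module.exports = {
    ...
    module: {
        loaders: [
            // Run .js and .jsx files through babel-loader, leaving node_modules untouched
            { test: /\.js$/, loader: 'babel-loader', exclude: /node_modules/ },
            { test: /\.jsx$/, loader: 'babel-loader', exclude: /node_modules/ }
        ]
    }
};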

React

After configuring the basic presets and loaders, I added React to dependencies of the project –

$ yarn add react react-dom

 

The React related files needed to be in the ./js/ directory so that webpack could bundle them. I used the entry file (./js/index.js) to create a simple React app –

import React from 'react';
import ReactDOM from 'react-dom';

ReactDOM.render(
    <div>World Mood Tracker</div>,
    document.getElementById('app')
);

 

At this stage, it was possible to use this app as a part of apps.loklak.org. But to do this, I first needed to compile and bundle these files so that the external app could use them.

Configuring the build target for webpack

In apps.loklak.org, each app needs to have a file named index.html in its root directory. Here, we also needed to place the bundled JS properly so it could be included in index.html at the app’s root.

HTML Webpack Plugin

Using html-webpack-plugin, I enabled automatic building of the project into the app’s root directory by using the following configuration in webpack.config.js –

...
const HtmlWebpackPlugin = require('html-webpack-plugin');
const HtmlWebpackPluginConfig = new HtmlWebpackPlugin({
    template: './js/index.html',
    filename: 'index.html',
    inject: 'body'
});
module.exports = {
    ...
    plugins: [HtmlWebpackPluginConfig]
};

 

This builds index.html at the app’s root, which is discoverable externally.

To enable bundling of the project using a simple yarn build command, the following lines were added to package.json –

{
  ..
  "scripts": {
    ..
    "build": "webpack -p"
  }
}

After a simple yarn build command, we can see the bundled js and html being created at the app root.

Using datamaps for visualization

Datamaps is a JS library which allows plotting of data on a map, using D3 as the backend. It provides a simple interface for creating visualizations and comes with a handy npm installation –

$ yarn add datamaps

Map declaration and usage as state

A map from datamaps was used as a state of the React component, which allows fast re-rendering of the map as the state of the React component changes –

export default class WorldMap extends React.Component {
    constructor() {
        super();
        this.state = {
            map: null
        };
    }
    render() {
        return (<div className={styles.container}>
                <div id="map-container"></div></div>)
    }
    componentDidMount() {
        this.setState({map: new Datamap({...})});
    }
    ...
}

 

The declaration of the map goes in the componentDidMount method because the map can only be drawn after the component has mounted; before that, the div with id="map-container" is not present in the DOM and the map initialisation would fail.

Defining data for countries

Data for every country had two components –

data = {
    positiveScore: someValue,
    negativeScore: someValue
}

 

This data is used to generate popups for the countries –

this.setState({
    map: new Datamap({
        ...
        geographyConfig: {
            ...
            popupTemplate: function (geo, data) {
                // Configure variables pScore so that it gives “No Data” when data.positiveScore is not set (similar for negative)
                return [
                    // Use pScore and nScore to generate results here
                    // geo.properties.name would give current country name
                ].join('');
            }
        }
    })
});

The result for countries with unknown data values looks something like this –

Conclusion

In this blog post, I explained how a React based app was introduced in apps.loklak.org’s angular based environment. I discussed the setup and bundling process of the project so that it becomes available from the project’s external HTTP server.

I also discussed using datamaps as a visualisation tool for data about Tweets. The app was first introduced in pull request fossasia/apps.loklak.org#189 and was improved step by step in subsequent patches.


Introducing Priority Kaizen Harvester for loklak server

In the previous blog post, I discussed the changes made in loklak’s Kaizen harvester so it could be extended and other harvesting strategies could be introduced. Those changes made it possible to introduce a new harvesting strategy as PriorityKaizen harvester which uses a priority queue to store the queries that are to be processed. In this blog post, I will be discussing the process through which this new harvesting strategy was introduced in loklak.

Background, motivation and approach

Before jumping into the changes, we first need to understand why we need this new harvesting strategy. Let us start by discussing the issue with the Kaizen harvester.

The producer-consumer imbalance in Kaizen harvester

Kaizen uses a simple hash queue to store queries. When the queue is full, new queries are dropped. But the number of queries produced after searching for one query is much higher than the consumption rate, i.e. the queries are bound to overflow and new queries that arrive get dropped. (See loklak/loklak_server#1156)

Learnings from attempt to add blocking queue for queries

As a solution to this problem, I first tried to use a blocking queue to store the queries. In this implementation, the producers would get blocked from putting queries in the queue if it is full, and would wait until there is space for more. This way, we would have a good balance between consumers and producers, as the producers would be waiting until the consumers free up space for them –

public class BlockingKaizenHarvester extends KaizenHarvester {
   ...
   public BlockingKaizenHarvester() {
       super(new KaizenQueries() {
           ...
           private BlockingQueue<String> queries = new ArrayBlockingQueue<>(maxSize);

           @Override
           public boolean addQuery(String query) {
               if (this.queries.contains(query)) {
                   return false;
               }
               try {
                   this.queries.offer(query, this.blockingTimeout, TimeUnit.SECONDS);
                   return true;
               } catch (InterruptedException e) {
                   DAO.severe("BlockingKaizen Couldn't add query: " + query, e);
                   return false;
               }
           }
           @Override
           public String getQuery() {
               try {
                   return this.queries.take();
               } catch (InterruptedException e) {
                   DAO.severe("BlockingKaizen Couldn't get any query", e);
                   return null;
               }
           }
           ...
       });
   }
}

[SOURCE, loklak/loklak_server#1210]

But there is an issue here. The consumers are themselves producers, at an even higher rate. When a search is performed, the resulting queries are requested to be appended to the harvester’s KaizenQueries instance (which here is backed by a blocking queue). Now let us consider the case where the queue is full and a thread takes a query from the queue and scrapes data. When the scraping is finished, many new queries are requested to be inserted, but most of them get blocked (because the queue would be full again after one query gets inserted).

Therefore, using a blocking queue in KaizenQueries is not a good thing to do.

Other considerations

After the failure of introducing the Blocking Kaizen harvester, we looked for other alternatives for storing queries. We came across multilevel queues, persistent disk queues and priority queues.

Multilevel queues sounded like a good idea at first, where we would have multiple queues for storing queries. But eventually, this would just boil down to how much total queue size we are allowing, and the queries would eventually get dropped.

Persistent disk queues would allow us to store a greater number of queries, but the major disadvantage was lookup time. It would be terribly slow to check if a query already exists in the disk queue when the queue is large. Also, since the number of queries would keep increasing in practice, the disk queue would also go out of hand at some point in time.

So by now, we were clear that dropping queries is unavoidable. What we had to do was use the limited size queue smartly so that we do not drop queries that are important.

Solution: Priority Queue

So a good solution to our problem was a priority queue. We could assign a higher score to queries that come from more popular Tweets; they would then sit higher in the queue and not drop off until even higher priority queries arrive in the queue.

Assigning score to a Tweet

Score for a tweet was decided using the following formula –

α = 5 * (retweet count) + (favourite count)

score = α / (α + 10 * exp(-0.01 * α))

This equation generates a score between zero and one from the retweet and favourite count of a Tweet. This normalisation of score would ensure we do not assign an insanely large score to Tweets with a high retweet and favourite count. You can see the behaviour for the second mentioned equation here.
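
As a rough sketch, the same computation can be written in Java as follows (the method and variable names are illustrative and not taken from the actual patch) –

private static double getScore(long retweetCount, long favouriteCount) {
    double alpha = 5 * retweetCount + favouriteCount;
    // normalises the result to a value between 0 and 1:
    // 0 for an unpopular Tweet, approaching 1 for a very popular one
    return alpha / (alpha + 10 * Math.exp(-0.01 * alpha));
}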


Changes required in existing Kaizen harvester

To take the score into account, it became necessary to extend the interface so that a score can also be passed as a parameter to the addQuery() method in KaizenQueries. Also, not all queries can have a score associated with them; for example, if we add a query that searches for Tweets older than the oldest one in the current timeline, giving it a score wouldn’t be possible as it is not associated with a single Tweet. To tackle this, a default score of 0.5 is given to such queries –

public abstract class KaizenQueries {

   public boolean addQuery(String query) {
       return this.addQuery(query, 0.5);
   }

   public abstract boolean addQuery(String query, double score);
   ...
}

[SOURCE]

Defining appropriate KaizenQueries object

The KaizenQueries object for a priority queue had to define a wrapper class that would hold the query and its score together so that they could be inserted in a queue as a single object.

ScoreWrapper and comparator

The ScoreWrapper is a simple class that stores score and query object together –

private class ScoreWrapper {

   private double score;
   private String query;

   ScoreWrapper(String m, double score) {
       this.query = m;
       this.score = score;
   }

}

[SOURCE]

In order to define a way to sort the ScoreWrapper objects in the priority queue, we need to define a Comparator for it –

private Comparator<ScoreWrapper> scoreComparator = (scoreWrapper, t1) -> Double.compare(t1.score, scoreWrapper.score);  // head of the queue = highest score

[SOURCE]

Putting things together

Now that we have all the ingredients to declare our priority queue, we can also declare the strategy to getQuery and putQuery in the corresponding KaizenQueries object –

public class PriorityKaizenHarvester extends KaizenHarvester {

   private static class PriorityKaizenQueries extends KaizenQueries {
       ...
       private Queue<ScoreWrapper> queue;
       private int maxSize;

       public PriorityKaizenQueries(int size) {
           this.maxSize = size;
           queue = new PriorityQueue<>(size, scoreComparator);
       }

       @Override
       public boolean addQuery(String query, double score) {
           ScoreWrapper sw = new ScoreWrapper(query, score);
           if (this.queue.contains(sw)) {
               return false;
           }
           try {
               this.queue.add(sw);
               return true;
           } catch (IllegalStateException e) {
               return false;
           }
       }

       @Override
       public String getQuery() {
           return this.queue.poll().query;
       }
       ...
}

[SOURCE]

Conclusion

In this blog post, I discussed the process in which PriorityKaizen harvester was introduced to loklak. This strategy is a flavour of Kaizen harvester which uses a priority queue to store queries that are to be processed. These changes were possible because of a previous patch which allowed extending of Kaizen harvester.

The changes were introduced in pull request loklak/loklak#1240 by @singhpratyush (me).


Fetching URL for Embedded Twitter Videos in loklak server

The primary web service that loklak scrapes is Twitter. Being a news and social networking service, Twitter allows its users to post videos directly to Twitter, and videos often convey more than text alone can. But for an automated scraper, getting the video links is not a simple task.

Let us see what problems we faced with videos and how we solved them in the loklak server project.

Previous setup and embedded videos

In the previous version of loklak server, the TwitterScraper searched for videos in 2 ways –

  1. Youtube links
  2. HTML5 video links

To fetch the video URL from an HTML5 video tag, the following snippet was used –

if ((p = input.indexOf("<source video-src")) >= 0 && input.indexOf("type=\"video/") > p) {
   String video_url = new prop(input, p, "video-src").value;
   videos.add(video_url);
   continue;
}

Here, input is the current line from raw HTML that is being processed and prop is a class defined in loklak that is useful in parsing HTML attributes. So in this way, the HTML5 videos were extracted.

The Problem – Embedded videos

Though the previous setup had no issues with plain HTML5 videos, it was of little use for Twitter’s own videos, as Twitter embeds them in an iFrame and therefore they can’t be fetched using simple HTML5 tag extraction.

If we take the following Tweet for example,

the requested HTML from the search page contains the video in the following format –

<iframe src="https://twitter.com/i/videos/tweet/881946694413422593?embed_source=clientlib&player_id=0&rpc_init=1" allowfullscreen="" id="player_tweet_881946694413422593" style="width: 100%; height: 100%; position: absolute; top: 0; left: 0;"></iframe>

So we needed to come up with a better technique to get those videos.

Parsing video URL from iFrame

The <div> which contains video is marked with AdaptiveMedia-videoContainer class. So if a Tweet has an iFrame containing video, it will also have the mentioned class.

Also, the source of iFrame is of the form https://twitter.com/i/videos/tweet/{Tweet-ID}. So now we can programmatically go to any Tweet’s video and parse it to get results.

Extracting video URL from iFrame source

Now that we have the source of iFrame, we can easily get the video source using the following flow –

public final static Pattern videoURL = Pattern.compile("video_url\\\":\\\"(.*?)\\\"");

private static String[] fetchTwitterIframeVideos(String iframeURL) {
   // Read from iframeURL line by line into BufferedReader br
   String line;
   while ((line = br.readLine()) != null) {
       int index;
       if ((index = line.indexOf("data-config=")) >= 0) {
           String jsonEscHTML = (new prop(line, index, "data-config")).value;
           String jsonUnescHTML = HtmlEscape.unescapeHtml(jsonEscHTML);
           Matcher m = videoURL.matcher(jsonUnescHTML);
           if (!m.find()) {
               return new String[]{};
           }
           String url = m.group(1);
           url = url.replace("\\/", "/");  // Clean URL
           /*
            * Play with url and return results
            */
       }
   }
   return new String[]{};  // no embedded video found
}

MP4 and M3U8 URLs

If we encounter mp4 URLs, we’re fine as it is the direct link to video. But if we encounter m3u8 URL, we need to process it further before we can actually get to the videos.

For Twitter, the hosted m3u8 files contain links to further m3u8 playlists of different resolutions. These m3u8 playlists in turn contain links to various .ts files that hold the actual video in segments of 3 seconds each, to support a better streaming experience on the web.

To resolve videos in such a setup, we need to recursively parse m3u8 files and collect all the .ts videos.

private static String[] extractM3u8(String url) {
   return extractM3u8(url, "https://video.twimg.com/");
}

private static String[] extractM3u8(String url, String baseURL) {
   List<String> links = new ArrayList<>();
   // Read from baseURL + url line by line into BufferedReader br
   String line;
   while ((line = br.readLine()) != null) {
       if (line.startsWith("#")) {  // Skip comments in m3u8
           continue;
       }
       String currentURL = (new URL(new URL(baseURL), line)).toString();
       if (currentURL.endsWith(".m3u8")) {
           String[] more = extractM3u8(currentURL, baseURL);  // Recursively add all
           Collections.addAll(links, more);
       } else {
           links.add(currentURL);
       }
   }
   return links.toArray(new String[links.size()]);
}

And then in fetchTwitterIframeVideos, we can return all the .ts URLs for the video –

if (url.endsWith(".mp4")) {
   return new String[]{url};
} else if (url.endsWith(".m3u8")) {
   return extractM3u8(url);
}

Putting things together

Finally, the TwitterScraper can discover the video links by tweaking a little –

if (input.indexOf("AdaptiveMedia-videoContainer") > 0) {
   // Fetch Tweet ID
   String tweetURL = props.get("tweetstatusurl").value;
   int slashIndex = tweetURL.lastIndexOf('/');
   if (slashIndex < 0) {
       continue;
   }
   String tweetID = tweetURL.substring(slashIndex + 1);
   String iframeURL = "https://twitter.com/i/videos/tweet/" + tweetID;
   String[] videoURLs = fetchTwitterIframeVideos(iframeURL);
   Collections.addAll(videos, videoURLs);
}

Conclusion

This blog post explained the process of extracting video URLs from Twitter and the problems faced. The discussed change enabled loklak to extract and serve URLs to videos for Tweets. It was introduced in PR loklak/loklak_server#1193 by me (@singhpratyush).

The service was further enhanced to collect single mp4 link for videos (see PR loklak/loklak_server#1206), which is discussed in another blog post.


Fetching URL for Complete Twitter Videos in loklak server

In the previous blog post, I discussed how to fetch the URLs for Twitter videos in parts (.ts extension). But getting a video in parts is not beneficial, as loklak users have to carry out the following tasks in order to make sense of it:

  • Placing the videos in correct order (the videos are divided into 3-second sections).
  • Having proper libraries and video player to play the .ts extension.

This would require fairly complex loklak clients and hence the requirement was to have complete video in a single link with a popular extension. In this blog post, I’ll be discussing how I managed to get links to complete Twitter videos.

Guests and Twitter Videos

Most of the content on Twitter is publicly accessible and we don’t need an account to access it. And this public content includes videos too. So, there should be some way in which Twitter would be handling guest users and serving them the videos. We needed to replicate the same flow in order to get links to those videos.

Problem with Twitter video and static HTML

In Twitter, the videos are not served with the static HTML of a page. It is generally rendered using a front-end JavaScript framework. Let us take an example of mobile.twitter.com website.

Let us consider the video from a tweet of @HiHonourIndia

We can see that the page is rendered using ReactJS and we also have the direct link for the video –

“So what’s the problem then? We can just request the web page and parse HTML to get video link, right?”

Wrong. As I mentioned earlier, the pages are rendered using React and when we initially request it, it looks something like this –

The HTML contains no link to video whatsoever, and keeping in mind that we would be getting the previously mentioned HTML, the scraper wouldn’t be getting any video link either.

We, therefore, need to mimic the flow which is followed internally in the web app to get the video link and play them.

Mimicking the flow of Twitter Mobile to get video links

After tracking the XHR requests made by the Twitter Mobile web app, one can come up with the following flow to get video URLs.

Mobile URL for a Tweet

Getting mobile URL for a tweet is very simple –

String mobileUrl = "https://mobile.twitter.com" + tweetUrl;

Here, tweetUrl is of the form /user/tweetID.

Guest Token and Bearer JS URL

The bearer JS is a file which contains the Bearer Token. Along with a Guest Token, it is used to authenticate against the Twitter API to get details about a conversation. The guest token and bearer script URL can be extracted from the static mobile page –

Pattern bearerJsUrlRegex = Pattern.compile("showFailureMessage\\(\\'(.*?main.*?)\\'\\);");
Pattern guestTokenRegex = Pattern.compile("document\\.cookie \\= decodeURIComponent\\(\\\"gt\\=([0-9]+);");
String bearerJsUrl = null;
String guestToken = null;
ClientConnection conn = new ClientConnection(mobileUrl);
BufferedReader br = new BufferedReader(new InputStreamReader(conn.inputStream, StandardCharsets.UTF_8));
String line;
while ((line = br.readLine()) != null) {
   if (bearerJsUrl != null && guestToken != null) {
       // Both the entities are found
       break;
   }
   if (line.length() == 0) {
       continue;
   }
   Matcher m = bearerJsUrlRegex.matcher(line);
   if (m.find()) {
       bearerJsUrl = m.group(1);
       continue;
   }
   m = guestTokenRegex.matcher(line);
   if (m.find()) {
       guestToken = m.group(1);
   }
}

[SOURCE]

Getting Bearer Token from Bearer JS URL

The following simple method can be used to fetch the Bearer Token from URL –

private static final Pattern bearerTokenRegex = Pattern.compile("BEARER_TOKEN:\\\"(.*?)\\\"");
private static String getBearerTokenFromJs(String jsUrl) throws IOException {
   ClientConnection conn = new ClientConnection(jsUrl);
   BufferedReader br = new BufferedReader(new InputStreamReader(conn.inputStream, StandardCharsets.UTF_8));
   String line = br.readLine();
   Matcher m = bearerTokenRegex.matcher(line);
   if (m.find()) {
       return m.group(1);
   }
   throw new IOException("Couldn't get BEARER_TOKEN");
}

[SOURCE]

Using the Guest Token and Bearer Token to get Video Links

The following method demonstrates the process of getting video links once we have all the required information –

private static String[] getConversationVideos(String tweetId, String bearerToken, String guestToken) throws IOException {
   String conversationApiUrl = "https://api.twitter.com/2/timeline/conversation/" + tweetId + ".json";
   CloseableHttpClient httpClient = getCustomClosableHttpClient(true);
   HttpGet req = new HttpGet(conversationApiUrl);
   req.setHeader("User-Agent", "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.98 Safari/537.36");
   req.setHeader("Authorization", "Bearer " + bearerToken);
   req.setHeader("x-guest-token", guestToken);
   HttpEntity entity = httpClient.execute(req).getEntity();
   String html = getHTML(entity);
   consumeQuietly(entity);
   try {
       JSONArray arr = (new JSONObject(html)).getJSONObject("globalObjects").getJSONObject("tweets")
               .getJSONObject(tweetId).getJSONObject("extended_entities").getJSONArray("media");
       JSONObject obj2 = (JSONObject) arr.get(0);
       JSONArray videos = obj2.getJSONObject("video_info").getJSONArray("variants");
       ArrayList<String> urls = new ArrayList<>();
       for (int i = 0; i < videos.length(); i++) {
           String url = ((JSONObject) videos.get(i)).getString("url");
           urls.add(url);
       }
       return urls.toArray(new String[urls.size()]);
   } catch (JSONException e) {
       // This is not an issue. Sometimes, there are videos in long conversations but other ones get media class
       //  div, so this fetching process is triggered.
   }
   return new String[]{};
}

[SOURCE]

Checking if a Tweet contains video

If a tweet contains a video, we can add the following lines to recognise it in TwitterScraper.java

if (input.indexOf("AdaptiveMedia-videoContainer") > 0) {
   // Do necessary things
}

[SOURCE]

Limitations

Though this method successfully extracts the video links to complete Twitter videos, it makes the scraping process very slow. This is because, for every tweet that contains a video, three HTTP requests are made in order to finalise the tweet. And keeping in mind that there are up to 20 Tweets per search from Twitter, we get instances where more than 10 of them are videos (30 HTTP requests). Also, there is a lot of JSON and regex processing involved which adds a little to the whole “slow down” thing.

Conclusion

This post explained how loklak server was improved to fetch links to complete videos from Twitter and the exact flow of requests required to achieve this. The changes were proposed in pull request loklak/loklak_server#1206.


URL Unshortening in Java for loklak server

There are many URL shortening services on the internet. They are useful in converting really long URLs to shorter ones. But apart from redirecting to a longer URL, they are often used to track the people visiting those links.

One of the components of loklak server is its URL unshortening and redirect resolution service, which ensures that websites can’t track users through those links and enhances privacy protection. Let us see how this service works in loklak.

Redirect Codes in HTTP

Various standards define 3XX status codes as an indication that the client must perform additional actions to complete the request. These response codes range from 300 to 308, based on the type of redirection.

To check the redirect code of a request, we must first make a request to some URL –

String urlstring = "http://tinyurl.com/8kmfp";
HttpRequestBase req = new HttpGet(urlstring);

Next, we will configure this request to disable redirects and add a nice User-Agent so that websites do not block us as a robot –

req.setConfig(RequestConfig.custom().setRedirectsEnabled(false).build());
req.setHeader("User-Agent", "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.98 Safari/537.36");

Now we need a HTTP client to execute this request. Here, we will use Apache’s CloseableHttpClient

CloseableHttpClient httpClient = HttpClients.custom()
                                   .setConnectionManager(getConnctionManager(true))
                                   .setDefaultRequestConfig(defaultRequestConfig)
                                   .build();

The getConnctionManager returns a pooling connection manager that can reuse the existing TCP connections, making the requests very fast. It is defined in org.loklak.http.ClientConnection.

Now we have a client and a request. Let’s make our client execute the request and we shall get an HTTP entity on which we can work.

HttpResponse httpResponse = httpClient.execute(req);
HttpEntity httpEntity = httpResponse.getEntity();

Now that we have executed the request, we can check the status code of the response by calling the corresponding method –

if (httpEntity != null) {
   int httpStatusCode = httpResponse.getStatusLine().getStatusCode();
   System.out.println("Status code - " + httpStatusCode);
} else {
   System.out.println("Request failed");
}

Hence, we have the HTTP code for the requests we make.

Getting the Redirect URL

We can simply check for the value of the status code and decide whether we have a redirect or not. In the case of a redirect, we can check for the “Location” header to know where it redirects.

if (300 <= httpStatusCode && httpStatusCode <= 308) {
   for (Header header: httpResponse.getAllHeaders()) {
       if (header.getName().equalsIgnoreCase("location")) {
           redirectURL = header.getValue();
       }
   }
}

Handling Multiple Redirects

We now know how to get the redirect for a URL. But in many cases, URLs redirect multiple times before reaching the final, stable location. To handle these situations, we can repeatedly fetch the redirect URL for intermediate links until the URL stops changing. But we also need to take care of cyclic redirects, so we set a threshold on the number of redirects we follow –

String urlstring = "http://tinyurl.com/8kmfp";
int termination = 10;
while (termination-- > 0) {
   String unshortened = getRedirect(urlstring);
   if (unshortened.equals(urlstring)) {
       return urlstring;
   }
   urlstring = unshortened;
}

Here, getRedirect is the method which performs a single redirect for a URL and returns the same URL in case of a non-redirect status code.
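
Putting the pieces above together, a minimal single-step getRedirect could look like the following sketch (assuming the httpClient built earlier; the actual loklak implementation additionally handles meta refresh, which is described next) –

String getRedirect(String urlstring) throws IOException {
    HttpRequestBase req = new HttpGet(urlstring);
    req.setConfig(RequestConfig.custom().setRedirectsEnabled(false).build());
    req.setHeader("User-Agent", "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.98 Safari/537.36");
    HttpResponse httpResponse = httpClient.execute(req);  // the CloseableHttpClient built earlier
    HttpEntity httpEntity = httpResponse.getEntity();
    if (httpEntity == null) {
        return urlstring;  // request failed, keep the original URL
    }
    int httpStatusCode = httpResponse.getStatusLine().getStatusCode();
    if (300 <= httpStatusCode && httpStatusCode <= 308) {
        for (Header header : httpResponse.getAllHeaders()) {
            if (header.getName().equalsIgnoreCase("location")) {
                EntityUtils.consumeQuietly(httpEntity);
                return header.getValue();  // single-step redirect target
            }
        }
    }
    EntityUtils.consumeQuietly(httpEntity);
    return urlstring;  // not a redirect (or no Location header)
}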

Redirect with non-3XX HTTP status – meta refresh

In addition to performing redirects through 3XX codes, some websites also contain a <meta http-equiv="refresh" … > which performs an unconditional redirect from the client side. To detect these types of redirects, we need to look into the HTML content of a response and parse the URL from it. Let us see how –

String getMetaRedirectURL(HttpEntity httpEntity) throws IOException {
   StringBuilder sb = new StringBuilder();
   BufferedReader reader = new BufferedReader(new InputStreamReader(httpEntity.getContent()));
   String content = null;
   while ((content = reader.readLine()) != null) {
       sb.append(content);
   }
   String html = sb.toString();
   html = html.replace("\n", "");
   if (html.length() == 0)
       return null;
   int indexHttpEquiv = html.toLowerCase().indexOf("http-equiv=\"refresh\"");
   if (indexHttpEquiv < 0) {
       return null;
   }
   html = html.substring(indexHttpEquiv);
   int indexContent = html.toLowerCase().indexOf("content=");
   if (indexContent < 0) {
       return null;
   }
   html = html.substring(indexContent);
   int indexURLStart = html.toLowerCase().indexOf(";url=");
   if (indexURLStart < 0) {
       return null;
   }
   html = html.substring(indexURLStart + 5);
   int indexURLEnd = html.toLowerCase().indexOf("\"");
   if (indexURLEnd < 0) {
       return null;
   }
   return html.substring(0, indexURLEnd);
}

This method tries to find the URL from meta tag and returns null if it is not found. This can be called in case of non-redirect status code as a last attempt to fetch the URL –

String getRedirect(String urlstring) throws IOException {
   ...
   if (300 <= httpStatusCode && httpStatusCode <= 308) {
       ...
   } else {
       String metaURL = getMetaRedirectURL(httpEntity);
       EntityUtils.consumeQuietly(httpEntity);
       if (metaURL != null) {
           if (!metaURL.startsWith("http")) {
               URL u = new URL(new URL(urlstring), metaURL);
               return u.toString();
           }
           return metaURL;
       }
        return urlstring;
   }
   ...
}

In this implementation, we can see that there is a check for metaURL starting with http because there may be relative URLs in the meta tag. The java.net.URL library is used to create a final URL string from the relative URL. It can handle all the possibilities of a valid relative URL.
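
For example, a relative target from a meta tag on a hypothetical page resolves against the original URL like this –

URL base = new URL("http://example.com/some/page.html");
URL resolved = new URL(base, "../other/target.html");
System.out.println(resolved);  // prints http://example.com/other/target.html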

Conclusion

This blog post explains the resolution of shortened and redirected URLs in Java. It covers defining requests, executing them using an HTTP client and processing the resultant response to get a redirect URL. It also explains how to perform these operations repeatedly to resolve multiple shortenings/redirects and, finally, how to fetch the redirect URL from the meta tag.

Loklak uses this inbuilt URL unshortening service to resolve redirects for a URL. If you find this blog post interesting, please take a look at the URL unshortening service of loklak.

Improving Harvesting Decision for Kaizen Harvester in loklak server

About Kaizen Harvester

Kaizen is an alternative approach to harvesting in loklak. It focuses on query and information collection to generate more queries from collected timelines. It maintains a queue of queries that is populated by extracting the following information from timelines –

  1. Hashtags in Tweets
  2. User mentions in Tweets
  3. Tweets from areas near each Tweet in the timeline.
  4. Tweets older than the oldest Tweet in the timeline.

Further, it can also utilise the Twitter API to get trending keywords from Twitter and get search suggestions from other loklak peers.

It was introduced by @yukiisbored in pull request loklak/loklak_server#960.

The Problem: Unbiased Harvesting Decision

The Kaizen harvester either searches for queries from the queue, or tries to grab trending queries (using Twitter API or from backend). In the previous version of KaizenHarvester, the decision of “harvesting vs. info-grabbing” was taken based on the value from a random boolean generator –

@Override
public int harvest() {
   if (!queries.isEmpty() && random.nextBoolean())
       return harvestMessages();

   grabSuggestions();

   return 0;
}

[SOURCE]

In sane situations, the Kaizen harvester is configured to use a fixed size queue and to drop the queries which are requested to be added once the queue is full. And since the decision doesn’t take into account how full the queue is, it would often call the grabSuggestions() method.

But since the queue would be full, the grabbed suggestions would simply be lost. This results in a waste of time and resources in fetching the suggestions (from the backend or the API). To overcome this, something better had to be done in this part.

The Solution: Making Decision Biased

To solve the problem of the dumb harvesting decision, the decision was changed so that it is taken using the following steps –

  1. Calculate the ratio of queue filled (q.size() / q.maxSize()).
  2. Generate a random floating point number between 0 and 1.
  3. If the number is less than the fraction, harvest. Otherwise get harvesting suggestions.

Why would this work?

Initially, when the queue is mostly empty, the ratio would be a small number. So, it would be highly probable that a random number generated between 0 and 1 would be greater than the ratio. And Kaizen would go for grabbing search suggestions.

If this ratio is large (i.e. the queue is almost full), it would be highly likely that the random number generated would be less than it, making it more likely to search for results instead of grabbing suggestions.


The following graph shows how the harvester decision would change. It performs 10k iterations for a given queue ratio and plots the number of times harvesting decision was taken.
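
Such a simulation is easy to reproduce with a few lines of Java (an illustrative sketch, not the code used to generate the original graph) –

Random random = new Random();
for (int step = 0; step <= 10; step++) {
    float ratio = step / 10.0f;  // fraction of the queue that is filled
    int harvested = 0;
    for (int i = 0; i < 10000; i++) {
        if (random.nextFloat() < ratio) {
            harvested++;  // the biased decision chose harvesting
        }
    }
    System.out.println("queue ratio " + ratio + " -> harvested " + harvested + " times out of 10000");
}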

Change in code

The harvest() method was changed in loklak/loklak_server#1158 to take a smart decision of harvesting vs. info-grabbing in the following manner –

@Override
public int harvest() {
   float targetProb = random.nextFloat();
   float prob = 0.5F;
   if (QUERIES_LIMIT > 0) {
       prob = queries.size() / (float)QUERIES_LIMIT;
   }
   if (!queries.isEmpty() && targetProb < prob) {
       return harvestMessages();
   }

   grabSuggestions();

   return 0;
}

[SOURCE]

Conclusion

This change enhanced the Kaizen harvester and made it more sensitive to how fast its queue is filling. There are no more requests made to the backend for suggestions whose queries would not get added to the queue anyway.

 


Using NodeBuilder to instantiate node based Elasticsearch client and Visualising data

As elastic.co mentions, Elasticsearch is a distributed, RESTful search and analytics engine capable of solving a growing number of use cases. But in many setups, it is not possible to manually install an Elasticsearch node on a machine. To handle these types of scenarios, Elasticsearch provides the NodeBuilder module, which can be used to spawn an Elasticsearch node programmatically. Let’s see how.

Getting Dependencies

In order to get the ES Java API, we need to add the following line to dependencies.

compile group: 'org.elasticsearch', name: 'securesm', version: '1.0'

The required packages will be fetched the next time we run gradle build.

Configuring Settings

In the Elasticsearch Java API, Settings are used to configure the node(s). To create a node, we first need to define its properties.

Settings.Builder settings = new Settings.Builder();

settings.put("cluster.name", "cluster_name");  // The name of the cluster

// Configuring HTTP details
settings.put("http.enabled", "true");
settings.put("http.cors.enabled", "true");
settings.put("http.cors.allow-origin", "https?:\/\/localhost(:[0-9]+)?/");  // Allow requests from localhost
settings.put("http.port", "9200");

// Configuring TCP and host
settings.put("transport.tcp.port", "9300");
settings.put("network.host", "localhost");

// Configuring node details
settings.put("node.data", "true");
settings.put("node.master", "true");

// Configuring index
settings.put("index.number_of_shards", "8");
settings.put("index.number_of_replicas", "2");
settings.put("index.refresh_interval", "10s");
settings.put("index.max_result_window", "10000");

// Defining paths
settings.put("path.conf", "/path/to/conf/");
settings.put("path.data", "/path/to/data/");
settings.put("path.home", "/path/to/data/");

settings.build();  // Build with the assigned configurations

There are many more settings that can be tuned in order to get desired node configuration.

Building the Node and Getting Clients

The Java API makes it very simple to launch an Elasticsearch node. This example will make use of settings that we just built.

Node elasticsearchNode = NodeBuilder.nodeBuilder().local(false).settings(settings).node();

A piece of cake. Isn’t it? Let’s get a client now, on which we can execute our queries.

Client elasticsearchClient = elasticsearchNode.client();
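
With a client in hand, we can run operations against the node. For example, a quick sanity check could index a document and read it back (the index and type names here are made up for illustration, and this uses the Java client API of that Elasticsearch generation) –

// index a document into a test index
elasticsearchClient.prepareIndex("messages", "message", "1")
        .setSource("{\"text\": \"hello from loklak\"}")
        .execute().actionGet();

// fetch the same document back
GetResponse response = elasticsearchClient.prepareGet("messages", "message", "1")
        .execute().actionGet();
System.out.println(response.getSourceAsString());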

Shutting Down the Node

elasticsearchNode.close();

A nice implementation of using the module can be seen at ElasticsearchClient.java in the loklak project. It uses the settings from a configuration file and builds the node using it.


Visualisation using elasticsearch-head

So by now, we have an Elasticsearch client which is capable of doing all sorts of operations on the node. But how do we visualise the data that is being stored? Writing code and running it every time to check results is tedious and significantly slows down the development/debugging cycle.

To overcome this, we have a web frontend called elasticsearch-head which lets us execute Elasticsearch queries and monitor the cluster.

To run elasticsearch-head, we first need to have grunt-cli installed –

$ sudo npm install -g grunt-cli

Next, we will clone the repository using git and install dependencies –

$ git clone git://github.com/mobz/elasticsearch-head.git
$ cd elasticsearch-head
$ npm install

Next, we simply need to run the server and go to indicated address on a web browser –

$ grunt server

At the top, enter the location at which elasticsearch-head can interact with the cluster and click Connect.

Upon connecting, the dashboard appears, showing the status of the cluster –

The dashboard shown above is from the loklak project (more about it below).

There are 5 major sections in the UI –
1. Overview: The above screenshot, gives details about the indices and shards of the cluster.
2. Index: Gives an overview of all the indices. Also allows adding new indices from the UI.
3. Browser: Gives a browser window for all the documents in the cluster. It looks something like this –

The left pane allows us to set the filter (index, type and field). The table listed is sortable. But we don’t always get what we are looking for manually. So, we have the following two sections.
4. Structured Query: Gives a dead simple UI that can be used to make a well structured request to Elasticsearch. This is what we need to search for to get Tweets from @gsoc that are indexed –

5. Any Request: Gives an advanced console that allows executing any query allowable by the Elasticsearch API.

A little about the loklak project and Elasticsearch

loklak is a server application which is able to collect messages from various sources, including twitter. The server contains a search index and a peer-to-peer index sharing interface. All messages are stored in an elasticsearch index.

Source: github/loklak/loklak_server

The project uses Elasticsearch to index all the data that it collects. It uses NodeBuilder to create an Elasticsearch node and process the index. It is flexible enough to join an existing cluster instead of creating a new one, just by changing the configuration file.

Conclusion

This blog post tries to explain how NodeBuilder can be used to create Elasticsearch nodes and how they can be configured using Elasticsearch Settings.

It also demonstrates the installation and basic usage of elasticsearch-head, which is a great library to visualise and check queries against an Elasticsearch cluster.

The official Elasticsearch documentation is a good source of reference for its Java API and all other aspects.