Automatic Imports of Events to Open Event from online event sites with Query Server and Event Collect

One goal for the next version of the Open Event project is to allow an automatic import of events from various event listing sites. We will implement this using Open Event Import APIs and two additional modules: Query Server and Event Collect. The idea is to run the modules as micro-services or as stand-alone solutions.

Query Server
The query server is, as the name suggests, a query processor. As we are moving towards an API-centric approach for the server, query-server also has API endpoints (v1). Using this API you can get the data from the server in the mentioned format. The API itself is quite intuitive.

API to get data from query-server

GET /api/v1/search/<search-engine>/query=query&format=format

Sample Response Header

 Cache-Control: no-cache
 Connection: keep-alive
 Content-Length: 1395
 Content-Type: application/xml; charset=utf-8
 Date: Wed, 24 May 2017 08:33:42 GMT
 Server: Werkzeug/0.12.1 Python/2.7.13
 Via: 1.1 vegur

The server is built in Flask. The GitHub repository of the server contains a simple Bootstrap front-end, which is used as a testing ground for results. The query string calls the search engine result scraper scraper.py that is based on the scraper at searss. This scraper takes search engine, presently Google, Bing, DuckDuckGo and Yahoo as additional input and searches on that search engine. The output from the scraper, which can be in XML or in JSON depending on the API parameters is returned, while the search query is stored into MongoDB database with the query string indexing. This is done keeping in mind the capabilities to be added in order to use Kibana analyzing tools.

The frontend prettifies results with the help of PrismJS. The query-server will be used for initial listing of events from different search engines. This will be accessed through the following API.

The query server app can be accessed on heroku.

➢ api/list​: To provide with an initial list of events (titles and links) to be displayed on Open Event search results.

When an event is searched on Open Event, the query is passed on to query-server where a search is made by calling scraper.py with appending some details for better event hunting. Recent developments with Google include their event search feature. In the Google search app, event searches take over when Google detects that a user is looking for an event.

The feed from the scraper is parsed for events inside query server to generate a list containing Event Titles and Links. Each event in this list is then searched for in the database to check if it exists already. We will be using elastic search to achieve fuzzy searching for events in Open Event database as elastic search is planned for the API to be used.

One example of what we wish to achieve by implementing this type of search in the database follows. The user may search for

-Google Cloud Event Delhi
-Google Event, Delhi
-Google Cloud, Delhi
-google cloud delhi
-Google Cloud Onboard Delhi
-Google Delhi Cloud event

All these searches should match with “Google Cloud Onboard Event, Delhi” with good accuracy. After removing duplicates and events which already exist in the database from this list have been deleted, each event is rendered on search frontend of Open Event as a separate event. The user can click on any of these event, which will make a call to event collect.

Event Collect

The event collect project is developed as a separate module which has two parts

● Site specific scrapers
In its present state, event collect has scrapers for eventbrite and ticket-leap which, given a query, scrape eventbrite (and ticket-leap respectively) search results and downloads JSON files of each event using Loklak‘s API.
The scrapers can be developed in any form or any number of scrapers/scraping tools can be added as long as they are in alignment with the Open Event Import API’s data format. Writing tests for these against the concurrent API formats will take care of this. This part will be covered by using a json-validator​ to check against a pre-generated schema.

● REST APIs
The scrapers are exposed through a set of APIs, which will include, but not limited to,
➢ api/fetch-event : ​to scrape any event given the link and compose the data in a predefined JSON format which will be generated based on Open Event Import API. When this function is called on an event link, scrapers are invoked which collect event data such as event, meta, forms etc. This data will be validated against the generated JSON schema. The scraped JSON and directory structure for media files:
➢ api/export : to export all the JSON data containing event information into Open Event Server. As and when the scraping is complete, the data will be added into Open Event’s database as a new event.

How the Import works

The following graphic shows how the import works.




Let’s dive into the workflow. So as the diagram illustrates, the ‘search​’ functionality makes a call to api/list API endpoint provided by query-server which returns with events’ ‘Title’ and ‘Event Link’ from the parsed XML/JSON feed. This list is displayed as Open Event’s search results. Now the results having been displayed, the user can click on any of the events. When the user clicks on any event, the event is searched for in Open Event’s database. Two things happen now:

  • The event page loads if the event is found.
  • If the event does not already exist in the database, clicking on any event will

➢ Insert this event’s title and link in the database and get the event_id

➢ Make a call to api/fetch-event in event-collect which then invokes a site-specific scraper to fetch data about the event the user has chosen

➢ When the data is scraped, it is imported into Open Event database using the previously generated event_id. The page will be loaded using jquery ajax ​as and when the scraping is done.​When the imports are done, the search page refreshes with the new results. The Open Event Orga Server exposes a well documented REST API that can be used by external services to access the data.

Sorting language-translation in Open Event Server project using Jinja 2 dictsort.

Working on the Open Event Server project an issue about arranging language-translation listing in alphabetical order came up. To solve this issue of language listing arrangement i.e. #2817, I found the ‘d0_dictsort’ function in jinja2 to sort dictionaries. It is a defined in jinja2.filters. Python dicts are unsorted and in our web application we at times may want to order them by either their key or value. So this function comes handy.

This is what the function looks like:

do_dictsort(value, case_sensitive=False, by='key')

We can write them in three ways as:

{% for record in my_dictionary|dictsort %}
    case insensitive and sort the dict by key

{% for record in my_dictionary|dicsort(true) %}
    case sensitive and sort the dict by key

{% for record in my_dictionary|dictsort(false, 'value') %}
    sort the dict by value, normally sorted and case insensitive
  1.       The first way is easily understood that dict has been sorted by key not taking case into consideration. It is just in the same way written as dictsort(false).
  2.       Second way is basically the first being case sensitive. dictsort(true) here tells us that case is sensitive.
  3.      Third way is dictsort(false,’value’). The first parameter defines that case insensitive while second parameter defines that it is sorted by ‘value’.

The issues was to sort translation selector for the page in alphabetical order. The languages were stored in a dictionary which to change in order, I found this function very easy and useful.

Basically what we had was:

This is how the function was used in the code for the sort. Like this:

<ul class="dropdown-menu lang-list">
   {% for code in all-languages|dictsort(false,'value') %}
       <li><a  href="#" style="#969191" class="translate" id="{{ code[0] }}">{{  all_languages[code[0]] }}<>a><li>
    {% endfor %}
<ul>


Here:
{{ all_languages }} is the list which contained the languages like French, English, etc., which could be accessed with its global language code. code here(index for all_languages) is a tuple of {‘global_language_code’,’language’} (An example would be (‘fr’,’French’), so code[0] gave me the language_code.

Finally, the result:

This is one of the simple ways to sort your dictionaries.

Open Event Server: No (no-wrap) Ellipsis using jquery!

Yes, the title says it all i.e., Enabling multiple line ellipsis. This was used to solve an issue to keep Session abstract view within 200 characters (#3059) on FOSSASIA‘s Open Event Server project.

There is this one way to ellipsis a paragraph in html-css and that is by using the text-overflow property:

.div_class{
white-space: nowrap;
overflow: hidden;
text-overflow: ellipsis;
}’’

But the downside of this is the one line ellipis. Eg: My name is Medozonuo. I am…..

And here you might pretty much want to ellipsis after a few characters in multiple lines, given that your div space is small and you do want to wrap your paragraph. Or maybe not.

So jquery to the rescue.

There are two ways you can easily do this multiple line ellipsis:

1) Height-Ellipsis (Using the do-while loop):

//script:
if ($('.div_class').height() > 100) {
    var words = $('.div_class').html().split(/\s+/);
    words.push('...');

    do {
        words.splice(-2, 1);
        $('.div_class').html( words.join(' ') );
    } while($('.div_class').height() > 100);
}

Here, you check for the div content’s height and split the paragraph after that certain height and add a “…”, do- while making sure that the paragraphs are in multiple lines and not in one single line. But checkout for that infinite loop.

2) Length-Ellipsis (Using substring function):  

//script:
$.each($('.div_class'), function() {
        if ($(this).html().length > 100) {
               var cropped_words = $(this).html();
               cropped_words = cropped_words.substring(0, 200) + "...";
               $(this).html(cropped_words);
        }
 });

Here, you check for the length/characters rather than the height, take in the substring of the content starting from 0-th character to the 200-th character and then add in extra “…”.

This is exactly how I used it in the code.

$.each($('.short_abstract',function() {
   if ($(this).html().length > 200) {
       var  words = $(this).html();
       words = words.substring(0,200 + "...";
       $(this).html(words);
    }
});


So ellipsing paragraphs over heights and lengths can be done using jQuery likewise.

ember.js – the right choice for the Open Event Front-end

With the development of the API server for the Open Event project we needed to decide which framework to choose for the new Open Event front-end. With the plethora of javascript frameworks available, it got really difficult to decide, which one is actually the right choice. Every month a new framework arrives, and the existing ones keep actively updating themselves often. We decided to go with Ember.js. This article covers the emberJS framework and highlights its advantages over others and  demonstrates its usefulness.

EmberJS is an open-source JavaScript application front end framework for creating web applications, and uses Model-View-Controller (MVC) approach. The framework provides universal data binding. It’s focus lies on scalability.

Why is Ember JS great?

Convention over configuration – It does all the heavy lifting.

Ember JS mandates best practices, enforces naming conventions and generates the boilerplate code for the various components and routes itself. This has advantages other than uniformity. It is easier for other developers to join the project and start working right away, instead of spending hours on existing codebase to understand it, as the core structure of all ember apps is similar. To get an ember app started with the basic route, user doesn’t has to do much, ember does all the heavy lifting.

ember new my-app
ember server

After installing this is all it takes to create your app.

Ember CLI

Similar to Ruby on Rails, ember has a powerful CLI. It can be used to generate boiler plate codes for components, routes, tests and much more. Testing is possible via the CLI as well.

ember generate component my-component
ember generate route my-route
ember test

These are some of the examples which show how easy it is to manage the code via the ember CLI.

Tests.Tests.Tests.

Ember JS makes it incredibly easy to use test-first approach. Integration tests, acceptance tests, and unit tests are in built into the framework. And can be generated from the CLI itself, the documentation on them is well written and it’s really easy to customise them.

ember generate acceptance-test my-test

This is all it takes to set up the entire boiler plate for the test, which you can customise

Excellent documentation and guides

Ember JS has one of the best possible documentations available for a framework. The guides are a breeze to follow. It is highly recommended that, if starting out on ember, make the demo app from the official ember Guides. That should be enough to get familiar with ember.

Ember Guides is all you need to get started.

Ember Data

It sports one of the best implemented API data fetching capabilities. Fetching and using data in your app is a breeze. Ember comes with an inbuilt data management library Ember Data.

To generate a data model via ember CLI , all you have to do is

ember generate model my-model

Where is it being used?

Ember has a huge community and is being used all around. This article focuses on it’s salient features via the example of Open Event Orga Server project of FOSSASIA. The organizer server is primarily based on FLASK with jinja2 being used for rendering templates. At the small scale, it was efficient to have both the front end and backend of the server together, but as it grew larger in size with more refined features it became tough to keep track of all the minor edits and customizations of the front end and the code started to become complex in nature. And that gave birth to the new project Open Event Front End which is based on ember JS which will be covered in the next week.

With the orga server being converted into a fully functional API, the back end and the front end will be decoupled thereby making the code much cleaner and easy to understand for the other developers that may wish to contribute in the future. Also, since the new front end is being designed with ember JS, it’s UI will have a lot of enhanced features and enforcing uniformity across the design would be much easier with the help of components in ember. For instance, instead of making multiple copies of the same code, components are used to avoid repetition and ensure uniformity (change in one place will reflect everywhere)

<.div class="{{if isWide 'event wide ui grid row'}}">
  {{#if isWide}}
    {{#unless device.isMobile}}
      <.div class="ui card three wide computer six wide tablet column">
        <.a class="image" href="{{href-to 'public' event.identifier}}">
          {{widgets/safe-image src=(if event.large event.large event.placeholderUrl)}}
        <./a>
      <./div>
    {{/unless}}
  {{/if}}
  <.div class="ui card {{unless isWide 'event fluid' 'thirteen wide computer ten wide tablet sixteen wide mobile column'}}">
    {{#unless isWide}}
      <.a class="image" href="{{href-to 'public' event.identifier}}">
        {{widgets/safe-image src=(if event.large event.large event.placeholderUrl)}}
      <./a>
    {{/unless}}
    <.div class="main content">
      <.a class="header" href="{{href-to 'public' event.identifier}}">
        <.span>{{event.name}}<./span>
      <./a>
      <.div class="meta">
        <.span class="date">
          {{moment-format event.startTime 'ddd, MMM DD HH:mm A'}}
        <./span>
      <./div>
      <.div class="description">
        {{event.shortLocationName}}
      <./div>
    <./div>
    <.div class="extra content small text">
      <.span class="right floated">
        <.i role="button" class="share alternate link icon" {{action shareEvent event}}><./i>
      <./span>
      <.span>
        {{#if isYield}}
          {{yield}}
        {{else}}
          {{#each tags as |tag|}}
            <.a>{{tag}}<./a>
          {{/each}}
        {{/if}}
      <./span>
    <./div>
  <./div>
<./div>

This is a perfect example of the power of components in ember, this is a component for event information display in a card format which in addition to being rendered differently for various screen sizes can act differently based on passed parameters, thereby reducing the redundancy of writing separate components for the same.

Ember is a step forward towards the future of the web. With the help of Babel.js it is possible to write ES6/2015 syntax and not worry about it’s compatibility with the browsers. It will take care of it.

This is perfectly valid and will be compatible with majority of the supported browsers.

actions: {
  submit() {
    this.onValid(()=> {
    });
  }
}

 

Some references used for the blog article:

  1. https://www.codeschool.com/blog/2015/10/26/7-reasons-to-use-ember-js/
  2. https://www.quora.com/What-are-the-advantages-of-using-Ember-js
  3. Official Ember Guides: https://guides.emberjs.com

 
This page/product/etc is unaffiliated with the Ember project. Ember is a trademark of Tilde Inc

DetachedInstanceError: dealing with Celery, Flask’s app context and SQLAlchemy

In the open event server project, we had chosen to go with celery for async background tasks. From the official website,

What is celery?

Celery is an asynchronous task queue/job queue based on distributed message passing.

What are tasks?

The execution units, called tasks, are executed concurrently on a single or more worker servers using multiprocessing.

After the tasks had been set up, an error constantly came up whenever a task was called

The error was:

DetachedInstanceError: Instance <User at 0x7f358a4e9550> is not bound to a Session; attribute refresh operation cannot proceed

The above error usually occurs when you try to access the session object after it has been closed. It may have been closed by an explicit session.close() call or after committing the session with session.commit().

The celery tasks in question were performing some database operations. So the first thought was that maybe these operations might be causing the error. To test this theory, the celery task was changed to :

@celery.task(name='lorem.ipsum')
def lorem_ipsum():
    pass

But sadly, the error still remained. This proves that the celery task was just fine and the session was being closed whenever the celery task was called. The method in which the celery task was being called was of the following form:

def restore_session(session_id):
    session = DataGetter.get_session(session_id)
    session.deleted_at = None
    lorem_ipsum.delay()
    save_to_db(session, "Session restored from Trash")
    update_version(session.event_id, False, 'sessions_ver')


In our app, the app_context was not being passed whenever a celery task was initiated. Thus, the celery task, whenever called, closed the previous app_context eventually closing the session along with it. The solution to this error would be to follow the pattern as suggested on http://flask.pocoo.org/docs/0.12/patterns/celery/.

def make_celery(app):
    celery = Celery(app.import_name, broker=app.config['CELERY_BROKER_URL'])
    celery.conf.update(app.config)
    task_base = celery.Task

    class ContextTask(task_base):
        abstract = True

        def __call__(self, *args, **kwargs):
            if current_app.config['TESTING']:
                with app.test_request_context():
                    return task_base.__call__(self, *args, **kwargs)
            with app.app_context():
                return task_base.__call__(self, *args, **kwargs)

    celery.Task = ContextTask
    return celery

celery = make_celery(current_app)


The __call__ method ensures that celery task is provided with proper app context to work with.

 

Event-driven programming in Flask with Blinker signals

Setting up blinker:

The Open Event Project offers event managers a platform to organize all kinds of events including concerts, conferences, summits and regular meetups. In the server part of the project, the issue at hand was to perform multiple tasks in background (we use celery for this) whenever some changes occurred within the event, or the speakers/sessions associated with the event.

The usual approach to this would be applying a function call after any relevant changes are made. But the statements making these changes were distributed all over the project at multiple places. It would be cumbersome to add 3-4 function calls (which are irrelevant to the function they are being executed) in so may places. Moreover, the code would get unstructured with this and it would be really hard to maintain this code over time.

That’s when signals came to our rescue. From Flask 0.6, there is integrated support for signalling in Flask, refer http://flask.pocoo.org/docs/latest/signals/ . The Blinker library is used here to implement signals. If you’re coming from some other language, signals are analogous to events.

Given below is the code to create named signals in a custom namespace:


from blinker import Namespace

event_signals = Namespace()
speakers_modified = event_signals.signal('event_json_modified')

If you want to emit a signal, you can do so by calling the send() method:


speakers_modified.send(current_app._get_current_object(), event_id=event.id, speaker_id=speaker.id)

From the user guide itself:

“ Try to always pick a good sender. If you have a class that is emitting a signal, pass self as sender. If you are emitting a signal from a random function, you can pass current_app._get_current_object() as sender. “

To subscribe to a signal, blinker provides neat decorator based signal subscriptions.


@speakers_modified.connect
def name_of_signal_handler(app, **kwargs):

 

Some Design Decisions:

When sending the signal, the signal may be sending lots of information, which your signal may or may not want. e.g when you have multiple subscribers listening to the same signal. Some of the information sent by the signal may not be of use to your specific function. Thus we decided to enforce the pattern below to ensure flexibility throughout the project.


@speakers_modified.connect
def new_handler(app, **kwargs):
# do whatever you want to do with kwargs['event_id']

In this case, the function new_handler needs to perform some task solely based on the event_id. If the function was of the form def new_handler(app, event_id), an error would be raised by the app. A big plus of this approach, if you want to send some more info with the signal, for the sake of example, if you also want to send speaker_name along with the signal, this pattern ensures that no error is raised by any of the subscribers defined before this change was made.

When to use signals and when not ?

The call to send a signal will of course be lying in another function itself. The signal and the function should be independent of each other. If the task done by any of the signal subscribers, even remotely affects your current function, a signal shouldn’t be used, use a function call instead.

How to turn off signals while testing?

When in testing mode, signals may slow down your testing as unnecessary signals subscribers which are completely independent from the function being tested will be executed numerous times. To turn off executing the signal subscribers, you have to make a small change in the send function of the blinker library.

Below is what we have done. The approach to turn it off may differ from project to project as the method of testing differs. Refer https://github.com/jek/blinker/blob/master/blinker/base.py#L241 for the original function.


def new_send(self, *sender, **kwargs):
    if len(sender) == 0:
        sender = None
    elif len(sender) > 1:
        raise TypeError('send() accepts only one positional argument, '
                        '%s given' % len(sender))
    else:
        sender = sender[0]
    # only this line was changed
    if not self.receivers or app.config['TESTING']:
        return []
    else:
        return [(receiver, receiver(sender, **kwargs))
                for receiver in self.receivers_for(sender)]
                
Signal.send = new_send

event_signals = Namespace
# and so on ....

That’s all for now. Have some fun signaling 😉 .

 

Set proper content type when uploading files on s3 with python-magic

In the open-event-orga-server project, we had been using Amazon s3 storage for a long time now. After some time we encountered an issue that no matter what the file type was, the Content-Type when retrieving this files from the storage solution was application/octet-stream.

An example response when retrieving an image from s3 was as follows:


Accept-Ranges →bytes
Content-Disposition →attachment; filename=HansBakker_111.jpg
Content-Length →56060
Content-Type →application/octet-stream
Date →Fri, 09 Sep 2016 10:51:06 GMT
ETag →"964b1d839a9261fb0b159e960ceb4cf9"
Last-Modified →Tue, 06 Sep 2016 05:06:23 GMT
Server →AmazonS3
x-amz-id-2 →1GnO0Ta1e+qUE96Qgjm5ZyfyuhMetjc7vfX8UWEsE4fkZRBAuGx9gQwozidTroDVO/SU3BusCZs=
x-amz-request-id →ACF274542E950116

 

As seen above instead of providing image/jpeg as the Content-Type, it provides the Content-Type as application/octet-stream.While uploading the files, we were not providing the content type explicitly, which seemed to be the root of the problem.

It was decided that we would be providing the content type explicitly, so it was time to choose an efficient library to determine the file type based on the content of the file and not the file extension. After researching through the available libraries python-magic seemed to be the obvious choice. python-magic is a python interface to the libmagic file type identification library. libmagic identifies file types by checking their headers according to a predefined list of file types.

Here is an example straight from python-magic‘s readme on its usage:


>>> import magic
>>> magic.from_file("testdata/test.pdf")
'PDF document, version 1.2'
>>> magic.from_buffer(open("testdata/test.pdf").read(1024))
'PDF document, version 1.2'
>>> magic.from_file("testdata/test.pdf", mime=True)
'application/pdf'

 

Given below is a code snippet for the s3 upload function in the project:


file_data = file.read()
    file_mime = magic.from_buffer(file_data, mime=True)
    size = len(file_data)
    # k is defined as  k = Key(bucket) in previous code
    sent = k.set_contents_from_string(
        file_data,
        headers={
            'Content-Disposition': 'attachment; filename=%s' % filename,
            'Content-Type': '%s' % file_mime
        }
    ) 

 

One thing to note that as python-magic uses libmagic-dev as a dependency and many of the distros do not come with libmagic-dev pre-installed, make sure you install libmagic-dev explicitly. (Installation instructions may vary per distro)


sudo apt-get install libmagic-dev

Voila !! Now when retrieving each and every file you’ll get the proper content type.

 

Generating a documentation site from markup documents with Sphinx and Pandoc

Generating a fully fledged website from a set of markup documents is no easy feat. But thanks to the wonderful tool sphinx, it certainly makes the task easier. Sphinx does the heavy lifting of generating a website with built in javascript based search. But sometimes it’s not enough.

This week we were faced with two issues related to documentation generation on loklak_server and susi_server. First let me give you some context. Now sphinx requires an index.rst file within /docs/  which it uses to generate the first page of the site. A very obvious way to fill it which helps us avoid unnecessary duplication is to use the include directive of reStructuredText to include the README file from the root of the repository.

This leads to the following two problems:

  • Include directive can only properly include a reStructuredText, not a markdown document. Given a markdown document, it tries to parse the markdown as  reStructuredText which leads to errors.
  • Any relative links in README break when it is included in another folder.

To fix the first issue, I used pypandoc, a thin wrapper around Pandoc. Pandoc is a wonderful command line tool which allows us to convert documents from one markup format to another. From the official Pandoc website itself,

If you need to convert files from one markup format into another, pandoc is your swiss-army knife.

pypandoc requires a working installation of Pandoc, which can be downloaded and installed automatically using a single line of code.

pypandoc.download_pandoc()

This gives us a cross-platform way to download pandoc without worrying about the current platform. Now, pypandoc leaves the installer in the current working directory after download, which is fine locally, but creates a problem when run on remote systems like Travis. The installer could get committed accidently to the repository. To solve this, I had to take a look at source code for pypandoc and call an internal method, which pypandoc basically uses to set the name of the installer. I use that method to find out the name of the file and then delete it after installation is over. This is one of many benefits of open-source projects. Had pypandoc not been open source, I would not have been able to do that.

url = pypandoc.pandoc_download._get_pandoc_urls()[0][pf]
filename = url.split(‘/’)[-1]
os.remove(filename)

Here pf is the current platform which can be one of ‘win32’, ‘linux’, or ‘darwin’.

Now let’s take a look at our second issue. To solve that, I used regular expressions to capture any relative links. Capturing links were easy. All links in reStructuredText are in the same following format.

`Title <url>`__

Similarly links in markdown are in the following format

[Title](url)

Regular expressions were the perfect candidate to solve this. To detect which links was relative and need to be fixed, I checked which links start with the \docs\ directory and then all I had to do was remove the \docs prefix from those links.

A note about loklak and susi server project

Loklak is a server application which is able to collect messages from various sources, including twitter.

SUSI AI is an intelligent Open Source personal assistant. It is capable of chat and voice interaction and by using APIs to perform actions such as music playback, making to-do lists, setting alarms, streaming podcasts, playing audiobooks, and providing weather, traffic, and other real time information

Using NodeBuilder to instantiate node based Elasticsearch client and Visualising data

As elastic.io mentions, Elasticsearch is a distributed, RESTful search and analytics engine capable of solving a growing number of use cases. But in many setups, it is not possible to manually install an Elasticsearch node on a machine. To handle these type of scenarios, Elasticsearch provides the NodeBuilder module, which can be used to spawn Elasticsearch node programmatically. Let’s see how.

Getting Dependencies

In order to get the ES Java API, we need to add the following line to dependencies.

compile group: 'org.elasticsearch', name: 'securesm', version: '1.0'

The required packages will be fetched the next time we gradle build.

Configuring Settings

In the Elasticsearch Java API, Settings are used to configure the node(s). To create a node, we first need to define its properties.

Settings.Builder settings = new Settings.Builder();

settings.put("cluster.name", "cluster_name");  // The name of the cluster

// Configuring HTTP details
settings.put("http.enabled", "true");
settings.put("http.cors.enabled", "true");
settings.put("http.cors.allow-origin", "https?:\/\/localhost(:[0-9]+)?/");  // Allow requests from localhost
settings.put("http.port", "9200");

// Configuring TCP and host
settings.put("transport.tcp.port", "9300");
settings.put("network.host", "localhost");

// Configuring node details
settings.put("node.data", "true");
settings.put("node.master", "true");

// Configuring index
settings.put("index.number_of_shards", "8");
settings.put("index.number_of_replicas", "2");
settings.put("index.refresh_interval", "10s");
settings.put("index.max_result_window", "10000");

// Defining paths
settings.put("path.conf", "/path/to/conf/");
settings.put("path.data", "/path/to/data/");
settings.put("path.home", "/path/to/data/");

settings.build();  // Buid with the assigned configurations

There are many more settings that can be tuned in order to get desired node configuration.

Building the Node and Getting Clients

The Java API makes it very simple to launch an Elasticsearch node. This example will make use of settings that we just built.

Node elasticsearchNode = NodeBuilder.nodeBuilder().local(false).settings(settings).node();

A piece of cake. Isn’t it? Let’s get a client now, on which we can execute our queries.

Client elasticsearhClient = elasticsearchNode.client();

Shutting Down the Node

elasticsearchNode.close();

A nice implementation of using the module can be seen at ElasticsearchClient.java in the loklak project. It uses the settings from a configuration file and builds the node using it.


Visualisation using elasticsearch-head

So by now, we have an Elasticsearch client which is capable of doing all sorts of operations on the node. But how do we visualise the data that is being stored? Writing code and running it every time to check results is a lengthy thing to do and significantly slows down development/debugging cycle.

To overcome this, we have a web frontend called elasticsearch-head which lets us execute Elasticsearch queries and monitor the cluster.

To run elasticsearch-head, we first need to have grunt-cli installed –

$ sudo npm install -g grunt-cli

Next, we will clone the repository using git and install dependencies –

$ git clone git://github.com/mobz/elasticsearch-head.git
$ cd elasticsearch-head
$ npm install

Next, we simply need to run the server and go to indicated address on a web browser –

$ grunt server

At the top, enter the location at which elasticsearch-head can interact with the cluster and Connect.

Upon connecting, the dashboard appears telling about the status of cluster –

The dashboard shown above is from the loklak project (will talk more about it).

There are 5 major sections in the UI –
1. Overview: The above screenshot, gives details about the indices and shards of the cluster.
2. Index: Gives an overview of all the indices. Also allows to add new from the UI.
3. Browser: Gives a browser window for all the documents in the cluster. It looks something like this –

The left pane allows us to set the filter (index, type and field). The table listed is sortable. But we don’t always get what we are looking for manually. So, we have the following two sections.
4. Structured Query: Gives a dead simple UI that can be used to make a well structured request to Elasticsearch. This is what we need to search for to get Tweets from @gsoc that are indexed –

5. Any Request: Gives an advance console that allows executing any query allowable by Elasticsearch API.

A little about the loklak project and Elasticsearch

loklak is a server application which is able to collect messages from various sources, including twitter. The server contains a search index and a peer-to-peer index sharing interface. All messages are stored in an elasticsearch index.

Source: github/loklak/loklak_server

The project uses Elasticsearch to index all the data that it collects. It uses NodeBuilder to create Elasticsearch node and process the index. It is flexible enough to join an existing cluster instead of creating a new one, just by changing the configuration file.

Conclusion

This blog post tries to explain how NodeBuilder can be used to create Elasticsearch nodes and how they can be configured using Elasticsearch Settings.

It also demonstrates the installation and basic usage of elasticsearch-head, which is a great library to visualise and check queries against an Elasticsearch cluster.

The official Elasticsearch documentation is a good source of reference for its Java API and all other aspects.

Adding swap space to your DigitalOcean droplet, if you run out of RAM

The Open Event Android App generator runs on a DigitalOcean. The deployment runs on a USD 10 box, that has 1 GB of RAM, but for testing I often use a USD 5 box, that has only 512mb of RAM.

When trying to build an android app using gradle and Java 8, there could be an issue where you run out of RAM (especially if it’s 512 only).

What we can do to remedy this problem is creating a swapfile. On an SSD based system, Swap spaces work almost as fast as RAM, because SSDs have very high R/W speeds.

Check hard disk space availability using

df -h

There should be an output like this

Filesystem      Size  Used Avail Use% Mounted on
udev            238M     0  238M   0% /dev
tmpfs            49M  624K   49M   2% /run
/dev/vda1        20G  1.1G   18G   6% /
tmpfs           245M     0  245M   0% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs           245M     0  245M   0% /sys/fs/cgroup
tmpfs            49M     0   49M   0% /run/user/1001

The steps to create a swap file and allocating it as swap are

sudo fallocate -l 1G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile

We can verify using

sudo swapon --show
NAME      TYPE  SIZE USED PRIO
/swapfile file 1024M   0B   -1

And now if we see RAM usage using free -h , we’ll see

              total        used        free      shared  buff/cache   available
Mem:           488M         37M         96M        652K        354M        425M
Swap:          1.0G          0B        1.0G

Do not use this as a permanent measure for any SSD based filesystem. It can corrupt your SSD if used as swap for long. We use this only for short periods of time to help us build android apks on low ram systems.