Indexing for multiscrapers in Loklak Server

I recently added multiscraper system which can scrape data from web-scrapers like YoutubeScraper, QuoraScraper, GithubScraper, etc. As scraping is a costly task, it is important to improve it’s efficiency. One of the approach is to index data in cache. TwitterScraper uses multiple sources to optimize the efficiency.

This system uses Post message holder object to store data and PostTimeline (a specialized iterator) to iterate the data objects. This difference in data structures from TwitterScraper leads to the need of different approach to implement indexing of data to ElasticSearch (currently in review process).

These are the following changes I made while implementing ‘indexing of data’ in the project.

1) Writing of data is invoked only using PostTimeline iterator

In TwitterScraper, the data is written in message holder TwitterTweet. So all the tweets are written to index as they are created. Here, when the data is scraped, Writing of the posts is initiated. Scraping of data is considered a heavy process. This approach keeps lower resource usage in average traffic on the server.

protected Post putData(Post typeArray, String key, Timeline2 postList) {
   if(!"cache".equals(this.source)) {
       postList.writeToIndex();
   }
   return this.putData(typeArray, key, postList.toArray());
}

2) One object for holding a message

During the implementation, I kept the same message holder Post and post-iterator PostTimeline from scraping to indexing of data. This helps to keep the structure uniform. Earlier approach involves different types of message wrappers in the way. This approach cuts the processes for looping and transitioning of data structures.

3) Index a list, not a message

In TwitterScraper, as the messages are enqueued in the bulk to be indexed. But in this approach, I have enqueued the complete lists. This approach delays the indexing till the scraper is done with processing the html.

Creating the queue of postlists:

// Add post-lists to queue to be indexed
queueClients.incrementAndGet();
try {
    postQueue.put(postList);
} catch (InterruptedException e) {
DAO.severe(e);
}
queueClients.decrementAndGet();

 

Indexing of the posts in postlists:

// Start indexing of data in post-lists
for (Timeline2 postList: postBulk) {
    if (postList.size() < 1) continue;
    if(postList.dump) {
        // Dumping of data in a file
        writeMessageBulkDump(postList);
    }
    // Indexing of data to ElasticSearch
    writeMessageBulkNoDump(postList);
}

 

4) Categorizing the input parameters

While searching the index, I have divided the query parameters from scraper into 3 categories. The input parameters are added to those categories (implemented using map data structure) and thus data fetched are according to them. These categories are:

// Declaring the QueryBuilder
BoolQueryBuilder query = new BoolQueryBuilder();

 

a) Get the parameter– Get the results for the input fields in map getMap.

// Result must have these fields. Acts as AND operator
if(getMap != null) {
    for(Map.Entry<String, String> field : getMap.entrySet()) {
        query.must(QueryBuilders.termQuery(
field.getKey(), field.getValue()));
    }
}

 

b) Don’t get the parameter- Don’t get the results for the input fields in map notGetMap.

// Result must not have these fields.
if(notGetMap != null) {
    for(Map.Entry<String, String> field : notGetMap.entrySet()) {
        query.mustNot(QueryBuilders.termQuery(
                field.getKey(), field.getValue()));
    }
}

 

c) Get if possible- Get the results with the input fields if they are present in the index.

// Result may preferably also get these fields. Acts as OR operator
if(mayAlsoGetMap != null) {
    for(Map.Entry<String, String> field : mayAlsoGetMap.entrySet()) {
        query.should(QueryBuilders.termQuery(
                field.getKey(), field.getValue()));

    }
}

 

By applying these changes, the scrapers are shifted from a message indexing to list of messages indexing. This way we are keeping load on RAM low, but the aggregation of latest scraped data may be affected. So there will be a need to workaround to solve this issue while scraping itself.

References

Uploading Images to SUSI Server

SUSI Skill CMS is a web app to create and modify SUSI Skills. It needs API Endpoints to function and SUSI Server makes it possible. In this blogpost, we will see how to add a servlet to SUSI Server to upload images and files.

The CreateSkillService.java file is the servlet which handles the process of creating new Skills. It requires different user roles to be implemented and hence it extends the AbstractAPIHandler.

Image upload is only possible via a POST request so we will first override the doPost method in this servlet.

  @Override
  protected void doPost(HttpServletRequest req, HttpServletResponse resp) throws ServletException, IOException {
  resp.setHeader("Access-Control-Allow-Origin", "*"); // enable CORS

resp.setHeader enables the CORS for the servlet. This is required as POST requests must have CORS enables from the server. This is an important security feature that is provided by the browser.

        Part file = req.getPart("image");
        if (file == null) {
            json.put("accepted", false);
            json.put("message", "Image not given");
        }

Image upload to servers is usually a Multipart Request. So we get the part which is named as “image” in the form data.

When we receive the image file, then we check if the image with the same name exists on the server or not.

Path p = Paths.get(language + File.separator + “images/” + image_name);

        if (image_name == null || Files.exists(p)) {
                json.put("accepted", false);
                json.put("message", "The Image name not given or Image with same name is already present ");
            }

If the same file is present on the server then we return an error to the user requesting to give a unique filename to upload.

Image image = ImageIO.read(filecontent);
BufferedImage bi = this.createResizedCopy(image, 512, 512, true);
if(!Files.exists(Paths.get(language.getPath() + File.separator + "images"))){
   new File(language.getPath() + File.separator + "images").mkdirs();
           }
ImageIO.write(bi, "jpg", new File(language.getPath() + File.separator + "images/" + image_name));

Then we read the content for the image in an Image object. Then we check if images directory exists or not. If there is no image directory in the skill path specified then create a folder named “images”.

We usually prefer square images at the Skill CMS. So we create a resized copy of the image of 512×512 dimensions and save that copy to the directory we created above.

BufferedImage createResizedCopy(Image originalImage, int scaledWidth, int scaledHeight, boolean preserveAlpha) {
        int imageType = preserveAlpha ? BufferedImage.TYPE_INT_RGB : BufferedImage.TYPE_INT_ARGB;
        BufferedImage scaledBI = new BufferedImage(scaledWidth, scaledHeight, imageType);
        Graphics2D g = scaledBI.createGraphics();
        if (preserveAlpha) {
            g.setComposite(AlphaComposite.Src);
        }
        g.drawImage(originalImage, 0, 0, scaledWidth, scaledHeight, null);
        g.dispose();
        return scaledBI;
    }

The function above is used to create a  resized copy of the image of specified dimensions. If the image was a PNG then it also preserves the transparency of the image while creating a copy.

Since the SUSI server follows an API centric approach, all servlets respond in JSON.

       resp.setContentType("application/json");
       resp.setCharacterEncoding("UTF-8");
       resp.getWriter().write(json.toString());’

At last, we set the character encoding and the character set of the output. This helps the clients to parse the data easily.

To see this endpoint in live send a POST request at http://api.susi.ai/cms/createSkill.json.

Resources

Apache Docs: https://commons.apache.org/proper/commons-fileupload/using.html

Multipart POST Request Tutorial: http://www.codejava.net/java-se/networking/upload-files-by-sending-multipart-request-programmatically

Java File Upload tutorial: https://ursaj.com/upload-files-in-java-with-servlet-api

Jetty Project: https://github.com/jetty-project/

GET and POST requests

If you wonder how to get or update page resource, you have to read this article.

It’s trivial if you have basic knowledge about HTTP protocol. I’d like to get you little involved to this subject.

So GET and POST are most useful methods in HTTP protocol.

What is HTTP?

Hypertext transfer protocol – allow us to communicate between client and server side. In Open Event project we use web browser as client and for now we use Heroku for server side.

Difference between GET and POST methods

GET – it allows to get data from specified resources

POST – it allows to submit new data to specified resources for example by html form.

GET samples:

For example we use it to get details about event

curl http://open-event-dev.herokuapp.com/api/v2/events/95

Response from server:

Of course you can use this for another needs, If you are a poker player I suppose that you’d like to know how many percentage you have on hand.

curl http://www.propokertools.com/simulations/show?g=he&s=generic&b&d&h1=AA&h2=KK&h3&h4&h5&h6&_

POST samples:

curl -X POST https://example.com/resource.cgi

You can often find this action in a contact page or in a login page.

How does request look in python?

We use Requests library to communication between client and server side. It’s very readable for developers. You can find great documentation  and a lot of code samples on their website. It’s very important to see how it works.

>>> r = requests.get('https://api.github.com/user', auth=('user', 'pass'))
>>> r.status_code
200

I know that samples are very important, but take a look how Requests library fulfils our requirements in 100%. We have decided to use it because we would like to communicate between android app generator and orga server application. We have needed to send request with params(email, app_name, and api of event url) by post method to android generator resource. It executes the process of sending an email – a package of android application to a provided email address.

data = {
    "email": login.current_user.email,
    "app_name": self.app_name,
    "endpoint": request.url_root + "api/v2/events/" + str(self.event.id)
}
r = requests.post(self.app_link, json=data)