Configuring Youtube Scraper with Search Endpoint in Loklak Server

YoutubeScraper is one of the more interesting web scrapers in Loklak Server, with a unique implementation of data scraping and data-key creation (using RDF). Until now it could not be accessed, as it had no URL endpoint. I configured it to be usable both through a separate endpoint (/api/youtubescraper) and through the search endpoint (/api/search.json).

Usage:

  1. YoutubeScraper endpoint: /api/youtubescraper
     Example: http://api.loklak.org/api/youtubescraper?query=https://www.youtube.com/watch?v=xZ-m55K3FhQ&scraper=youtube
  2. SearchServlet endpoint: /api/search.json
     Example: http://api.loklak.org/api/search.json?query=https://www.youtube.com/watch?v=xZ-m55K3FhQ&scraper=youtube

The following configurations were added in Loklak Server:

1) Endpoint

YoutubeScraper can now be accessed at the endpoint /api/youtubescraper. Like the other scrapers, it uses the BaseScraper class as its superclass.
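As a rough illustration, a scraper exposing its own endpoint might look like this. This is a minimal sketch, not the exact loklak code; apart from BaseScraper and the /api/youtubescraper path, the field and method names (setExtra, scraperName, getAPIPath) are assumptions:

import java.util.Map;

public class YoutubeScraper extends BaseScraper {

    public YoutubeScraper(Map<String, String> extra) {
        this.setExtra(extra);              // assumed helper that stores the get-parameters
        this.baseUrl = "https://www.youtube.com/";
        this.scraperName = "youtube";      // assumed name used for scraper selection
    }

    @Override
    public String getAPIPath() {
        // The separate endpoint configured for this scraper.
        return "/api/youtubescraper";
    }
}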

2) PrepareSearchUrl

The prepareSearchUrl method builds the YouTube URL that is used to fetch and scrape the YouTube webpage. YoutubeScraper takes a URL as input, but that URL may be a shortened link, which is why only the video ID is stored as the query. This keeps the scraper simple and makes it easier to build more scrapers on top of it.

Currently YoutubeScraper scrapes YouTube's video webpages, but scrapers for the search and channel webpages can be added as well.

URIBuilder url = null;
String midUrl = "search/";
try {
    switch (type) {
        case "search":
            midUrl = "search/";
            url = new URIBuilder(this.baseUrl + midUrl);
            url.addParameter("search_query", this.query);
            break;
        case "video":
            midUrl = "watch/";
            url = new URIBuilder(this.baseUrl + midUrl);
            url.addParameter("v", this.query);
            break;
        case "user":
            midUrl = "channel/";
            url = new URIBuilder(this.baseUrl + midUrl + this.query);
            break;
        default:
            url = new URIBuilder("");
            break;
    }
} catch (URISyntaxException e) {
    DAO.log("Invalid Url: baseUrl = " + this.baseUrl + ", mid-URL = " + midUrl
            + ", query = " + this.query + ", type = " + type);
    return "";
}
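For instance, with type "video" and query xZ-m55K3FhQ, and assuming this.baseUrl is https://www.youtube.com/, the builder above yields https://www.youtube.com/watch/?v=xZ-m55K3FhQ.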

 

3) GetDataFromConnection

The getDataFromConnection method fetches a BufferedReader object and feeds it to the scrape method. In YoutubeScraper, this method is overridden to avoid the default implementation, which uses type=all.

@Override
public Post getDataFromConnection() throws IOException {
    String url = this.prepareSearchUrl(this.type);
    return getDataFromConnection(url, this.type);
}
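For context, the two-argument overload called above roughly does the following. This is a sketch under the assumption that loklak's ClientConnection HTTP helper and a scrape(BufferedReader, type, url) method are available; it is not the verbatim implementation:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public Post getDataFromConnection(String url, String type) throws IOException {
    ClientConnection connection = new ClientConnection(url);
    BufferedReader br = new BufferedReader(
            new InputStreamReader(connection.inputStream, StandardCharsets.UTF_8));
    try {
        // Hand the reader to the scrape method for this page type (signature assumed).
        return this.scrape(br, type, url);
    } finally {
        connection.close();
    }
}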

 

4) Setting the scraper's input parameters from get-parameters

The Map of get-parameters passed to the scraper supplies the type and the query. When the input is a URL, the video ID is separated from the URL and used as the query.

this.query = this.getExtraValue("query");
// A YouTube video ID is 11 characters long and sits at the end of both
// full and shortened links, so take the last 11 characters as the query.
this.query = this.query.substring(this.query.length() - 11);
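For illustration, both a full watch URL and a shortened youtu.be link end with the 11-character video ID, so the same extraction works for both (a sketch under that assumption):

String fullUrl = "https://www.youtube.com/watch?v=xZ-m55K3FhQ";
String shortUrl = "https://youtu.be/xZ-m55K3FhQ";

// The video ID is the last 11 characters in both cases.
String idFromFull = fullUrl.substring(fullUrl.length() - 11);    // "xZ-m55K3FhQ"
String idFromShort = shortUrl.substring(shortUrl.length() - 11); // "xZ-m55K3FhQ"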

 

5) Scrape Method

The scrape method runs the scraper's page-specific methods (YoutubeScraper currently has only one), collects their output in a Timeline2 object, and wraps it in a Post object for the output. This simple structure makes the scraper flexible enough to scrape different pages concurrently.

Post out = new Post(true);
Timeline2 postList = new Timeline2(this.order);
// Scrape the video page and collect the result.
postList.addPost(this.parseVideo(br, type, url));
// Wrap the collected posts into the output object.
out.put("videos", postList.toArray());

 


Scraping Concurrently with Loklak Server

At present, SearchScraper in Loklak Server uses numerous threads to scrape the Twitter website. The fetched data is cleaned, and more data is extracted from it. But scraping only Twitter leaves the system underutilized.

Concurrent scraping of other websites like Quora, YouTube, GitHub, etc. can be added to diversify the application. This way, the single endpoint search.json can serve multiple services.

As this feature is still being refined, we will discuss only the basic structure of the system with the new changes. I tried to implement a more abstract way of scraping by:

1) Fetching the input data in SearchServlet

Instead of selecting individual input get-parameters and referencing them separately, the complete Map object is now passed along, which makes it possible to add more functionality based on the input get-parameters. The dataArray object (a JSONArray) is fetched from the DAO.scrapeLoklak method and embedded in the output under the key results.

    // start a scraper
    inputMap.put("query", query);
    DAO.log(request.getServletPath() + " scraping with query: "
           + query + " scraper: " + scraper);
    dataArray = DAO.scrapeLoklak(inputMap, true, true);
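The embedding into the response could then look roughly like this (a sketch; only the results key comes from the description above):

// Attach the scraped results to the servlet's JSON response.
JSONObject json = new JSONObject();
json.put("results", dataArray);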

 

2) Scraping the selected Scrapers concurrently

In DAO.java, the relevant get-parameters of inputMap are fetched and cleaned. They are used to choose which scrapers to run, via the getScraperObjects() method.

Timeline2.Order order = getOrder(inputMap.get("order"));
Timeline2 dataSet = new Timeline2(order);
List<String> scraperList = Arrays.asList(inputMap.get("scraper").trim().split("\\s*,\\s*"));

 

Threads are created to fetch data from the different scrapers, with the pool size matching the number of scraper objects fetched. The input map is passed to each scraper as an argument, so it can read the get-parameters relevant to it and shape its output accordingly.

List<BaseScraper> scraperObjList = getScraperObjects(scraperList, inputMap);
ExecutorService scraperRunner = Executors.newFixedThreadPool(scraperObjList.size());

try {
    // Run every selected scraper on its own thread and merge the results.
    for (BaseScraper scraper : scraperObjList) {
        scraperRunner.execute(() -> {
            dataSet.mergePost(scraper.getData());
        });
    }
} finally {
    // Stop accepting new tasks, then wait until all scrapers have finished
    // so that dataSet is complete before it is used.
    scraperRunner.shutdown();
    try {
        scraperRunner.awaitTermination(24L, TimeUnit.HOURS);
    } catch (InterruptedException e) { }
}

 

3) Fetching the selected Scraper Objects in DAO.java

Here a variable of the abstract class BaseScraper (the superclass of all search scrapers) is used to build the list of scrapers to run. Every scraper's constructor is fed the input map so it can configure itself accordingly.

List<BaseScraper> scraperObjList = new ArrayList<BaseScraper>();
BaseScraper scraperObj = null;

if (scraperList.contains("github") || scraperList.contains("all")) {
    scraperObj = new GithubProfileScraper(inputMap);
    scraperObjList.add(scraperObj);
}
// ... (similar blocks for the other scrapers)
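Putting the pieces together, a caller such as SearchServlet can then trigger concurrent scraping with one call. This is a usage sketch: the Map type and the sample values are assumptions, while the DAO.scrapeLoklak(inputMap, true, true) call appears above.

Map<String, String> inputMap = new HashMap<>();
inputMap.put("query", "fossasia");           // assumed sample query
inputMap.put("scraper", "github,youtube");   // comma-separated scraper selection

// Runs the selected scrapers concurrently and merges their output.
JSONArray dataArray = DAO.scrapeLoklak(inputMap, true, true);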

 


Multithreading implementation in Loklak Server

Loklak Server is a near-realtime system. It performs a large number of tasks that are very costly in terms of resources. Its basic function is to scrape data from websites and output it at its endpoints. In addition to scraping, there is also a need to perform other tasks, such as refining and cleaning the data. That is why multiple threads are instantiated. They perform tasks like:

  1. Refining the data and extracting more from it

The fetched data needs to be cleaned and refined before being output. Some examples:

a) Removal of HTML tags from the tweet text:

After the text is extracted from the HTML and fed to a TwitterTweet object, threads are run concurrently to strip any remaining HTML from it.

b) Unshortening of URL links:

Shortened URLs embedded in the tweet text can be used to track users. To prevent this, a thread is instantiated to unshorten the links concurrently while the tweet text is being cleaned (see the sketch after this list).

  2. Indexing all JSON output data to ElasticSearch

While the JSON data is extracted for output, a method here in Timeline.java indexes the data to ElasticSearch.
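Returning to the URL unshortening in (b) above, here is a minimal sketch of how a shortened link can be expanded by following its redirect. This is illustrative only, using plain java.net, and is not loklak's actual implementation:

import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

public class UnshortenExample {
    // Expand a shortened URL by asking for its redirect target (one hop).
    public static String unshorten(String shortUrl) throws IOException {
        HttpURLConnection con = (HttpURLConnection) new URL(shortUrl).openConnection();
        con.setInstanceFollowRedirects(false); // inspect the redirect instead of following it
        con.setRequestMethod("HEAD");          // headers are enough, no body needed
        int status = con.getResponseCode();
        if (status >= 300 && status < 400) {
            String location = con.getHeaderField("Location");
            if (location != null) {
                return location;               // the expanded target URL
            }
        }
        return shortUrl;                       // not shortened, or no redirect found
    }
}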

Managing multithreading

To manage multithreading, Loklak Server applies following objects:

1. ExecutorService

To deal with a large number of threads, an ExecutorService object is used to handle them, as it helps the JVM prevent resource exhaustion. The threads' lifecycle can be controlled and their creation cost optimized. A good example of ExecutorService in action is here:

// ...
public class TwitterScraper {
    // Create a pool of at most 40 threads running at a time.
    public static final ExecutorService executor = Executors.newFixedThreadPool(40);
    // ...

    // Feed the TwitterTweet object with data.
    TwitterTweet tweet = new TwitterTweet(
        user.getScreenName(),
        Long.parseLong(tweettimems.value),
        props.get("tweettimename").value,
        props.get("tweetstatusurl").value,
        props.get("tweettext").value,
        Long.parseLong(tweetretweetcount.value),
        Long.parseLong(tweetfavouritecount.value),
        imgs, vids, place_name, place_id,
        user, writeToIndex, writeToBackend
    );
    // Start a thread to refine the TwitterTweet data.
    if (tweet.willBeTimeConsuming()) {
        executor.execute(tweet);
    }
    // ...
 

2. The basic Thread class

The Thread class can also be used instead of ExecutorService in cases where there is no resource crunch, but ExecutorService is generally recommended for the benefits described above. Thread can be implemented as an anonymous class, like here.

3. The Runnable interface

The Runnable interface can be used to create anonymous classes, or classes that concurrently perform more than just a single task. In Loklak Server, TwitterScraper concurrently indexes the data to ElasticSearch, unshortens links, and cleans the data. Have a look at the implementation here.
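As a minimal illustration of the pattern (the printed task names are placeholders, not loklak code):

public class PostProcessingExample {
    public static void main(String[] args) {
        // One Runnable bundling several post-processing steps, the way
        // TwitterScraper combines cleaning, unshortening, and indexing:
        Runnable postProcessor = new Runnable() {
            @Override
            public void run() {
                System.out.println("clean tweet text");
                System.out.println("unshorten URL links");
                System.out.println("index JSON to ElasticSearch");
            }
        };
        new Thread(postProcessor).start();
    }
}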
