Fetching Metadata in Loklak Server

In Loklak Server, the multiscrapers were working fine, but a metadata framework was still needed so that metadata could be embedded with the scraped data. The metadata reports the parameters passed, the number of hits made on the web page to fetch results, and the number of results returned.

TwitterScraper has no metadata framework. Metadata is collected, but there are two issues:

1) The metadata fields are fed in directly while outputting the data.

2) Every scraper has different metadata fields, or none at all.

To improve this for the multiscraper system, I embedded the metadata by configuring it in the BaseScraper class and in the PostTimeline iterator. If the metadata is collected in BaseScraper itself, developers working on individual scrapers no longer need to handle it and can concentrate on improving the scrapers.

These are the changes I made in the code:

1) Input Get-Parameters

For scrapers, one of the metadata fields is the input parameters. I added them directly to the metadata block.

protected Post getMetadata() {
    Post metadata = new Post(true);
    metadata.put("hits", this.hits);
    metadata.put("count", this.count);
    metadata.put("scraper", this.scraperName);
    metadata.put("input_parameters", this.extra);
    return metadata;
}
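To illustrate why building the block in the base class keeps individual scrapers free of metadata code, here is a minimal, self-contained sketch. It is not the Loklak implementation: `SketchScraper`, `DemoScraper`, and `MetadataDemo` are hypothetical names, and a plain `Map` stands in for Loklak's `Post` class.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for BaseScraper; Post is replaced by a plain Map.
abstract class SketchScraper {
    protected int hits = 0;
    protected int count = 0;
    protected final String scraperName;
    protected final Map<String, String> extra; // input GET parameters

    SketchScraper(String scraperName, Map<String, String> extra) {
        this.scraperName = scraperName;
        this.extra = extra;
    }

    // Same four fields as getMetadata() above, built once in the base class.
    protected Map<String, Object> getMetadata() {
        Map<String, Object> metadata = new LinkedHashMap<>();
        metadata.put("hits", this.hits);
        metadata.put("count", this.count);
        metadata.put("scraper", this.scraperName);
        metadata.put("input_parameters", this.extra);
        return metadata;
    }
}

// A concrete scraper only supplies its name; the metadata block is inherited.
class DemoScraper extends SketchScraper {
    DemoScraper(Map<String, String> extra) {
        super("demo", extra);
    }
}

class MetadataDemo {
    // Builds a scraper with one input parameter and returns its metadata.
    static Map<String, Object> run() {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("query", "fossasia");
        return new DemoScraper(params).getMetadata();
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

In this sketch every subclass reports the same set of fields, which is the uniformity the second issue above asks for.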

 

2) Hits and Counts

Hits refers to the number of times Loklak Server requested the target website, whereas count refers to the number of posts scraped by the scraper. Fetching this data was easy.

For count, I added a method putData in BaseScraper. It should be used to build the list of posts instead of creating the list directly. It includes a counter that counts the posts.

protected Post putData(Post typeArray, String key, JSONArray postList) {
    this.count = this.count + postList.length();
    typeArray.put(key, postList);
    return typeArray;
}
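As a rough illustration of how the counter accumulates, the sketch below calls a `putData`-style method twice. It assumes stand-in types (`Map` for `Post`, `List` for `JSONArray`); `CountingScraper` and `CountDemo` are hypothetical names, not Loklak classes.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in: Post becomes a Map, JSONArray becomes a List.
class CountingScraper {
    protected int count = 0;

    // Mirrors putData(): every list stored through this method is counted.
    protected Map<String, Object> putData(Map<String, Object> typeArray,
                                          String key, List<String> postList) {
        this.count += postList.size();
        typeArray.put(key, postList);
        return typeArray;
    }

    int getCount() {
        return this.count;
    }
}

class CountDemo {
    // Stores two lists (2 posts and 1 post) and returns the running count.
    static int run() {
        CountingScraper scraper = new CountingScraper();
        Map<String, Object> output = new LinkedHashMap<>();
        scraper.putData(output, "first", Arrays.asList("post 1", "post 2"));
        scraper.putData(output, "second", Arrays.asList("post 3"));
        return scraper.getCount();
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

The count is only accurate if scrapers route every list through this method; a list added to the output directly would go uncounted.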

 

For hits, I simply counted the number of times a URL was passed to ClientConnection.

public Post getDataFromConnection(String url, String type) throws IOException {
    // This adds to the hits count even if the connection fails
    this.hits++;
    ClientConnection connection = new ClientConnection(url);
.
.
.
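Because the counter is incremented before the connection is opened, a failed request still counts as a hit. The sketch below shows that behaviour under stated assumptions: `HitCountingScraper`, `HitsDemo`, and the `fetch` helper are hypothetical, and `fetch` only simulates a connection instead of using Loklak's `ClientConnection`.

```java
import java.io.IOException;

class HitCountingScraper {
    protected int hits = 0;

    // Hypothetical fetch step that simulates a connection; it fails on an empty URL.
    protected String fetch(String url) throws IOException {
        if (url.isEmpty()) {
            throw new IOException("empty URL");
        }
        return "<html>" + url + "</html>";
    }

    public String getDataFromConnection(String url) throws IOException {
        this.hits++; // incremented before the request, so failures still count
        return fetch(url);
    }

    int getHits() {
        return this.hits;
    }
}

class HitsDemo {
    // Makes one successful and one failing request, then returns the hit count.
    static int run() {
        HitCountingScraper scraper = new HitCountingScraper();
        try {
            scraper.getDataFromConnection("https://example.org");
            scraper.getDataFromConnection(""); // throws, but was already counted
        } catch (IOException e) {
            // Ignored on purpose: the demo only cares about the counter.
        }
        return scraper.getHits();
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

Counting attempts rather than successes means the hits value reflects the load placed on the target site, not the number of pages actually retrieved.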

 

3) Metadata for multiscrapers in the Search endpoint

This was a bit tricky. To create a single metadata block covering all the scrapers, I had to fetch the metadata block of each scraper, process them, and then output the combined block with the results. I added this to the PostTimeline iterator, where it runs in a loop as the scrapers output their data.

public void collectMetadata(JSONObject metadata) {
    // Running totals across all scrapers
    int hits = 0;
    int count = 0;
    Set<String> scrapers = new HashSet<String>();

    // Keys of this.posts, one entry per scraper's data
    List<String> listKeys = new ArrayList<String>(this.posts.keySet());

    for (String key : listKeys) {
        // Fetch the metadata Post from the scraped data
        Post postMetadata = (Post) this.posts.get(key).get("metadata");
        hits += Integer.parseInt(String.valueOf(postMetadata.get("hits")));
        count += Integer.parseInt(String.valueOf(postMetadata.get("count")));
        // The Set removes duplicate scraper names
        scrapers.add(String.valueOf(postMetadata.get("scraper")));
    }

    // Write the aggregated values into the output metadata
    metadata.put("hits", hits);
    metadata.put("count", count);
    metadata.put("scraper_count", scrapers.size());
    metadata.put("scrapers", scrapers);
}
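The aggregation step can be tried in isolation with the sketch below. It is only an approximation of the method above: `MetadataAggregator` and `AggregateDemo` are hypothetical names, and plain `Map` objects replace `Post` and `JSONObject`.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class MetadataAggregator {
    // Sums hits and counts and collects the distinct scraper names.
    static Map<String, Object> collect(List<Map<String, Object>> perScraper) {
        int hits = 0;
        int count = 0;
        Set<String> scrapers = new LinkedHashSet<>();
        for (Map<String, Object> m : perScraper) {
            hits += Integer.parseInt(String.valueOf(m.get("hits")));
            count += Integer.parseInt(String.valueOf(m.get("count")));
            scrapers.add(String.valueOf(m.get("scraper")));
        }
        Map<String, Object> metadata = new LinkedHashMap<>();
        metadata.put("hits", hits);
        metadata.put("count", count);
        metadata.put("scraper_count", scrapers.size());
        metadata.put("scrapers", scrapers);
        return metadata;
    }
}

class AggregateDemo {
    // Builds one per-scraper metadata block with made-up values.
    static Map<String, Object> block(String scraper, int hits, int count) {
        Map<String, Object> m = new LinkedHashMap<>();
        m.put("scraper", scraper);
        m.put("hits", hits);
        m.put("count", count);
        return m;
    }

    // Aggregates three blocks, two of which come from the same scraper.
    static Map<String, Object> run() {
        List<Map<String, Object>> blocks = new ArrayList<>();
        blocks.add(block("a", 2, 10));
        blocks.add(block("b", 3, 20));
        blocks.add(block("a", 1, 5));
        return MetadataAggregator.collect(blocks);
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

Using a set for the scraper names means a scraper that contributes several result blocks is still counted once in `scraper_count`.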

 

References

  • Crawlers and Metadata Extraction (Stuff that needs to be solved): https://vimeo.com/53109189
  • Why Metadata? https://www.villanovau.com/resources/bi/metadata-importance-in-data-driven-world/#.WZmbMKvhXeQ


Published by Vibhor Verma

Posted on September 1, 2017; updated January 24, 2018. Categories: FOSSASIA, loklak. Tags: BaseScraper, FOSSASIA, json, loklak, loklak server, MetaData, SearchServlet, software development, Timeline Iterator, web scraping
