blog.fossasia.org

SearchServlet


Fetching Metadata in Loklak Server

  • Post author: Vibhor Verma
  • Post published: September 1, 2017
  • Post category: FOSSASIA/loklak

In Loklak Server, the multiscrapers work well, but they needed a metadata framework so that metadata is embedded with the scraped data. The metadata reports the parameters passed, the number of hits made on the target webpage to fetch results, and the number of results returned.

TwitterScraper has no metadata framework. Metadata is collected, but there are two issues:

1) The metadata fields are fed in directly while outputting the data.

2) Every scraper has different metadata fields, or none at all.

To improve this for the multiscraper system, I embedded the metadata by configuring it in the BaseScraper class and in the PostTimeline iterator. If the metadata is collected in BaseScraper itself, developers no longer need to think about it while working on individual scrapers and can concentrate on improving them.

I made the following changes in the code:

1) Input Get-Parameters

For scrapers, one of the metadata fields is the input parameters. I added them directly to the metadata block.

protected Post getMetadata() {
    Post metadata = new Post(true);
    metadata.put("hits", this.hits);
    metadata.put("count", this.count);
    metadata.put("scraper", this.scraperName);
    metadata.put("input_parameters", this.extra);
    return metadata;
}
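
For illustration, here is a minimal self-contained sketch of how such a metadata block could be assembled. It is not the Loklak code: a plain java.util.Map stands in for the Post class, and the class name MetadataSketch, its constructor, and the sample parameter values are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class MetadataSketch {
    // Hypothetical stand-ins for the BaseScraper fields read by getMetadata()
    private final String scraperName;
    private final Map<String, String> extra;
    private int hits = 0;
    private int count = 0;

    public MetadataSketch(String scraperName, Map<String, String> extra) {
        this.scraperName = scraperName;
        this.extra = extra;
    }

    // Builds the metadata block; a Map is used here in place of Loklak's Post
    public Map<String, Object> getMetadata() {
        Map<String, Object> metadata = new LinkedHashMap<>();
        metadata.put("hits", this.hits);
        metadata.put("count", this.count);
        metadata.put("scraper", this.scraperName);
        metadata.put("input_parameters", this.extra);
        return metadata;
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("query", "fossasia"); // sample input parameter, assumed for the demo
        MetadataSketch scraper = new MetadataSketch("example", params);
        System.out.println(scraper.getMetadata());
    }
}
```

Because the block is built in one place, every scraper that shares the base class reports the same set of fields.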

 

2) Hits and Counts

Hits refers to the number of times Loklak Server made a request to the target website, whereas count refers to the number of posts scraped by the scraper. Fetching this data was easy.

For count, I added a method putData in BaseScraper. It should be used to build the list of posts instead of creating the list directly. It includes a counter that counts the posts.

protected Post putData(Post typeArray, String key, JSONArray postList) {
    this.count = this.count + postList.length();
    typeArray.put(key, postList);
    return typeArray;
}

 

For hits, I counted the number of times a URL was fed into ClientConnection.

public Post getDataFromConnection(String url, String type) throws IOException {
    // This adds to the hits count even if the connection fails
    this.hits++;
    ClientConnection connection = new ClientConnection(url);
.
.
.
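
The two counters can be sketched together as follows. This is a simplified stand-in rather than the Loklak implementation: the class CounterSketch, the method recordHit, the getters, and the list-based putData signature are hypothetical, and no network connection is made.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CounterSketch {
    private int hits = 0;   // requests attempted against the target site
    private int count = 0;  // posts collected by the scraper

    // Called once per URL before connecting, so failed connections count too
    public void recordHit(String url) {
        this.hits++;
    }

    // Stores a list of posts under a key and adds its size to the post counter
    public Map<String, List<String>> putData(Map<String, List<String>> typeArray,
                                             String key, List<String> postList) {
        this.count += postList.size();
        typeArray.put(key, postList);
        return typeArray;
    }

    public int getHits() { return hits; }

    public int getCount() { return count; }

    public static void main(String[] args) {
        CounterSketch scraper = new CounterSketch();
        Map<String, List<String>> output = new HashMap<>();

        // Two page requests (placeholder URLs), then three posts stored in one call
        scraper.recordHit("https://example.org/page1");
        scraper.recordHit("https://example.org/page2");
        List<String> posts = new ArrayList<>();
        posts.add("first post");
        posts.add("second post");
        posts.add("third post");
        scraper.putData(output, "example", posts);

        System.out.println("hits=" + scraper.getHits() + ", count=" + scraper.getCount());
    }
}
```

Routing every list of posts through putData is what keeps the count accurate: a scraper that builds its output list directly would bypass the counter.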

 

3) For multiscrapers in Search Endpoint

This task was a bit tricky. To create a single metadata block covering all the scrapers, I had to fetch the metadata block of each scraper, process them, and then output the result along with the search results. I added this to the PostTimeline iterator and implemented it as a loop over the data each scraper outputs.

public void collectMetadata(JSONObject metadata) {
    // INITIALIZE PARAMETERS
    int hits = 0;
    int count = 0;
    Set<String> scrapers = new HashSet<>();

    // GET LIST OF KEYS IN SCRAPER
    List<String> listKeys = new ArrayList<String>(this.posts.keySet());
    int n = listKeys.size();

    for (int i = 0; i < n; i++) {
        // FETCH METADATA POST FROM SCRAPED DATA
        Post postMetadata = (Post) this.posts.get(listKeys.get(i)).get("metadata");
        hits = hits + Integer.parseInt(String.valueOf(postMetadata.get("hits")));
        count = count + Integer.parseInt(String.valueOf(postMetadata.get("count")));
        // STORE THE SCRAPER NAME AS A STRING TO MATCH THE SET'S ELEMENT TYPE
        scrapers.add(String.valueOf(postMetadata.get("scraper")));
    }

    // SET OUTPUT
    metadata.put("hits", hits);
    metadata.put("count", count);
    metadata.put("scraper_count", scrapers.size());
    metadata.put("scrapers", scrapers);
}
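
The aggregation step can be tried in isolation with the sketch below. It makes several assumptions: plain maps replace Post and JSONObject, the class AggregateSketch and the helper block are hypothetical, and a TreeSet is swapped in for the HashSet so that the scraper names come out in a stable order.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class AggregateSketch {
    // Sums hits and counts over per-scraper metadata blocks and collects the scraper names
    public static Map<String, Object> collectMetadata(List<Map<String, Object>> blocks) {
        int hits = 0;
        int count = 0;
        Set<String> scrapers = new TreeSet<>(); // sorted, unlike the HashSet in the post

        for (Map<String, Object> block : blocks) {
            hits += Integer.parseInt(String.valueOf(block.get("hits")));
            count += Integer.parseInt(String.valueOf(block.get("count")));
            scrapers.add(String.valueOf(block.get("scraper")));
        }

        Map<String, Object> metadata = new LinkedHashMap<>();
        metadata.put("hits", hits);
        metadata.put("count", count);
        metadata.put("scraper_count", scrapers.size());
        metadata.put("scrapers", scrapers);
        return metadata;
    }

    // Hypothetical helper that builds one scraper's metadata block for the demo
    public static Map<String, Object> block(String scraper, int hits, int count) {
        Map<String, Object> m = new LinkedHashMap<>();
        m.put("hits", hits);
        m.put("count", count);
        m.put("scraper", scraper);
        return m;
    }

    public static void main(String[] args) {
        List<Map<String, Object>> blocks = new ArrayList<>();
        blocks.add(block("scraperA", 2, 40)); // made-up sample values
        blocks.add(block("scraperB", 1, 15));
        System.out.println(collectMetadata(blocks));
    }
}
```

Collecting names in a set means a scraper that contributes more than one block is still counted once in scraper_count.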

 

References

  • Crawlers and Metadata Extraction (Stuff that needs to be solved): https://vimeo.com/53109189
  • Why Metadata? https://www.villanovau.com/resources/bi/metadata-importance-in-data-driven-world/#.WZmbMKvhXeQ