In Loklak Server, the multiscrapers were working fine, but there was a need to set up a metadata framework so that metadata is embedded with the data. The metadata reports the parameters passed, the number of hits made on the webpage to fetch results, and the number of results returned.
There was no metadata framework for TwitterScraper. Metadata was collected, but there were two issues:
1) The metadata fields were fed in directly while outputting the data.
2) Every scraper had different metadata fields, or none at all.
To improve this for the multiscraper system, I embedded the metadata by configuring it in the BaseScraper class and in the PostTimeline iterator. If the metadata is collected in BaseScraper itself, developers no longer have to think about it while working on scrapers and can concentrate on improving the scrapers themselves.
These are the changes I made in the code:
1) Input GET parameters
For scrapers, one of the metadata fields is the set of input parameters. I added them directly to the metadata block.
2) Hits and Counts
Hits refers to the number of times Loklak Server made a request to the target website, whereas counts refers to the number of posts scraped by the scraper. Fetching this data was easy.
For the count, I added a putData method to BaseScraper. It should be used to build the list of posts instead of creating the list directly. I added a counter there that counts the posts.
For hits, I just counted the number of times the URL was fed into ClientConnection method.
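The first two changes can be sketched roughly as follows. This is a minimal illustration, not the actual Loklak code: BaseScraperSketch, StubScraper, fetch, download and getMetadata are hypothetical names, and the metadata keys are assumptions. Only BaseScraper and putData come from the description above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified stand-in for a scraper base class (hypothetical, not Loklak's code).
abstract class BaseScraperSketch {
    private final Map<String, String> inputParameters;
    private final List<String> posts = new ArrayList<>();
    private int hits = 0;   // requests made to the target website
    private int count = 0;  // posts collected by the scraper

    BaseScraperSketch(Map<String, String> inputParameters) {
        this.inputParameters = new LinkedHashMap<>(inputParameters);
    }

    // Scrapers fetch pages through this method, so every request is counted as a hit.
    protected String fetch(String url) {
        hits++;
        return download(url);
    }

    // The actual network call is left to the subclass.
    protected abstract String download(String url);

    // Scrapers add posts through this method, so the counter stays in sync with the list.
    protected void putData(String post) {
        posts.add(post);
        count++;
    }

    public abstract void scrape();

    // The metadata block is assembled in one place, with the same fields for every scraper.
    public Map<String, Object> getMetadata() {
        Map<String, Object> metadata = new LinkedHashMap<>();
        metadata.put("input_parameters", inputParameters);
        metadata.put("hits", hits);
        metadata.put("count", count);
        return metadata;
    }

    public List<String> getPosts() {
        return posts;
    }
}

// Example scraper with a canned response instead of a real HTTP request.
class StubScraper extends BaseScraperSketch {
    StubScraper(Map<String, String> inputParameters) {
        super(inputParameters);
    }

    @Override
    protected String download(String url) {
        return "post one\npost two\npost three";
    }

    @Override
    public void scrape() {
        for (String line : fetch("https://example.org/search").split("\n")) {
            putData(line);
        }
    }
}
```

With this layout, a scraper author only calls fetch and putData. The hits and count fields are filled in without any extra code in the scraper.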
3) Multiscrapers in the search endpoint
This was a bit tricky. To create a metadata block covering all the scrapers, I had to fetch the metadata block of each scraper, process them, and then output the result together with the data. I added this to the PostTimeline iterator, where it runs in a loop each time a scraper outputs data.
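One way to picture the merging step is the sketch below. It is only an illustration under assumptions: TimelineSketch is a hypothetical collector standing in for the role of the PostTimeline iterator, addScraperOutput is an invented method name, and the "hits", "count" and "scrapers" keys are assumed, not taken from the Loklak code.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical collector that merges posts and metadata from several scrapers.
class TimelineSketch {
    private final List<String> posts = new ArrayList<>();
    private final Map<String, Map<String, Object>> perScraper = new LinkedHashMap<>();
    private int totalHits = 0;
    private int totalCount = 0;

    // Called once per scraper, in a loop, as each scraper outputs its data.
    void addScraperOutput(String scraperName,
                          List<String> scraperPosts,
                          Map<String, Object> scraperMetadata) {
        posts.addAll(scraperPosts);
        // Keep each scraper's own metadata block under its name.
        perScraper.put(scraperName, new LinkedHashMap<>(scraperMetadata));
        // Add up the numeric fields for the combined block (missing fields count as 0).
        totalHits += ((Number) scraperMetadata.getOrDefault("hits", 0)).intValue();
        totalCount += ((Number) scraperMetadata.getOrDefault("count", 0)).intValue();
    }

    // Combined metadata block that is output together with the results.
    Map<String, Object> getMetadata() {
        Map<String, Object> metadata = new LinkedHashMap<>();
        metadata.put("hits", totalHits);
        metadata.put("count", totalCount);
        metadata.put("scrapers", perScraper);
        return metadata;
    }

    List<String> getPosts() {
        return posts;
    }
}
```

Keeping the per-scraper blocks next to the totals is a design choice made for this sketch. It lets a caller see both the overall numbers and which scraper contributed what.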
- Crawlers and Metadata Extraction (Stuff that needs to be solved): https://vimeo.com/53109189
- Why Metadata? https://www.villanovau.com/resources/bi/metadata-importance-in-data-driven-world/#.WZmbMKvhXeQ