I recently added a multiscraper system which can scrape data using web scrapers like YoutubeScraper, QuoraScraper, GithubScraper, etc. As scraping is a costly task, it is important to improve its efficiency. One approach is to index the scraped data in a cache; TwitterScraper already uses multiple sources to optimize efficiency.
This system uses the Post message-holder object to store data and PostTimeline (a specialized iterator) to iterate over those objects. Because these data structures differ from TwitterScraper's, indexing data to ElasticSearch needs a different approach (the implementation is currently in review).
These are the changes I made while implementing indexing of data in the project.
1) Writing of data is invoked only using the PostTimeline iterator
In TwitterScraper, the data is written in the message holder TwitterTweet, so tweets are written to the index as they are created. Here, writing of the posts is initiated only once the data has been scraped. Since scraping is a heavy process, deferring the writes keeps resource usage lower under average traffic on the server.
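A minimal, stdlib-only sketch of this idea: the scraper fills a complete PostTimeline first, and the writer is invoked exactly once with that iterator, never per-post. The class and method names here are illustrative, not the project's actual identifiers.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Illustrative message holder; the real Post carries the scraped fields.
class Post {
    final String id;
    Post(String id) { this.id = id; }
}

// Simplified stand-in for PostTimeline: an iterable collection of posts.
class PostTimeline implements Iterable<Post> {
    private final List<Post> posts = new ArrayList<>();
    void add(Post p) { posts.add(p); }
    public Iterator<Post> iterator() { return posts.iterator(); }
}

class IndexWriter {
    int written = 0;
    // Invoked once per timeline, after scraping has finished; the loop
    // body stands in for the actual index write.
    void writeData(PostTimeline timeline) {
        for (Post p : timeline) {
            written++;
        }
    }
}
```

The point of the design is that nothing touches the index while the scraper is still running; the single `writeData` call is the only entry point for writes.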
2) One object for holding a message
During the implementation, I kept the same message holder Post and post iterator PostTimeline all the way from scraping to indexing. This keeps the structure uniform. The earlier approach involved different types of message wrappers along the way; this one cuts out the extra looping and conversions between data structures.
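A sketch of what "one holder end to end" buys, under the assumption that Post is a JSON-style map of fields: the very same object the scraper fills is handed to the indexer, so no transitional wrapper or copy loop is needed. The names below are illustrative.

```java
import java.util.HashMap;
import java.util.Map;

// Assumed shape: Post as a map of field name to value, filled by the
// scraper and consumed unchanged by the indexer.
class Post extends HashMap<String, Object> { }

class Pipeline {
    // The scraper fills a Post directly.
    static Post scrape(String id, String text) {
        Post p = new Post();
        p.put("id", id);
        p.put("text", text);
        return p;
    }

    // The indexer consumes the very same instance: identity, no
    // conversion into a different message wrapper.
    static Map<String, Object> toIndexDocument(Post p) {
        return p;
    }
}
```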
3) Index a list, not a message
In TwitterScraper, messages are enqueued one by one to be indexed in bulk. In this approach, I enqueue complete lists instead, which delays indexing until the scraper is done processing the HTML.
Creating the queue of postlists:
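A stdlib-only sketch of such a queue, using `java.util.concurrent` (the class and method names are illustrative, not the project's actual identifiers):

```java
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Illustrative message holder.
class Post {
    final String id;
    Post(String id) { this.id = id; }
}

// The unit of work is a whole list of posts, not a single post: one
// entry per finished scrape, enqueued for later bulk indexing.
class PostListQueue {
    private final BlockingQueue<List<Post>> queue = new LinkedBlockingQueue<>();

    // Called once the scraper has finished processing the HTML.
    void enqueue(List<Post> postList) { queue.add(postList); }

    // Returns the next list to index, or null if the queue is empty.
    List<Post> dequeue() { return queue.poll(); }

    int size() { return queue.size(); }
}
```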
Indexing of the posts in postlists:
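A sketch of the consumer side: drain the queued lists and index each dequeued list as one batch. In the real project this would end in an ElasticSearch bulk write; here a plain list stands in for the index, and all names are illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Illustrative message holder.
class Post {
    final String id;
    Post(String id) { this.id = id; }
}

class BulkIndexer {
    private final BlockingQueue<List<Post>> queue = new LinkedBlockingQueue<>();
    final List<Post> indexed = new ArrayList<>();   // stand-in for the index

    void enqueue(List<Post> postList) { queue.add(postList); }

    // Index everything queued so far, one list at a time; each iteration
    // of the outer loop corresponds to one bulk request.
    void indexQueued() {
        List<List<Post>> drained = new ArrayList<>();
        queue.drainTo(drained);
        for (List<Post> postList : drained) {
            indexed.addAll(postList);
        }
    }
}
```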
4) Categorizing the input parameters
While searching the index, I divided the query parameters from the scraper into three categories. Input parameters are added to these categories (implemented using map data structures), and data is fetched according to them. The categories are:
a) Get the parameter – get results matching the input fields in map getMap.
b) Don't get the parameter – don't get results matching the input fields in map notGetMap.
c) Get if possible – get results with the input fields if they are present in the index.
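The three categories can be sketched as plain maps matched against a document. Only `getMap` and `notGetMap` are named in the text above; the third map's name (`mayAlsoGetMap`) and the matching logic here are my assumptions for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// A document matches when every field in getMap has the same value,
// no field in notGetMap has a matching value, and fields in the
// "get if possible" map are consulted only when the document has them.
class QueryCategories {
    final Map<String, String> getMap = new HashMap<>();        // must match
    final Map<String, String> notGetMap = new HashMap<>();     // must not match
    final Map<String, String> mayAlsoGetMap = new HashMap<>(); // assumed name

    boolean matches(Map<String, String> doc) {
        for (Map.Entry<String, String> e : getMap.entrySet()) {
            if (!e.getValue().equals(doc.get(e.getKey()))) return false;
        }
        for (Map.Entry<String, String> e : notGetMap.entrySet()) {
            if (e.getValue().equals(doc.get(e.getKey()))) return false;
        }
        // mayAlsoGetMap never rejects a document; in a real query it
        // would only narrow results when the field exists in the index.
        return true;
    }
}
```

In ElasticSearch terms, this corresponds roughly to a bool query with `must`, `must_not`, and `should` clauses (see the BoolQueryBuilder link below).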
With these changes, the scrapers shift from indexing single messages to indexing lists of messages. This keeps the load on RAM low, but the aggregation of the latest scraped data may be affected, so a workaround will be needed for this during scraping itself.
- Match query with “operator”:”and” via the Java API: https://discuss.elastic.co/t/match-query-with-operator-and-via-the-java-api/67863/2
- How to use BoolQueryBuilder: https://stackoverflow.com/questions/40923945/how-to-add-bool-query-inside-a-should-must-method-in-java-api