Simplifying Scrapers using BaseScraper
Loklak Server’s main function is to scrape data from websites and other sources and output it in formats such as JSON, XML and RSS. The project contains many scrapers, but they are implemented with different designs and libraries, which makes them inconsistent with one another and difficult to change.
Because the scrapers vary in design, the same issue has to be fixed separately in each of them whenever it appears, which signals a fault in the design. Inheritance solves this problem. I created the abstract class BaseScraper so that each scraper concentrates on fetching data from HTML, while supporting tasks such as creating a connection from a URL are defined once in BaseScraper.
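The division of work described above can be sketched as follows. This is a minimal illustration, not the Loklak Server code: the class names SketchBaseScraper and SketchProfileScraper, the run method, the example.org URL and the canned HTML returned by fetch are all assumptions made so that the example runs offline.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified stand-in for the BaseScraper idea; not the Loklak Server class.
abstract class SketchBaseScraper {
    protected final String baseUrl;
    protected final Map<String, String> params = new HashMap<>();

    protected SketchBaseScraper(String baseUrl) {
        this.baseUrl = baseUrl;
    }

    // Shared driver: every scraper runs the same steps in the same order.
    public final String run(Map<String, String> input) {
        params.putAll(input);
        String url = prepareUrl();
        String html = fetch(url);
        return scrape(html);
    }

    // Shared supporting task. A real scraper would open an HTTP connection here;
    // this sketch returns canned HTML so that it runs without network access.
    protected String fetch(String url) {
        return "<title>" + url + "</title>";
    }

    // Site-specific steps that each scraper must define.
    protected abstract String prepareUrl();

    protected abstract String scrape(String html);
}

// Hypothetical concrete scraper: only the site-specific parts live here.
class SketchProfileScraper extends SketchBaseScraper {
    SketchProfileScraper() {
        super("https://example.org/");
    }

    @Override
    protected String prepareUrl() {
        return baseUrl + "profile/" + params.getOrDefault("query", "");
    }

    @Override
    protected String scrape(String html) {
        // Extract the text between <title> and </title>.
        int start = html.indexOf("<title>") + "<title>".length();
        int end = html.indexOf("</title>");
        return html.substring(start, end);
    }
}
```

With this layout, a fix to the shared steps (for example in fetch) is made once in the base class and reaches every scraper that extends it.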
The concept is easy to implement, but a complete implementation requires going through the full list of tasks a scraper performs.
The following are those tasks, with a description of each and how it is implemented using BaseScraper:
- Endpoint that triggers the scraper
Every search scraper inherits the class AbstractAPIHandler, which is used to fetch the GET parameters from the endpoint; the data is scraped according to these parameters. The arguments of the serviceImpl method are used to generate the output, which is returned as a JSONObject.
For this task, the serviceImpl method is defined in BaseScraper, and the getData method is implemented to return the output. This method is the driver method of the scraper.
public JSONObject serviceImpl(Query call, HttpServletResponse response, Authorization rights,
        JSONObjectWithDefault permissions) throws APIException {
    this.setExtra(call);
    return this.getData().toJSON(false, "metadata", "posts");
}
- Constructor
The constructor of a scraper defines the base URL of the website to be scraped, the name of the scraper, and the data structure that holds all GET parameters passed to the scraper. A Map is used to fetch the GET parameters from the Query object.
Since every scraper has its own base URL, scraper name and GET parameters, the constructor is implemented in the respective scrapers. QuoraProfileScraper is an example that defines these variables.
- Get all input variables
To get all input variables, BaseScraper defines setters and getters that fetch them as a Map from the Query object. There is also an abstract method, setParam(), which each scraper defines to pick out the parameters it needs and assign them to its class variables.
// Setter for GET parameters from the call object
protected void setExtra(Query call) {
    this.extra = call.getMap();
    this.query = call.get("query", "");
    this.setParam();
}

// Getter for a GET parameter by its key
public String getExtraValue(String key) {
    String value = "";
    if (this.extra.get(key) != null) {
        value = this.extra.get(key).trim();
    }
    return value;
}

// Definition in QuoraProfileScraper
protected void setParam() {
    if (!"".equals(this.getExtraValue("type"))) {
        this.typeList = Arrays.asList(this.getExtraValue("type").trim().split("\\s*,\\s*"));
    } else {
        this.typeList = new ArrayList<String>();
        this.typeList.add("all");
        this.setExtraValue("type", String.join(",", this.typeList));
    }
}
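To make the parsing in setParam() concrete, here is a small standalone sketch of the same logic: a non-empty type value is split on commas (ignoring surrounding whitespace), and an empty or missing value falls back to "all". The class TypeParamSketch and the method parseTypes are hypothetical names used only for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper that isolates the "type" parsing shown in setParam().
class TypeParamSketch {
    static List<String> parseTypes(String raw) {
        String value = raw == null ? "" : raw.trim();
        if (value.isEmpty()) {
            // No type given: default to scraping everything.
            List<String> defaults = new ArrayList<>();
            defaults.add("all");
            return defaults;
        }
        // The regex removes any whitespace around each comma while splitting.
        return Arrays.asList(value.split("\\s*,\\s*"));
    }
}
```

For example, parseTypes(" question , answer,user ") yields the three entries question, answer and user, while parseTypes("") yields the single entry all.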
- URL creation for web scraper
URL creation should be implemented in a separate method, as in TwitterScraper. The following is a rough implementation adapted from one of my pull requests:
protected String prepareSearchUrl(String type) {
    URIBuilder url = null;
    String midUrl = "search/";
    try {
        switch(type) {
            case "question":
                url = new URIBuilder(this.baseUrl + midUrl);
                url.addParameter("q", this.query);
                url.addParameter("type", "question");
                .
                .
        }
        .
        .
    return url.toString();
}
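The snippet above relies on URIBuilder from Apache HttpClient and is abbreviated. As a self-contained illustration of the same idea, the sketch below builds a search URL with only the Java standard library, swapping in URLEncoder for URIBuilder. The class SearchUrlSketch, the method buildUrl and the example.org base URL are assumptions, and the URLEncoder.encode(String, Charset) overload requires Java 10 or later.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: joins a base URL, a path and encoded query parameters.
class SearchUrlSketch {
    static String buildUrl(String baseUrl, String path, Map<String, String> params) {
        StringBuilder url = new StringBuilder(baseUrl).append(path);
        char separator = '?';
        for (Map.Entry<String, String> entry : params.entrySet()) {
            // URLEncoder applies form encoding, so a space becomes '+'.
            url.append(separator)
               .append(URLEncoder.encode(entry.getKey(), StandardCharsets.UTF_8))
               .append('=')
               .append(URLEncoder.encode(entry.getValue(), StandardCharsets.UTF_8));
            separator = '&';
        }
        return url.toString();
    }

    // Convenience wrapper mirroring the "question" case: q=<query>&type=question.
    static String questionSearchUrl(String baseUrl, String query) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("q", query);
        params.put("type", "question");
        return buildUrl(baseUrl, "search/", params);
    }
}
```

Calling SearchUrlSketch.questionSearchUrl("https://example.org/", "open source") returns https://example.org/search/?q=open+source&type=question. Keeping URL construction in one method means a change in the site's URL scheme touches a single place.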
- Get BufferedReader object from InputStream
The getDataFromConnection method fetches the BufferedReader object from ClientConnection. The scrape method uses this object to read the web page line by line and fetch data. See here.
ClientConnection connection = new ClientConnection(url);
BufferedReader br = getHtml(connection);
.
.
.
public BufferedReader getHtml(ClientConnection connection) {
    if (connection.inputStream == null) {
        return null;
    }
    BufferedReader br = new BufferedReader(new InputStreamReader(connection.inputStream, StandardCharsets.UTF_8));
    return br;
}
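To show how a scrape method might consume such a reader, the following sketch reads an InputStream line by line and collects the lines that contain a given marker. It uses an in-memory stream in place of ClientConnection so that it runs without network access; LineReaderSketch, findLinesContaining and sample are hypothetical names, not part of Loklak Server.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class LineReaderSketch {
    // Reads the stream line by line and returns the trimmed lines containing the marker.
    static List<String> findLinesContaining(InputStream in, String marker) {
        List<String> matches = new ArrayList<>();
        if (in == null) {
            return matches; // same guard as getHtml: no stream, nothing to read
        }
        try (BufferedReader br = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = br.readLine()) != null) {
                if (line.contains(marker)) {
                    matches.add(line.trim());
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return matches;
    }

    // Wraps a string as an InputStream, standing in for a connection's input stream.
    static InputStream sample(String html) {
        return new ByteArrayInputStream(html.getBytes(StandardCharsets.UTF_8));
    }
}
```

The try-with-resources block closes the reader once the page has been read, which the caller would otherwise need to remember to do.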
- Scraping of data from HTML
The method that scrapes the data is declared abstract in BaseScraper and defined in each scraper. A good example of this implementation is the code for BaseScraper (see the code here) and for a scraper (here).
- Output of data
The output of the scrape method is collected in Post data objects that are implemented for the respective scraper. These Post objects are added to a Timeline iterator, which outputs the data as a JSONArray. The objects are then returned enclosed in a Post object wrapper.
The data could be output directly as a Post object, but adding it to the iterator allows the Post objects to be sorted and indexed to ElasticSearch.
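The benefit of collecting posts before output can be illustrated with a small sketch. SketchPost and SketchTimeline below are hypothetical stand-ins with an assumed timestamp field; the real Post and Timeline classes in Loklak Server are richer and also handle JSON output and indexing, which this example leaves out.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for a scraped post.
class SketchPost {
    final String text;
    final long timestamp;

    SketchPost(String text, long timestamp) {
        this.text = text;
        this.timestamp = timestamp;
    }
}

// Hypothetical stand-in for a timeline: collects posts so they can be ordered before output.
class SketchTimeline {
    private final List<SketchPost> posts = new ArrayList<>();

    void add(SketchPost post) {
        posts.add(post);
    }

    // Returns the post texts ordered from the newest timestamp to the oldest.
    List<String> textsNewestFirst() {
        List<SketchPost> sorted = new ArrayList<>(posts);
        sorted.sort(Comparator.comparingLong((SketchPost p) -> p.timestamp).reversed());
        List<String> texts = new ArrayList<>();
        for (SketchPost post : sorted) {
            texts.add(post.text);
        }
        return texts;
    }
}
```

Because the posts pass through one collection, ordering (and, in the real project, indexing) is applied in a single place instead of inside every scraper.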
Resources
- Loklak Server: https://github.com/loklak/loklak_server
- ElasticSearch: https://www.elastic.co/webinars/getting-started-elasticsearch?elektra=home&storm=sub1