Simplifying Scrapers using BaseScraper

Loklak Server's main function is to scrape data from websites and other sources and output it in different formats such as JSON, XML and RSS. There are many scrapers in the project, but they are implemented with different designs and libraries, which makes them inconsistent with each other and difficult to change. Because of this variation, an issue that appears in one scraper has to be fixed separately in each of the others. This signals a fault in the design. To solve this problem, inheritance can be brought in. I created the abstract class BaseScraper so that each scraper concentrates on fetching data from HTML, while all supporting tasks, such as creating a connection from a URL, are defined in BaseScraper.

The concept is easy to implement, but a complete implementation requires going through the full list of tasks a scraper performs. The following are those tasks, with descriptions of how they are implemented using BaseScraper:

Endpoint that triggers the scraper

Every search scraper inherits from the class AbstractAPIHandler, which is used to fetch the GET parameters from the endpoint; these parameters determine what data is scraped. The arguments of the serviceImpl method are used to generate the output, which is returned as a JSONObject. For this task, the method serviceImpl has been defined in BaseScraper, and the method getData is implemented to return the output. serviceImpl is the driver method of the scraper.

    public JSONObject serviceImpl(Query call, HttpServletResponse response, Authorization rights, JSONObjectWithDefault permissions) throws APIException {
        this.setExtra(call);
        return this.getData().toJSON(false, "metadata", "posts");
    }

Constructor

The constructor of a scraper defines the base URL of the website to be scraped, the name of the scraper and the data structure that holds all GET parameters passed to the scraper.
For the GET parameters, a Map is used to fetch them from the Query object. Since every scraper has its own base URL, scraper name and set of GET parameters, the constructor is implemented in the respective scrapers. QuoraProfileScraper is an example that has these variables defined.

Get all input variables

To get all input variables, BaseScraper defines setters and getters that fetch them as a Map from the Query object. There is also an abstract method setParam(). It is defined in the respective scrapers to pick out the parameters that are useful for that scraper and assign them to the scraper's class variables.

    // Setter for GET parameters from the call object
    protected void setExtra(Query call) {
        this.extra = call.getMap();
        this.query = call.get("query", "");
        this.setParam();
    }

    // Getter for a GET parameter with respect to its key
    public String getExtraValue(String key) {
        String value = "";
        if(this.extra.get(key) != null) {
            value = this.extra.get(key).trim();
        }
        return value;
    }

    // Definition in QuoraProfileScraper
    protected void setParam() {
        if(!"".equals(this.getExtraValue("type"))) {
            this.typeList = Arrays.asList(this.getExtraValue("type").trim().split("\\s*,\\s*"));
        } else {
            this.typeList = new ArrayList<String>();
            this.typeList.add("all");
            this.setExtraValue("type", String.join(",", this.typeList));
        }
    }

URL creation for web scraper

The URL creation shall be implemented in a separate method, as in TwitterScraper. The following is the rough…
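
The division of work described in this post (a base class that owns the driver method and the parameter handling, while each scraper supplies only its own specifics) can be illustrated without any of the server's own types. The following is a minimal sketch under stated assumptions, not the project's implementation: the class names SketchBaseScraper and SketchProfileScraper and the method serve are hypothetical, a plain Map stands in for the Query object, and a String stands in for the JSON output.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified stand-in for BaseScraper (no servlet or JSON types).
abstract class SketchBaseScraper {
    protected Map<String, String> extra = new HashMap<>();
    protected String query = "";

    // Driver method: store the parameters, then let the subclass build the output.
    public final String serve(Map<String, String> params) {
        this.setExtra(params);
        return this.getData();
    }

    // Copies the GET parameters and triggers the scraper-specific setup.
    protected void setExtra(Map<String, String> params) {
        this.extra = new HashMap<>(params);
        this.query = params.getOrDefault("query", "");
        this.setParam();
    }

    // Returns the trimmed value for a key, or an empty string if it is absent.
    public String getExtraValue(String key) {
        String value = this.extra.get(key);
        return value == null ? "" : value.trim();
    }

    // Each scraper decides which parameters it needs and how to produce data.
    protected abstract void setParam();
    protected abstract String getData();
}

// Hypothetical scraper that only reads a comma-separated "type" parameter.
class SketchProfileScraper extends SketchBaseScraper {
    List<String> typeList;

    @Override
    protected void setParam() {
        String type = this.getExtraValue("type");
        this.typeList = type.isEmpty()
            ? Arrays.asList("all")                    // assumed default
            : Arrays.asList(type.split("\\s*,\\s*"));
    }

    @Override
    protected String getData() {
        // A real scraper would fetch and parse HTML here.
        return "query=" + this.query + " types=" + this.typeList;
    }
}

class SketchScraperDemo {
    public static void main(String[] args) {
        Map<String, String> params = new HashMap<>();
        params.put("query", "fossasia");
        params.put("type", "users, posts");
        System.out.println(new SketchProfileScraper().serve(params));
    }
}
```

With this shape, a fix to parameter handling is made once in the base class and every scraper picks it up, which is the motivation given at the start of the post.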

Iterating the Loklak Server data

Loklak Server is amazing for what it does, but how it does its tasks is even more impressive. It is well known what iterators are used for and how to use them, but this project has a customized iterator that iterates over Twitter data objects. This iterator is Timeline.java. Timeline implements the interface Iterable (isn't it Iterator?). Implementing Iterable is what allows a Timeline to be used in a for-each loop, and the class adds methods to modify, use or create the data objects. At present, it only iterates over Twitter data objects. I am working on modifying it to iterate over data objects from all web scrapers.

The following is a simple example of how an iterator is used.

    // Initializing the list
    List<String> stringsList = Arrays.asList("foo", "bar", "baz");

    // Using an iterator to display the contents of stringsList
    System.out.print("Contents of stringsList: ");
    Iterator<String> iter = stringsList.iterator();
    while(iter.hasNext()) {
        System.out.print(iter.next() + " ");
    }

This iterator can only iterate over data the way an array does. (Then why do we need it?) It iterates over objects perfectly well, but we can add more functionality to an iterator.

The Timeline iterator iterates over MessageEntry objects; MessageEntry is the superclass of TwitterTweet. According to the Javadocs, "Timeline is a structure which holds tweet for the purpose of presentation, There is no tweet retrieval method here, just an iterator which returns the tweets in reverse appearing order."

Following are some of the tasks it does:

As an iterator:

The basic use of Timeline is to iterate over MessageEntry objects. It not only iterates over the data objects, but also fetches them.

    // Declare the Timeline object according to the order in which the data objects were created
    Timeline tline = new Timeline(Timeline.parseOrder("created_at"));

    // Adding data objects to the timeline
    tline.add(me1);
    tline.add(me2);
    . . .

    // Outputting all data objects as an array of JSON objects
    JSONArray postArray = new JSONArray();
    for (MessageEntry me : tline) {
        postArray.put(me.toJSON());
    }

The order of iterating the data objects

Timeline can arrange and iterate over the data objects according to the creation date of the Twitter post, the number of retweets or the number of favourites. For this, the Timeline class declares an enum Order, which is set when the Timeline object is created.

    Timeline tline = new Timeline(Timeline.parseOrder("created_at"));

Pagination of data objects

There is a cursor object, along with some methods including getters and setters, to support pagination of the data objects. It is currently used only internally, but it can also be used to return a section of the result.

writeToIndex method

This method can be used to write all data fetched by the Timeline iterator to ElasticSearch for indexing, and to a dump that can be used for testing. Thus, the data can be indexed concurrently while it is being iterated.

Other methods

It also has methods to output all data as JSON, a customized method to add data to the Timeline while keeping the user object and the data separate, etc. There is a bit more in this iterable class that is worth exploring.…
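
To make the combination of Iterable and a configurable order concrete, here is a minimal sketch of a container that can be used in a for-each loop and returns its elements sorted by a key chosen at construction time. It is an illustration under stated assumptions, not the Timeline class itself: SketchPost, SketchOrder and SketchTimeline are hypothetical names, the post has only the fields needed for sorting, and returning the largest value first is an assumption made for the example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for a data object; only the fields used for ordering.
final class SketchPost {
    final String id;
    final long createdAt;
    final int retweets;

    SketchPost(String id, long createdAt, int retweets) {
        this.id = id;
        this.createdAt = createdAt;
        this.retweets = retweets;
    }
}

// Hypothetical ordering keys, chosen when the container is created.
enum SketchOrder { CREATED_AT, RETWEET_COUNT }

// Implementing Iterable<T> (rather than Iterator<T>) enables for-each loops.
final class SketchTimeline implements Iterable<SketchPost> {
    private final List<SketchPost> posts = new ArrayList<>();
    private final Comparator<SketchPost> comparator;

    SketchTimeline(SketchOrder order) {
        Comparator<SketchPost> ascending;
        if (order == SketchOrder.RETWEET_COUNT) {
            ascending = Comparator.comparingInt(p -> p.retweets);
        } else {
            ascending = Comparator.comparingLong(p -> p.createdAt);
        }
        // Assumption for this sketch: largest key first (newest or most retweeted).
        this.comparator = ascending.reversed();
    }

    void add(SketchPost post) {
        posts.add(post);
    }

    @Override
    public Iterator<SketchPost> iterator() {
        // Sort a copy so that iteration never reorders the stored list.
        List<SketchPost> sorted = new ArrayList<>(posts);
        sorted.sort(comparator);
        return sorted.iterator();
    }
}

class SketchTimelineDemo {
    public static void main(String[] args) {
        SketchTimeline tline = new SketchTimeline(SketchOrder.CREATED_AT);
        tline.add(new SketchPost("a", 1L, 5));
        tline.add(new SketchPost("b", 3L, 1));
        tline.add(new SketchPost("c", 2L, 9));
        for (SketchPost post : tline) {
            System.out.println(post.id + " created_at=" + post.createdAt);
        }
    }
}
```

Because the container only exposes iterator(), extra behaviour such as pagination or indexing can be layered on top without changing the code that loops over it.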
