Backend Scraping in Loklak Server
Loklak Server is a peer-to-peer distributed scraping system. Besides scraping data from websites, it maintains other sources of data such as peers, local storage and a backend server. Drawing on these sources has its benefits: no costly requests to the websites, no scraping and no cleaning of data.
Loklak Server can maintain a secondary Loklak Server (a backend server) tuned for storing large amounts of data. The primary Loklak Server pushes all scraped data to the backend and, in return, can fetch data from it.
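The push half of this exchange can be pictured with a minimal sketch. The /api/push.json path follows Loklak's API naming, but the "data" form parameter and the helper names below are assumptions for illustration, not the actual implementation:

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class BackendPushSketch {

    // Minimal sketch: push a batch of scraped messages (as JSON) to a
    // backend peer. Authentication and retries are omitted for brevity.
    public static boolean pushToBackend(String backendHost, String jsonPayload) {
        try {
            URL url = new URL(backendHost + "/api/push.json");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("POST");
            conn.setDoOutput(true);
            conn.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
            byte[] body = ("data=" + URLEncoder.encode(jsonPayload, "UTF-8"))
                    .getBytes(StandardCharsets.UTF_8);
            try (OutputStream out = conn.getOutputStream()) {
                out.write(body);
            }
            return conn.getResponseCode() == 200; // backend accepted the batch
        } catch (Exception e) {
            return false; // push failed; the caller may retry later
        }
    }
}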
Lately there was a bug in backend search: a new feature for filtering tweets had been added to scraping and indexing, but not to backend search. To fix this issue, I backtracked through the backend search codebase and fixed it.
Let us discuss how backend search works:
1) When a query is made from the search endpoint with:
a) source=all
When source is set to all, TwitterScraper and messages from the local search server are preferred first. If the scraped messages are not enough, or no output has been returned within a specific amount of time, backend search is initiated (see the sketch after this list).
b) source=backend
SearchServlet scrapes directly from the backend server.
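A condensed sketch of this dispatch logic follows. The helper names searchLocal, searchBackend and MINIMUM_HITS are hypothetical; only the two source values and the fallback behaviour come from the actual servlet:

import java.util.ArrayList;
import java.util.List;

public class SourceDispatchSketch {
    // Hypothetical threshold: how many messages count as "enough"
    static final int MINIMUM_HITS = 20;

    static List<String> dispatch(String source, String query) {
        if ("backend".equals(source)) {
            // source=backend: go straight to the backend peers
            return searchBackend(query);
        }
        // source=all: local scrape/index first, backend as fallback
        List<String> messages = searchLocal(query);
        if (messages.size() < MINIMUM_HITS) {
            messages.addAll(searchBackend(query));
        }
        return messages;
    }

    // Stubs standing in for the local and backend search paths
    static List<String> searchLocal(String query) { return new ArrayList<>(); }
    static List<String> searchBackend(String query) { return new ArrayList<>(); }
}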
2) Fetching data from Backend Server
The input parameters fetched from the client are fed into the DAO.searchBackend method, and the list of backend servers is fetched from the config file. Using these input parameters and backend servers, the required data is scraped and output to the client.
In the DAO.searchOnOtherPeers method, the request is sent to multiple servers, which are arranged in order of better response rates (a sketch of this ordering follows the code below). This method invokes the SearchServlet.search method to send requests to the mentioned servers.
List<String> remote = getBackendPeers();
if (remote.size() > 0) { // condition deactivated because we need always at least one peer
    // ask the backend peers and write the fetched messages into the local index
    Timeline tt = searchOnOtherPeers(remote, q, filterList, order, count,
            timezoneOffset, where, SearchServlet.backend_hash, timeout);
    if (tt != null) tt.writeToIndex();
    return tt;
}
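The "order of better response rates" can be pictured as a latency table that is updated after every request, so that the fastest-responding peer is asked first. The bookkeeping below is a hypothetical stand-in for what searchOnOtherPeers does internally:

import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class PeerOrderingSketch {
    // Hypothetical latency table: host -> last observed response time (ms)
    static final Map<String, Long> latency = new ConcurrentHashMap<>();

    // Sort peers so that the fastest-responding host is asked first;
    // unknown hosts sort last.
    static void orderByResponseRate(List<String> peers) {
        peers.sort(Comparator.comparingLong(
                host -> latency.getOrDefault(host, Long.MAX_VALUE)));
    }

    // Record a measurement after each request so the ordering adapts.
    static void recordResponse(String host, long millis) {
        latency.put(host, millis);
    }
}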
3) Creation of the request URL and sending requests
The request URL is created according to the input parameters passed to the SearchServlet.search method, and the request is then sent to the respective servers to fetch the required messages.
// URL creation
urlstring = protocolhostportstub + "/api/search.json?q="
        + URLEncoder.encode(query.replace(' ', '+'), "UTF-8")
        + "&timezoneOffset=" + timezoneOffset
        + "&maximumRecords=" + count
        + "&source=" + (source == null ? "all" : source)
        + "&minified=true&shortlink=false&timeout=" + timeout;
// append the filter parameter as a comma-separated list (a space inside
// the URL would invalidate it, so the filters are joined with "," only)
String filterString;
if (!"".equals(filterString = String.join(",", filterList))) {
    urlstring = urlstring + "&filter=" + filterString;
}

// Download data
byte[] jsonb = ClientConnection.downloadPeer(urlstring);
if (jsonb == null || jsonb.length == 0)
    throw new IOException("empty content from " + protocolhostportstub);
String jsons = UTF8.String(jsonb);
JSONObject json = new JSONObject(jsons);
if (json.length() == 0) return tl;

// Final data fetched to be returned
JSONArray statuses = json.getJSONArray("statuses");
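For completeness, here is one way the statuses array fetched above could be consumed. The "text" and "created_at" field names follow the search.json response format; the rest is a simplified illustration of the re-indexing that SearchServlet.search performs:

// Continuing from the snippet above: iterate the fetched statuses.
for (int i = 0; i < statuses.length(); i++) {
    JSONObject status = statuses.getJSONObject(i);
    String text = status.optString("text");            // the message body
    String createdAt = status.optString("created_at"); // creation timestamp
    // In the real code each entry becomes a message object that is
    // appended to the Timeline returned to the caller; printing the
    // fields stands in for that step here.
    System.out.println(createdAt + " | " + text);
}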