YoutubeScraper is one of the more interesting web scrapers in Loklak Server, with its own approach to scraping data and creating data keys (using RDF). It could not be accessed because it had no URL endpoint. I configured it to be reachable both through its own endpoint (/api/youtubescraper) and through the search endpoint (/api/search.json).
Usage:
- YoutubeScraper Endpoint: /api/youtubescraper
  Example: http://api.loklak.org/api/youtubescraper?query=https://www.youtube.com/watch?v=xZ-m55K3FhQ&scraper=youtube
- SearchServlet Endpoint: /api/search.json
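The example URL above passes a YouTube link, which itself contains `?` and `=`, as the value of the query parameter. A client that builds this request in code would usually percent-encode that value. The following is a minimal sketch of that idea. The class name LoklakRequestUrl and the method youtubeScraperUrl are hypothetical and not part of Loklak Server; the host, path and parameter names are taken from the example above. Whether the server requires the encoding is not stated here, so treat it as a defensive habit rather than a requirement.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical client-side helper, not part of Loklak Server.
final class LoklakRequestUrl {
    // Host and path taken from the usage example in this post.
    private static final String ENDPOINT = "http://api.loklak.org/api/youtubescraper";

    // Builds the request URL, percent-encoding the video link so that its
    // own '?' and '=' are not mistaken for separators of the outer query string.
    static String youtubeScraperUrl(String videoUrl) {
        String encoded = URLEncoder.encode(videoUrl, StandardCharsets.UTF_8);
        return ENDPOINT + "?query=" + encoded + "&scraper=youtube";
    }

    public static void main(String[] args) {
        System.out.println(youtubeScraperUrl("https://www.youtube.com/watch?v=xZ-m55K3FhQ"));
    }
}
```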
The configurations added to Loklak Server are:
1) Endpoint
YoutubeScraper can be accessed at the /api/youtubescraper endpoint. As with the other scrapers, I have used the BaseScraper class as its superclass.
2) PrepareSearchUrl
The prepareSearchUrl method creates the YouTube URL that is then scraped. YoutubeScraper takes a URL as input, but a YouTube link can also be a shortened link. That is why only the video ID is stored as the query and the page URL is rebuilt from it. Building the URL by type also leaves room for adding more scrapers later.
Currently YoutubeScraper scrapes only YouTube video pages, but scrapers for search pages and channel pages can also be added.
    URIBuilder url = null;
    String midUrl = "search/";
    try {
        switch (type) {
            case "search":
                midUrl = "search/";
                url = new URIBuilder(this.baseUrl + midUrl);
                url.addParameter("search_query", this.query);
                break;
            case "video":
                midUrl = "watch/";
                url = new URIBuilder(this.baseUrl + midUrl);
                url.addParameter("v", this.query);
                break;
            case "user":
                midUrl = "channel/";
                url = new URIBuilder(this.baseUrl + midUrl + this.query);
                break;
            default:
                url = new URIBuilder("");
                break;
        }
    } catch (URISyntaxException e) {
        // Log every input that went into the URL so a bad value is easy to spot.
        DAO.log("Invalid Url: baseUrl = " + this.baseUrl + ", mid-URL = " + midUrl
                + ", query = " + this.query + ", type = " + type);
        return "";
    }
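To make the mapping from type to URL easier to see, here is a stand-alone sketch that mirrors the switch above using only the JDK. It is an illustration, not the Loklak implementation: it uses string concatenation and URLEncoder instead of URIBuilder, the class name YoutubeUrlSketch is hypothetical, and the base URL "https://www.youtube.com/" is an assumption, since the value of baseUrl is not shown in this post.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-alone illustration of the type-to-URL mapping.
final class YoutubeUrlSketch {
    // Assumed base URL; the real value of baseUrl is not shown in the post.
    static final String BASE_URL = "https://www.youtube.com/";

    static String prepareSearchUrl(String type, String query) {
        switch (type) {
            case "search":
                // Search term goes into the search_query parameter.
                return BASE_URL + "search/?search_query=" + encode(query);
            case "video":
                // Video ID goes into the v parameter.
                return BASE_URL + "watch/?v=" + encode(query);
            case "user":
                // Channel identifier is appended to the path.
                return BASE_URL + "channel/" + query;
            default:
                // Unknown type: return an empty URL, as the original does.
                return "";
        }
    }

    private static String encode(String value) {
        return URLEncoder.encode(value, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(prepareSearchUrl("video", "xZ-m55K3FhQ"));
    }
}
```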
3) Get-Data-From-Connection
The getDataFromConnection method fetches a BufferedReader object and passes it to the scrape method. In YoutubeScraper, this method has been overridden to avoid the default implementation, which uses type=all.
    @Override
    public Post getDataFromConnection() throws IOException {
        String url = this.prepareSearchUrl(this.type);
        return getDataFromConnection(url, this.type);
    }
4) Set scraper parameters input as get-parameters
The scraper reads type and query from the Map of GET parameters. When the query is a URL, the video ID (its last 11 characters) is separated from the URL and then used as the query.
    this.query = this.getExtraValue("query");
    this.query = this.query.substring(this.query.length() - 11);
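Taking the last 11 characters works when the video ID ends the URL, as in the usage example, but a link with trailing parameters (for instance `&t=30s`) would yield the wrong substring. A possible, more defensive variant is sketched below. It is not what YoutubeScraper does: the class name VideoIdSketch is hypothetical, and it assumes an ID is exactly 11 characters drawn from letters, digits, `-` and `_`, following the video ID format reference listed at the end of this post.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Loklak Server.
final class VideoIdSketch {
    // Assumed ID shape: 11 characters from [A-Za-z0-9_-].
    private static final String ID = "[A-Za-z0-9_-]{11}";
    // Matches "?v=ID" or "&v=ID" in a watch link, or "youtu.be/ID" in a short link.
    // The lookahead rejects longer tokens that merely start with 11 valid characters.
    private static final Pattern IN_URL =
            Pattern.compile("(?:[?&]v=|youtu\\.be/)(" + ID + ")(?![A-Za-z0-9_-])");
    private static final Pattern BARE = Pattern.compile("^" + ID + "$");

    static Optional<String> extract(String input) {
        if (input == null) {
            return Optional.empty();
        }
        String s = input.trim();
        // The input may already be a bare video ID.
        if (BARE.matcher(s).matches()) {
            return Optional.of(s);
        }
        Matcher m = IN_URL.matcher(s);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(extract("https://www.youtube.com/watch?v=xZ-m55K3FhQ&t=30s"));
    }
}
```

Returning an Optional makes the "no ID found" case explicit, whereas substring on a query shorter than 11 characters would throw a StringIndexOutOfBoundsException.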
5) Scrape Method
The scrape method runs the individual scraper methods (in YoutubeScraper there is currently only one), collects their results in a post timeline, and wraps that timeline in a Post object for output. This simple structure can make it easier to extend the scraper to different page types and scrape them concurrently.
    Post out = new Post(true);
    Timeline2 postList = new Timeline2(this.order);
    postList.addPost(this.parseVideo(br, type, url));
    out.put("videos", postList.toArray());
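The collect-and-wrap pattern above can be sketched without the Loklak classes. In the illustration below, plain Java collections stand in for Post and Timeline2, and a map from page type to parser function stands in for the individual scraper methods. All names here (ScrapeDispatchSketch, PARSERS, the record fields) are hypothetical, and the parser is a placeholder that does no real HTML parsing.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical illustration; Map and List stand in for Post and Timeline2.
final class ScrapeDispatchSketch {
    // One parser per page type. Adding a page type means adding one entry here.
    private static final Map<String, Function<String, Map<String, String>>> PARSERS =
            new LinkedHashMap<>();

    static {
        // Placeholder parser: records only the page length instead of parsing HTML.
        PARSERS.put("video", page -> Map.of(
                "kind", "video",
                "length", String.valueOf(page.length())));
    }

    static Map<String, Object> scrape(String page, String type) {
        Map<String, Object> out = new LinkedHashMap<>();
        Function<String, Map<String, String>> parser = PARSERS.get(type);
        if (parser == null) {
            // No parser registered for this type: return an empty result.
            return out;
        }
        List<Map<String, String>> posts = new ArrayList<>();
        posts.add(parser.apply(page));
        // Results are stored under a plural key, e.g. "videos" for type "video".
        out.put(type + "s", posts);
        return out;
    }

    public static void main(String[] args) {
        System.out.println(scrape("<html></html>", "video").keySet());
    }
}
```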
References
- What is an RDF triple, explained on Stack Overflow: https://stackoverflow.com/questions/273218/whats-a-rdf-triple
- Tutorial on Scraping with Regular Expressions: http://stanford.edu/~mgorkove/cgi-bin/rpython_tutorials/Scraping_PDFsText_Files_in_Python_Using_Regular_Expressions.php
- Youtube Video-Id Format: https://webapps.stackexchange.com/questions/54443/format-for-id-of-youtube-video