Link Preview Service from SUSI Server
String url = post.get("url", "");
if (url == null || url.isEmpty()) {
    jsonObject.put("message", "URL Not given");
    jsonObject.put("accepted", false);
    return new ServiceResponse(jsonObject);
}
This API endpoint accepts a single GET parameter, url, which is the URL whose preview is to be shown.
Here we also check whether the parameter is missing or empty. If that is the case, we return an error message to the user.
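The check in the servlet covers a missing or empty parameter. A stricter check that also rejects malformed URLs could look roughly like the sketch below. It uses only java.net.URI; the class and method names are hypothetical and not part of SUSI Server, and restricting input to http(s) URLs is an assumption made for this example.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper, not part of SUSI Server: rejects missing, empty,
// or malformed values of the "url" parameter before any scraping happens.
final class UrlParamCheck {

    static boolean isUsableUrl(String url) {
        if (url == null || url.trim().isEmpty()) {
            return false; // the case the servlet reports as "URL Not given"
        }
        try {
            URI uri = new URI(url.trim());
            String scheme = uri.getScheme();
            // Assumption: only absolute http(s) URLs with a host are worth scraping.
            return uri.getHost() != null
                    && ("http".equalsIgnoreCase(scheme) || "https".equalsIgnoreCase(scheme));
        } catch (URISyntaxException e) {
            return false; // not parseable as a URI at all
        }
    }

    public static void main(String[] args) {
        System.out.println(isUsableUrl("https://jsoup.org/"));
        System.out.println(isUsableUrl("not a url"));
    }
}
```

In the servlet, a false result from such a helper could be handled the same way as the empty case, by setting "accepted" to false and returning early.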
SourceContent sourceContent = TextCrawler.scrape(url, 3);
// Guard against an empty list as well as null before taking the first image
if (sourceContent.getImages() != null && !sourceContent.getImages().isEmpty())
    jsonObject.put("image", sourceContent.getImages().get(0));
if (sourceContent.getDescription() != null)
    jsonObject.put("descriptionShort", sourceContent.getDescription());
if (sourceContent.getTitle() != null)
    jsonObject.put("title", sourceContent.getTitle());
jsonObject.put("accepted", true);
return new ServiceResponse(jsonObject);
}
The TextCrawler.scrape method accepts two parameters. One is the URL of the website to be scraped for the preview data and the other is the depth. The returned SourceContent object has built-in methods for getting the images, description and title. Here we just call those methods and put the results in our JSON object.
private String htmlDecode(String content) {
    return Jsoup.parse(content).text();
}
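TextCrawler is based on Jsoup, a Java library used to parse and scrape HTML pages.
To get readable values out of the scraped markup, the HTML content has to be decoded to plain text. The htmlDecode helper does this by parsing the content with Jsoup and returning its text, which strips the tags and decodes HTML entities.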
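public List<String> getImages(Document document, int imageQuantity) {
    List<String> matches = new ArrayList<>();
    // Every element carrying a src attribute (img, script, iframe, ...)
    Elements media = document.select("[src]");
    for (Element srcElement : media) {
        // Stop once the requested number of images has been collected
        if (matches.size() >= imageQuantity) break;
        if (srcElement.tagName().equals("img")) {
            // abs:src resolves the attribute against the document's base URI
            matches.add(srcElement.attr("abs:src"));
        }
    }
    return matches;
}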
The getImages method takes the HTML document parsed by Jsoup and finds the image tags in it. Since the function is given the imageQuantity parameter, it returns the src attribute of the first n images it finds.
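The abs: prefix in attr("abs:src") asks Jsoup to resolve the attribute against the document's base URI, so relative image paths come back as absolute URLs. The sketch below illustrates that resolution step, together with the "first n" limit, using only java.net.URI. It is not Jsoup's own code; the class name, method name and example URLs are hypothetical.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of what resolving "abs:src" amounts to:
// each src value is resolved against the page's base URI.
final class AbsoluteSrcSketch {

    static List<String> resolveFirst(String baseUri, List<String> srcValues, int imageQuantity) {
        URI base = URI.create(baseUri);
        List<String> resolved = new ArrayList<>();
        for (String src : srcValues) {
            if (resolved.size() >= imageQuantity) {
                break; // keep only the first n images
            }
            // Relative paths are merged with the base; absolute URLs pass through.
            // Note: URI.resolve throws IllegalArgumentException for invalid src values.
            resolved.add(base.resolve(src).toString());
        }
        return resolved;
    }

    public static void main(String[] args) {
        List<String> srcs = new ArrayList<>();
        srcs.add("/img/a.png");
        srcs.add("b.png");
        srcs.add("https://cdn.example.com/c.png");
        System.out.println(resolveFirst("https://example.com/blog/post.html", srcs, 2));
    }
}
```

For a page at https://example.com/blog/post.html, the value /img/a.png resolves to https://example.com/img/a.png and b.png resolves to https://example.com/blog/b.png.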
This API endpoint can be tried on a locally running server at
http://127.0.0.1:4000/susi/linkPreview.json?url=<ANY URL>
A real working example of this endpoint would be http://api.susi.ai/susi/linkPreview.json?url=https://techcrunch.com/2017/07/23/dear-tech-dudes-stop-being-such-idiots-about-women/
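When the target URL itself contains a query string or other reserved characters, it is safer to percent-encode it before appending it as the url parameter. A minimal sketch of building such a request URL follows; the class and method names are hypothetical, and only the /susi/linkPreview.json path and the url parameter come from the endpoint described above.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical client-side helper: builds the request URL for the
// link preview endpoint with the target URL safely encoded.
final class LinkPreviewRequest {

    static String buildRequestUrl(String serverBase, String targetUrl) {
        try {
            // Encode the target so its own "?", "&" and "=" do not leak into our query string
            String encoded = URLEncoder.encode(targetUrl, "UTF-8");
            return serverBase + "/susi/linkPreview.json?url=" + encoded;
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on the JVM, so this should not happen
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(buildRequestUrl("http://127.0.0.1:4000", "https://jsoup.org/apidocs/"));
    }
}
```

The resulting string can be opened in a browser or fetched with any HTTP client; the server is expected to decode the parameter before scraping.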
Resources:
Web Crawlers: https://www.promptcloud.com/data-scraping-vs-data-crawling/
JSoup: https://jsoup.org/
JSoup Api Docs: https://jsoup.org/apidocs/
Parsing HTML with JSoup: http://www.baeldung.com/java-with-jsoup