Link Preview Service from SUSI Server

 SUSI Webchat, SUSI Android app, SUSI iOS app are various SUSI clients which depend on response from SUSI Server. The most common response of SUSI Server is in form of links. Clients usually need to show the preview of the links to the user. This preview may include featured image, description and the title of the link.  Clients show this information by using various 3rd party APIs and libraries. We planned to create an API endpoint for this on SUSI Server to give the preview of the link. This service is called LinkPreviewService.
String url = post.get("url", "");
        if(url==null || url.isEmpty()){
            jsonObject.put("message","URL Not given");
            jsonObject.put("accepted",false);
            return new ServiceResponse(jsonObject);
        }

This API Endpoint accept only 1 get parameter which is the URL whose preview is to be shown.

Here we also check if no parameter or wrong URL parameter was sent. If that was the the case then we return an error message to the user.

 SourceContent sourceContent =     TextCrawler.scrape(url,3);
        if (sourceContent.getImages() != null) jsonObject.put("image", sourceContent.getImages().get(0));
        if (sourceContent.getDescription() != null) jsonObject.put("descriptionShort", sourceContent.getDescription());
        if(sourceContent.getTitle()!=null)jsonObject.put("title", sourceContent.getTitle());
        jsonObject.put("accepted",true);
        return new ServiceResponse(jsonObject);
    }

The TextCrawler function accept two parameters. One is the url of the website which is to be scraped for the preview data and the other is depth. To get the images, description and title there are methods built in. Here we just call those methods and set them in our JSON Object.

 private String htmlDecode(String content) {
        return Jsoup.parse(content).text();
    }

Text Crawler is based on Jsoup. Jsoup is a java library that is used to scrape HTML pages.

To get anything from Jsoup we need to decode the content of HTML to Text.

public List<String> getImages(Document document, int imageQuantity) {
        Elements media = document.select("[src]");
        while(var5.hasNext()) {
            Element srcElement = (Element)var5.next();
            if(srcElement.tagName().equals("img")) {
                ((List)matches).add(srcElement.attr("abs:src"));
            }
        }

 The getImages method takes the HTML document from the JSoup and find the image tags in that. We have given the imageQuantity parameter in the function, so accordingly it returns the src attribute of the first n images it find.

This API Endpoint can be seen working on

http://127.0.0.1:4000/susi/linkPreview.json?url=<ANY URL>

A real working example of this endpoint would be http://api.susi.ai/susi/linkPreview.json?url=https://techcrunch.com/2017/07/23/dear-tech-dudes-stop-being-such-idiots-about-women/

Resources:

Web Crawlers: https://www.promptcloud.com/data-scraping-vs-data-crawling/

JSoup: https://jsoup.org/

JSoup Api Docs: https://jsoup.org/apidocs/

Parsing HTML with JSoup: http://www.baeldung.com/java-with-jsoup