Link Preview Service from SUSI Server

 SUSI Webchat, SUSI Android app, SUSI iOS app are various SUSI clients which depend on response from SUSI Server. The most common response of SUSI Server is in form of links. Clients usually need to show the preview of the links to the user. This preview may include featured image, description and the title of the link.  Clients show this information by using various 3rd party APIs and libraries. We planned to create an API endpoint for this on SUSI Server to give the preview of the link. This service is called LinkPreviewService.
String url = post.get("url", "");
        if(url==null || url.isEmpty()){
            jsonObject.put("message","URL Not given");
            jsonObject.put("accepted",false);
            return new ServiceResponse(jsonObject);
        }

This API Endpoint accept only 1 get parameter which is the URL whose preview is to be shown.

Here we also check if no parameter or wrong URL parameter was sent. If that was the the case then we return an error message to the user.

 SourceContent sourceContent =     TextCrawler.scrape(url,3);
        if (sourceContent.getImages() != null) jsonObject.put("image", sourceContent.getImages().get(0));
        if (sourceContent.getDescription() != null) jsonObject.put("descriptionShort", sourceContent.getDescription());
        if(sourceContent.getTitle()!=null)jsonObject.put("title", sourceContent.getTitle());
        jsonObject.put("accepted",true);
        return new ServiceResponse(jsonObject);
    }

The TextCrawler function accept two parameters. One is the url of the website which is to be scraped for the preview data and the other is depth. To get the images, description and title there are methods built in. Here we just call those methods and set them in our JSON Object.

 private String htmlDecode(String content) {
        return Jsoup.parse(content).text();
    }

Text Crawler is based on Jsoup. Jsoup is a java library that is used to scrape HTML pages.

To get anything from Jsoup we need to decode the content of HTML to Text.

public List<String> getImages(Document document, int imageQuantity) {
        Elements media = document.select("[src]");
        while(var5.hasNext()) {
            Element srcElement = (Element)var5.next();
            if(srcElement.tagName().equals("img")) {
                ((List)matches).add(srcElement.attr("abs:src"));
            }
        }

 The getImages method takes the HTML document from the JSoup and find the image tags in that. We have given the imageQuantity parameter in the function, so accordingly it returns the src attribute of the first n images it find.

This API Endpoint can be seen working on

http://127.0.0.1:4000/susi/linkPreview.json?url=<ANY URL>

A real working example of this endpoint would be http://api.susi.ai/susi/linkPreview.json?url=https://techcrunch.com/2017/07/23/dear-tech-dudes-stop-being-such-idiots-about-women/

Resources:

Web Crawlers: https://www.promptcloud.com/data-scraping-vs-data-crawling/

JSoup: https://jsoup.org/

JSoup Api Docs: https://jsoup.org/apidocs/

Parsing HTML with JSoup: http://www.baeldung.com/java-with-jsoup

Continue Reading

Create Scraper in Javascript for Loklak Scraper JS

Loklak Scraper JS is the latest repository in Loklak project. It is one of the interesting projects because of expected benefits of Javascript in web scraping. It has a Node Javascript engine and is used in Loklak Wok project as bundled package. It has potential to be used in different repositories and enhance them.

Scraping in Python is easy (at least for Pythonistas) as one needs to just import Request library and BeautifulSoup library (lxml as better option), write some lines of code using Request library to get webpage and some lines of bs4 to walk through html and scrape data. This sums up to about less than a hundred lines of coding, where as Javascript coding isn’t easily readable (at least to me) as compared to Python. But it has an advantage, it can easily deal with Javascript in the pages we are scraping. This is one of the motive, Loklak Scraper JS repository was created and we contributed and worked on it.

I recently coded a Javascript scraper in loklak_scraper_js repository. While coding, I found it’s libraries similar to the libraries, I use to code in Python. Therefore, this blog is for Pythonistas how they can start scraping in Javascript as they finish reading and also contribute to Loklak Scraper JS.

First, replace Python interpreter, Request and Beautifulsoup library with Node JS interpreter, Request and Cheerio JS library.

1) Node JS Interpreter: Node JS Interpreter is used to interpret Javascript files. This is different from Python as it deals with the project instead of a module in case of Python. The most compatible Node for most of the libraries is 6.0.0 , where as latest version available(as I checked) is 8.0.0

TIP: use `–save` with npm like here while installing a library.

2) Request Library :- This is used to load webpage to be processed. Similar to one in Python.

Request-promise library, a wrapper around Request with implementation of Bluebird library, improves readability and makes code cleaner (how?).

 

3) Cheerio Library:- A Pythonista (a rookie one) can call it twin of BeautifulSoup Library. But this is faster and is Javascript. It’s selector implementation is nearly identical to jQuery’s.

Let us code a basic Javascript scraper. I will take TimeAndDate scraper from loklak_scraper_js as example here. It inputs place and outputs its local time.

Step#1: fetching HTML from webpage with the help of Request library.

We input url to Request function to fetch the webpage and is saved to `html` variable. This scrapeTimeAndDate() function scrapes data from html

url = "http://www.timeanddate.com/worldclock/results.html?query=London";

request(url, function(error, response, body) {

 if(error) {

    console.log("Error: " + error);

    process.exit(-1);

 }

 html = body;

 scrapeTimeAndDate()

});

 

Step#2: To scrape important data from html using Cheerio JS

list of date and time of locations is embedded in table tag, So we will iterate through <td> and extract text.

  1. a) Load html to Cheerio as we do in beautifulsoup

In Python

soup = BeautifulSoup(html,'html5lib')

 

In Cheerio JS

$ = cheerio.load(html);

 

  1. b) This line finds first tr tag in table tag.

var htmlTime = $("table").find('tr');

 

  1. c) Iterate through td tags data by using each() function. This function acts as loop (in Python) iterating through list of elements in which data will be extracted.

htmlTime.each(function (index, element) {      

  // in python, we will use loop, `for element from elements:`

  tag = $(element).find("td");    // in python, `tag = soup.find_all('td')`

  if( tag.text() != "") {

    .

    .

    //EXTRACT DATA

    .

    .

  } else {

    //go to next td tag

    tag = tag.next();

  }

}

 

  1. d) To extract data

Cheerio JS loads html and uses DOM model traverse through. DOM model considers html is tree. So, go to the tag, and scrape data you want.

//extract location(text) enclosed in tag

location = tag.text();

//go to next tag

tag = tag.next();

//extract time(text) enclosed in tag

time = tag.text();

//save in dictionary like in python

loc_list["location"] = location;

loc_list["time"] = time;

 

Some other useful functions:-

1) $(selector, [context], [root])

returns object of selector(any tag) with class or id inside root

2) $(“table”).attr(name, value)

To get tag object having attribute having `value`

3) obj.html()

To get html enclosed in tags

For more just drop in here

Step#3: Execute scraper using command

node <scrapername>.js

 

Hoping that this blog is able to  how to scrape in Javascript by finding similarities with Python.

Resources:

Continue Reading
Close Menu