Using LokLak to Scrape Profiles from Quora, GitHub, Weibo and Instagram

Many of us are curious about other people's presence on social media. With that in mind, LokLak includes several profile scrapers, which report details such as a user's posts and followers. The scrapers currently available include Quora Profile, GitHub Profile, Weibo Profile and Instagram Profile.

How do the scrapers work?

LokLak's scrapers are written in Java and return the profile scraped from each of the websites above as a JSON object. This post gives an overview of how one of them, the GitHub Profile scraper API, works.

With the GitHub profile scraper you can look up a profile without logging in and retrieve its details, such as followers, repositories and gists.

The query takes a single profile parameter:

To scrape individual profiles:

https://loklak.org/api/githubprofilescraper.json?profile=kavithaenair

To scrape organization profiles:

https://loklak.org/api/githubprofilescraper.json?profile=fossasia
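
As a minimal sketch of how a client might build these query URLs, the following Java helper URL-encodes the profile name and appends it to the endpoint shown above. The names ProfileQuery and buildUrl are hypothetical and not part of LokLak; only the endpoint itself comes from the example queries.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

final class ProfileQuery {
    // Endpoint taken from the example queries above.
    static final String ENDPOINT = "https://loklak.org/api/githubprofilescraper.json";

    // Hypothetical helper: builds the query URL for a user or organization name.
    static String buildUrl(String profile) {
        if (profile == null || profile.trim().isEmpty()) {
            throw new IllegalArgumentException("profile must not be empty");
        }
        // Encode the name so unusual characters cannot break the query string.
        String encoded = URLEncoder.encode(profile.trim(), StandardCharsets.UTF_8);
        return ENDPOINT + "?profile=" + encoded;
    }

    public static void main(String[] args) {
        System.out.println(buildUrl("kavithaenair"));
        System.out.println(buildUrl("fossasia"));
    }
}
```

The same helper covers both individual and organization profiles, because the API uses one parameter for both.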

Jsoup is a Java library for web scraping, and one of the easiest ways for Java developers to do it. It extracts and manipulates data using DOM traversal and CSS-like selector methods. Here, Jsoup fetches the HTML of the profile page, and the scraper then uses the tags and attributes in that HTML to pick out the data it needs.

How do we get the matching elements?

We use methods such as getElementsByAttributeValueContaining() of the org.jsoup.nodes.Element class to find the matching elements. For instance, the email is read from the element whose itemprop attribute contains "email", and the result is discarded if it does not contain an "@":

String email = html.getElementsByAttributeValueContaining("itemprop", "email").text();
if (!email.contains("@"))
    email = "";
githubProfile.put("email", email);
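
The check above can be isolated into a small helper to make its intent explicit: any scraped text without an "@" is treated as "no public email". This is only a sketch. EmailField and sanitize are hypothetical names, and the null check and trimming are additions that the original snippet does not perform.

```java
final class EmailField {
    // Returns the scraped text if it looks like an email address, otherwise "".
    static String sanitize(String scraped) {
        if (scraped == null) {
            return "";
        }
        String candidate = scraped.trim();
        // Same rule as the scraper: no "@" means the text is not an email.
        return candidate.contains("@") ? candidate : "";
    }
}
```

Note that contains("@") is a loose filter rather than real address validation; it only rejects text that cannot possibly be an email.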

Code:

Here is the Java code that fetches the page and extracts the data:

Fetching the HTML page:

html = Jsoup.connect("https://github.com/" + profile).get();

Extracting the data for an individual user:

/* If individual */
if (html.getElementsByAttributeValueContaining("class", "user-profile-nav").size() != 0) {
    scrapeGithubUser(githubProfile, terms, profile, html);
}
if (terms.contains("gists") || terms.contains("all")) {
    String gistsUrl = GITHUB_API_BASE + profile + GISTS_ENDPOINT;
    JSONArray gists = getDataFromApi(gistsUrl);
    githubProfile.put("gists", gists);
}
if (terms.contains("subscriptions") || terms.contains("all")) {
    String subscriptionsUrl = GITHUB_API_BASE + profile + SUBSCRIPTIONS_ENDPOINT;
    JSONArray subscriptions = getDataFromApi(subscriptionsUrl);
    githubProfile.put("subscriptions", subscriptions);
}
if (terms.contains("repos") || terms.contains("all")) {
    String reposUrl = GITHUB_API_BASE + profile + REPOS_ENDPOINT;
    JSONArray repos = getDataFromApi(reposUrl);
    githubProfile.put("repos", repos);
}
if (terms.contains("events") || terms.contains("all")) {
    String eventsUrl = GITHUB_API_BASE + profile + EVENTS_ENDPOINT;
    JSONArray events = getDataFromApi(eventsUrl);
    githubProfile.put("events", events);
}
if (terms.contains("received_events") || terms.contains("all")) {
    String receivedEventsUrl = GITHUB_API_BASE + profile + RECEIVED_EVENTS_ENDPOINT;
    JSONArray receivedEvents = getDataFromApi(receivedEventsUrl);
    githubProfile.put("received_events", receivedEvents);
}
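
The repeated if blocks above all follow one pattern: check whether a term was requested, then build a URL from the base, the profile name and an endpoint suffix. The sketch below expresses that pattern as a loop over the known terms. The value of GITHUB_API_BASE and the "/term" suffixes are assumptions inferred from the URLs in the sample output (for example https://api.github.com/users/kavithaenair/gists), and EndpointUrls and requestedUrls are hypothetical names, not LokLak code. It only builds the URLs and does not fetch them.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class EndpointUrls {
    // Assumed value, inferred from the URLs in the sample output.
    static final String GITHUB_API_BASE = "https://api.github.com/users/";

    // Terms in the order the scraper checks them; each is assumed to map to "/<term>".
    static final List<String> TERMS =
            List.of("gists", "subscriptions", "repos", "events", "received_events");

    // Returns term -> API URL for every requested term ("all" selects every term).
    static Map<String, String> requestedUrls(String profile, List<String> terms) {
        Map<String, String> urls = new LinkedHashMap<>();
        boolean all = terms.contains("all");
        for (String term : TERMS) {
            if (all || terms.contains(term)) {
                urls.put(term, GITHUB_API_BASE + profile + "/" + term);
            }
        }
        return urls;
    }
}
```

For example, requestedUrls("kavithaenair", List.of("gists", "repos")) yields two entries, while passing List.of("all") yields one entry per known term.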

Extracting the data for an organization:

/*If organization*/
if (html.getElementsByAttributeValue("class", "orgnav").size() != 0) {
    scrapeGithubOrg(profile, githubProfile, html);
}

And this is the sample output:

For the query https://loklak.org/api/githubprofilescraper.json?profile=kavithaenair:
 
{
  "data": [{
    "joining_date": "2016-04-12",
    "gists_url": "https://api.github.com/users/kavithaenair/gists",
    "repos_url": "https://api.github.com/users/kavithaenair/repos",
    "user_name": "kavithaenair",
    "bio": "GSoC'17 @loklak @fossasia ; Developer @fossasia ; Intern @amazon",
    "subscriptions_url": "https://api.github.com/users/kavithaenair/subscriptions",
    "received_events_url": "https://api.github.com/users/kavithaenair/received_events",
    "full_name": "Kavitha E Nair",
    "avatar_url": "https://avatars0.githubusercontent.com/u/18421291",
    "user_id": "18421291",
    "events_url": "https://api.github.com/users/kavithaenair/events",
    "organizations": [
      {
        "img_link": "https://avatars1.githubusercontent.com/u/6295529?v=3&s=70",
        "link": "https://github.com/fossasia",
        "label": "fossasia",
        "img_Alt": "@fossasia"
      },
      {
        "img_link": "https://avatars0.githubusercontent.com/u/10620750?v=3&s=70",
        "link": "https://github.com/coala",
        "label": "coala",
        "img_Alt": "@coala"
      },
      {
        "img_link": "https://avatars1.githubusercontent.com/u/11370631?v=3&s=70",
        "link": "https://github.com/loklak",
        "label": "loklak",
        "img_Alt": "@loklak"
      },
      {
        "img_link": "https://avatars2.githubusercontent.com/u/24720168?v=3&s=70",
        "link": "https://github.com/bvrit-wise-django-team",
        "label": "bvrit-wise-django-team",
        "img_Alt": "@bvrit-wise-django-team"
      }
    ],
    "home_location": "\n    Hyderabad, India\n",
    "works_for": "",
    "special_link": "https://www.overleaf.com/read/ftnvcphnwzhp",
    "email": "",
    "atom_feed_link": "https://github.com/kavithaenair.atom"
  }],
  "metadata": {"count": 1},
  "session": {"identity": {
    "type": "host",
    "name": "162.158.46.18",
    "anonymous": true
  }}
}

Published by

Kavitha Nair

Open Source Enthusiast and Wanderlust.