Enhancing Github profile scraper service in loklak

The Github profile scraper is one of the several scrapers present in the loklak project, some of the other scrapers being Quora profile scraper, WordPress profile scraper, Instagram profile scraper, etc.Github profile scraper scrapes the profile of a given github user and provides the data in json format. The scraped data contains user_id, followers, users following the given user, starred repositories of an user, basic information like user’s name, bio and much more. It uses the popular Java Jsoup library for scraping.

The entire source code for the scraper can be found here.

Changes made

The scraper previously provided very limited information like, name of the user, description, starred repository url, followers url, following url, user_id, etc. One of the major problem was the api accepted only the profile name as a parameter and it returned the entire data, that is, the scraper scraped all the data even if the user did not ask for it. Moreover the service provided certain urls as data for example starred repository url, followers url, following url instead of providing the actual data present at those urls.

The scraper contained only one big function where all the scraping was performed. The code was not modular.

The scraper has been enhanced in the ways mentioned below.

  • The entire code has been refactored.Code scraping user specific data has been separated from code scraping organization specific data. Also separate method has been created for accessing the Github API.
  • Apart from profile parameter, the api now accepts another parameter called terms. It is a list of fields the user wants information on. In this way the scraper scrapes only that much data which is required by the user. This allows better response time and prevents unnecessary scraping.
  • The scraper now provides more information like user gists, subscriptions, events, received_events and repositories.

Code refactoring

The code has been refactored to smaller methods. ‘getDataFromApi’ method has been designed to access Github API. All the other methods which want to make request to api.github.com/ now calls ‘getDatFromApi’ method with required parameter. The method is shown below.

private static JSONArray getDataFromApi(String url) {
       URI uri = null;
       try {
           uri = new URI(url);
       } catch (URISyntaxException e1) {
           e1.printStackTrace();
       }

       JSONTokener tokener = null;
       try {
           tokener = new JSONTokener(uri.toURL().openStream());
       } catch (Exception e1) {
           e1.printStackTrace();
       }
       JSONArray arr = new JSONArray(tokener);
       return arr;
   }

 

For example if we want to make a request to the endpoint https://api.github.com/users/djmgit/followers then we can use the above method and we will get a JSONArray in return.

All the code which scrapes user related data has been moved to ‘scrapeGithubUser’ method. This method scrapes basic user information like full name of the user, bio, user name, atom feed link, and location.

String fullName = html.getElementsByAttributeValueContaining("class", "vcard-fullname").text();
githubProfile.put("full_name", fullName);
 
String userName = html.getElementsByAttributeValueContaining("class", "vcard-username").text();
githubProfile.put("user_name", userName);
 
String bio = html.getElementsByAttributeValueContaining("class", "user-profile-bio").text();
githubProfile.put("bio", bio);
 
String atomFeedLink = html.getElementsByAttributeValueContaining("type", "application/atom+xml").attr("href");
githubProfile.put("atom_feed_link", "https://github.com" + atomFeedLink);
 
String worksFor = html.getElementsByAttributeValueContaining("itemprop", "worksFor").text();
githubProfile.put("works_for", worksFor);
 
String homeLocation = html.getElementsByAttributeValueContaining("itemprop", "homeLocation").attr("title");
githubProfile.put("home_location", homeLocation);

 

Next it returns other informations like starred repositories information, followers information, following information and information about the organizations the user belongs to but before fetching such data it checks whether such data is really needed. In this way unnecessary api calls are prevented.

if (terms.contains("starred") || terms.contains("all")) {
            String starredUrl = GITHUB_API_BASE + profile + STARRED_ENDPOINT;
            JSONArray starredData = getDataFromApi(starredUrl);
            githubProfile.put("starred_data", starredData);
 
            int starred = Integer.parseInt(html.getElementsByAttributeValue("class", "Counter").get(1).text());
            githubProfile.put("starred", starred);
        }
 
        if (terms.contains("follows") || terms.contains("all")) {
            String followersUrl = GITHUB_API_BASE + profile + FOLLOWERS_ENDPOINT;
            JSONArray followersData = getDataFromApi(followersUrl);
            githubProfile.put("followers_data", followersData);
 
            int followers = Integer.parseInt(html.getElementsByAttributeValue("class", "Counter").get(2).text());
            githubProfile.put("followers", followers);
        }
 
        if (terms.contains("following") || terms.contains("all")) {
            String followingUrl = GITHUB_API_BASE + profile + FOLLOWING_ENDPOINT;
            JSONArray followingData = getDataFromApi(followingUrl);
            githubProfile.put("following_data", followingData);
 
            int following = Integer.parseInt(html.getElementsByAttributeValue("class", "Counter").get(3).text());
            githubProfile.put("following", following);
        }
 
        if (terms.contains("organizations") || terms.contains("all")) {
            JSONArray organizations = new JSONArray();
            Elements orgs = html.getElementsByAttributeValue("itemprop", "follows");
            for (Element e : orgs) {
                JSONObject obj = new JSONObject();
 
                String label = e.attr("aria-label");
                obj.put("label", label);
 
                String link = e.attr("href");
                obj.put("link", "https://github.com" + link);
 
                String imgLink = e.children().attr("src");
                obj.put("img_link", imgLink);
 
                String imgAlt = e.children().attr("alt");
                obj.put("img_Alt", imgAlt);
 
                organizations.put(obj);
            }
            githubProfile.put("organizations", organizations);
        }

 

Similarly ‘scrapeGithubOrg’ is used to scrape information related to Github organization.

API usage

The API can be used in the ways shown below.

  • api/githubprofilescraper.json?profile=<profile_name> : This is equivalent to api/githubprofilescraper.json?profile=<profile_name>&terms=all :This will return all the data the scraper is currently capable of collecting.
  • api/githubprofilescraper.json?profile=<profile_name>&terms=<list_containing_required_fields> : This will return the basic information about a profile and data on selective fields as mentioned by the user.

For example – api/githubprofilescraper.json?profile=djmgit&terms=followers,following,organizations

The above request will return data about followers, following, and organizations apart from basic profile information. Thus the scraper will scrape only those data that is required by the user.

Future scope

‘scrapeGithubUser’ is still lengthy and it can be further broken down into smaller methods. The part of the method which scrapes basic user information like name, bio, etc can be moved to a separate method. This will further increase code readability and maintainability.

Now we have a much more functional and enhanced Github profile scraper which provides more data than the previous one.

Deepjyoti Mondal

A web and mobile application developer and an enthusiastic learner. I like trying out new technologies. JavaScript and Python are my favorites

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.