URL Unshortening in Java for loklak server

There are many URL shortening services on the internet. They are useful in converting really long URLs to shorter ones. But apart from redirecting to a longer URL, they are often used to track the people visiting those links.

One of the components of loklak server is its URL unshortening and redirect resolution service, which ensures that websites can’t track the users using those links and enhances the protection of privacy. How this service works in loklak.

Redirect Codes in HTTP

Various standards define 3XX status codes as an indication that the client must perform additional actions to complete the request. These response codes range from 300 to 308, based on the type of redirection.

To check the redirect code of a request, we must first make a request to some URL –

String urlstring = "http://tinyurl.com/8kmfp";
HttpRequestBase req = new HttpGet(urlstring);

Next, we will configure this request to disable redirect and add a nice Use-Agent so that websites do not block us as a robot –

req.setConfig(RequestConfig.custom().setRedirectsEnabled(false).build());
req.setHeader("User-Agent", "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.98 Safari/537.36");

Now we need a HTTP client to execute this request. Here, we will use Apache’s CloseableHttpClient

CloseableHttpClient httpClient = HttpClients.custom()
                                   .setConnectionManager(getConnctionManager(true))
                                   .setDefaultRequestConfig(defaultRequestConfig)
                                   .build();

The getConnctionManager returns a pooling connection manager that can reuse the existing TCP connections, making the requests very fast. It is defined in org.loklak.http.ClientConnection.

Now we have a client and a request. Let’s make our client execute the request and we shall get an HTTP entity on which we can work.

HttpResponse httpResponse = httpClient.execute(req);
HttpEntity httpEntity = httpResponse.getEntity();

Now that we have executed the request, we can check the status code of the response by calling the corresponding method –

if (httpEntity != null) {
   int httpStatusCode = httpResponse.getStatusLine().getStatusCode();
   System.out.println("Status code - " + httpStatusCode);
} else {
   System.out.println("Request failed");
}

Hence, we have the HTTP code for the requests we make.

Getting the Redirect URL

We can simply check for the value of the status code and decide whether we have a redirect or not. In the case of a redirect, we can check for the “Location” header to know where it redirects.

if (300 <= httpStatusCode && httpStatusCode <= 308) {
   for (Header header: httpResponse.getAllHeaders()) {
       if (header.getName().equalsIgnoreCase("location")) {
           redirectURL = header.getValue();
       }
   }
}

Handling Multiple Redirects

We now know how to get the redirect for a URL. But in many cases, the URLs redirect multiple times before reaching a final, stable location. To handle these situations, we can repeatedly fetch redirect URL for intermediate links until we saturate. But we also need to take care of cyclic redirects so we set a threshold on the number of redirects that we have undergone –

String urlstring = "http://tinyurl.com/8kmfp";
int termination = 10;
while (termination-- > 0) {
   String unshortened = getRedirect(urlstring);
   if (unshortened.equals(urlstring)) {
       return urlstring;
   }
   urlstring = unshortened;
}

Here, getRedirect is the method which performs single redirect for a URL and returns the same URL in case of non-redirect status code.

Redirect with non-3XX HTTP status – meta refresh

In addition to performing redirects through 3XX codes, some websites also contain a <meta http-equiv=”refresh” … > which performs an unconditional redirect from the client side. To detect these types of redirects, we need to look into the HTML content of a response and parse the URL from it. Let us see how –

String getMetaRedirectURL(HttpEntity httpEntity) throws IOException {
   StringBuilder sb = new StringBuilder();
   BufferedReader reader = new BufferedReader(new InputStreamReader(httpEntity.getContent()));
   String content = null;
   while ((content = reader.readLine()) != null) {
       sb.append(content);
   }
   String html = sb.toString();
   html = html.replace("\n", "");
   if (html.length() == 0)
       return null;
   int indexHttpEquiv = html.toLowerCase().indexOf("http-equiv=\"refresh\"");
   if (indexHttpEquiv < 0) {
       return null;
   }
   html = html.substring(indexHttpEquiv);
   int indexContent = html.toLowerCase().indexOf("content=");
   if (indexContent < 0) {
       return null;
   }
   html = html.substring(indexContent);
   int indexURLStart = html.toLowerCase().indexOf(";url=");
   if (indexURLStart < 0) {
       return null;
   }
   html = html.substring(indexURLStart + 5);
   int indexURLEnd = html.toLowerCase().indexOf("\"");
   if (indexURLEnd < 0) {
       return null;
   }
   return html.substring(0, indexURLEnd);
}

This method tries to find the URL from meta tag and returns null if it is not found. This can be called in case of non-redirect status code as a last attempt to fetch the URL –

String getRedirect(String urlstring) throws IOException {
   ...
   if (300 <= httpStatusCode && httpStatusCode <= 308) {
       ...
   } else {
       String metaURL = getMetaRedirectURL(httpEntity);
       EntityUtils.consumeQuietly(httpEntity);
       if (metaURL != null) {
           if (!metaURL.startsWith("http")) {
               URL u = new URL(new URL(urlstring), metaURL);
               return u.toString();
           }
           return metaURL;
       }
   return urlstring;
   }
   ...
}

In this implementation, we can see that there is a check for metaURL starting with http because there may be relative URLs in the meta tag. The java.net.URL library is used to create a final URL string from the relative URL. It can handle all the possibilities of a valid relative URL.

Conclusion

This blog post explains about the resolution of shortened and redirected URLs in Java. It explains about defining requests, executing them using an HTTP client and processing the resultant response to get a redirect URL. It also explains about how to perform these operations repeatedly to process multiple shortenings/redirects and finally to fetch the redirect URL from meta tag.

Loklak uses an inbuilt URL shortener to resolve redirects for a URL. If you find this blog post interesting, please take a look at the URL shortening service of loklak.

Published by

Pratyush

GSoC 2017 @fossasia | B.Tech. CSE @iiitv | Python freak | Backend Developer | Database and API Designer | Big Data Engineer | Data Scientist | OSS lover