In the previous blog post, I discussed how to fetch the URLs for Twitter videos in parts (.ts extension). But getting a video in parts is of little use, as loklak users would have to carry out the following tasks in order to make sense of it:
- Placing the parts in the correct order (the videos are divided into 3-second sections).
- Having the proper libraries and a video player that can play .ts files.
This would require fairly complex loklak clients, so the requirement was to provide the complete video as a single link with a popular extension. In this blog post, I’ll discuss how I managed to get links to complete Twitter videos.
Guests and Twitter Videos
Most of the content on Twitter is publicly accessible and we don’t need an account to access it. And this public content includes videos too. So, there should be some way in which Twitter would be handling guest users and serving them the videos. We needed to replicate the same flow in order to get links to those videos.
Problem with Twitter video and static HTML
Let us consider the video from a tweet of @HiHonourIndia –
We can see that the page is rendered using ReactJS and we also have the direct link for the video –
“So what’s the problem then? We can just request the web page and parse HTML to get video link, right?”
Wrong. As I mentioned earlier, the pages are rendered using React and when we initially request it, it looks something like this –
The HTML contains no link to the video whatsoever. Since this static HTML is all the scraper receives, it wouldn’t find any video link either.
We, therefore, need to mimic the flow that the web app follows internally to get the video links and play the videos.
Mimicking the flow of Twitter Mobile to get video links
After tracking the XHR requests made by the Twitter Mobile web app, one can come up with the following flow to get video URLs.
Mobile URL for a Tweet
Getting the mobile URL for a tweet is very simple –
Here, the tweet URL is a relative path of the type /user/tweetID.
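As a minimal sketch of this step, the mobile URL can be built by prefixing the tweet path with the mobile host. The class name `MobileTweetUrl`, the method name and the host `https://mobile.twitter.com` are assumptions made for illustration, not necessarily what loklak uses.

```java
// Sketch: build the mobile URL for a tweet from its relative path.
final class MobileTweetUrl {
    // Assumed host of the Twitter Mobile web app.
    private static final String MOBILE_HOST = "https://mobile.twitter.com";

    // tweetPath is a relative path of the type /user/tweetID.
    static String fromTweetPath(String tweetPath) {
        // Tolerate a missing leading slash so the result is always well formed.
        String path = tweetPath.startsWith("/") ? tweetPath : "/" + tweetPath;
        return MOBILE_HOST + path;
    }
}
```

For example, `MobileTweetUrl.fromTweetPath("/user/tweetID")` yields `https://mobile.twitter.com/user/tweetID`.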
Guest Token and Bearer JS URL
The Bearer JS is a script file that contains the Bearer Token. The Bearer Token, along with a Guest Token, is used to authenticate with the Twitter API and get details about a conversation. The guest token and the bearer script URL can be extracted from the static mobile page –
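One way to sketch this extraction is with two regular expressions run over the static HTML. Both patterns below are assumptions about how the mobile page looked (a `gt=<digits>;` cookie assignment in an inline script, and a script tag whose `src` points to a "main" bundle); the class and method names are hypothetical as well.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch: pull the guest token and the bearer script URL out of the mobile page HTML.
final class MobilePageTokens {
    // Assumed: the page sets a cookie of the form gt=<digits>; in an inline script.
    private static final Pattern GUEST_TOKEN = Pattern.compile("gt=([0-9]+);");
    // Assumed: the bearer script is loaded from a URL containing "main" and ending in ".js".
    private static final Pattern BEARER_JS_URL =
            Pattern.compile("<script[^>]*src=\"([^\"]*main[^\"]*\\.js)\"");

    // Returns the first capture group of the first match, if any.
    private static Optional<String> firstGroup(Pattern pattern, String html) {
        Matcher matcher = pattern.matcher(html);
        return matcher.find() ? Optional.of(matcher.group(1)) : Optional.empty();
    }

    static Optional<String> guestToken(String html) {
        return firstGroup(GUEST_TOKEN, html);
    }

    static Optional<String> bearerJsUrl(String html) {
        return firstGroup(BEARER_JS_URL, html);
    }
}
```

Returning `Optional` makes the "page layout changed" case explicit, so the caller can skip video extraction for that tweet instead of failing the whole scrape.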
Getting Bearer Token from Bearer JS URL
The following simple method can be used to fetch the Bearer Token from URL –
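A hedged sketch of such a method: download the script and search it for the token. The `BEARER_TOKEN:"..."` pattern is an assumption about the script's contents, and `BearerTokenExtractor` with its methods is a hypothetical helper, not the code from the patch.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch: fetch the bearer script and extract the Bearer Token from it.
final class BearerTokenExtractor {
    // Assumed: the script contains a literal of the form BEARER_TOKEN:"<token>".
    private static final Pattern BEARER_TOKEN = Pattern.compile("BEARER_TOKEN:\"(.*?)\"");

    // Downloads the script body as UTF-8 text (one HTTP request).
    static String fetch(String bearerJsUrl) throws IOException {
        try (InputStream in = URI.create(bearerJsUrl).toURL().openStream()) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
    }

    // Extracts the token from already downloaded script source.
    static Optional<String> fromScript(String jsSource) {
        Matcher matcher = BEARER_TOKEN.matcher(jsSource);
        return matcher.find() ? Optional.of(matcher.group(1)) : Optional.empty();
    }
}
```

Keeping the download and the parsing separate means the parsing can be checked against a saved copy of the script without touching the network.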
Using the Guest Token and Bearer Token to get Video Links
The following method demonstrates the process of getting video links once we have all the required information –
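The request itself can be sketched as follows. The endpoint `https://api.twitter.com/2/timeline/conversation/<tweetID>.json` and the `x-guest-token` header name are assumptions based on what the mobile web app appeared to send; the class and method names are hypothetical. Parsing the JSON response for the video variants is left out here.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: prepare an authenticated request for the conversation that holds the video details.
final class ConversationRequest {
    // Assumed API endpoint prefix used by the mobile web app.
    private static final String CONVERSATION_API = "https://api.twitter.com/2/timeline/conversation/";

    static String apiUrl(String tweetId) {
        return CONVERSATION_API + tweetId + ".json";
    }

    // Both tokens are needed: the bearer token identifies the app, the guest token the session.
    static Map<String, String> authHeaders(String bearerToken, String guestToken) {
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put("Authorization", "Bearer " + bearerToken);
        headers.put("x-guest-token", guestToken); // assumed header name
        return headers;
    }

    // Opens (but does not yet send) the request with the authentication headers set.
    static HttpURLConnection open(String tweetId, String bearerToken, String guestToken)
            throws IOException {
        HttpURLConnection connection =
                (HttpURLConnection) URI.create(apiUrl(tweetId)).toURL().openConnection();
        authHeaders(bearerToken, guestToken).forEach(connection::setRequestProperty);
        return connection;
    }
}
```

Reading `connection.getInputStream()` would then send the request; the video links would have to be picked out of the returned JSON.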
Checking if a Tweet contains video
To recognise whether a tweet contains a video, we can add the following lines to TwitterScraper.java –
Though this method successfully extracts the links to complete Twitter videos, it makes the scraping process very slow. This is because, for every tweet that contains a video, three HTTP requests are made in order to finalise the tweet. Keeping in mind that there are up to 20 tweets per search from Twitter, there are instances where more than 10 of them contain videos (more than 30 HTTP requests). There is also a lot of JSON and regex processing involved, which adds a little to the slowdown.
This post explained how loklak server was improved to fetch links to complete Twitter videos, and the exact flow of requests needed to achieve this. The changes were proposed in pull request loklak/loklak_server#1206.
- TwitterScraper.java after the patch – https://github.com/loklak/loklak_server/blob/7c513ae7949718399a7cf06bb19e80da362ef396/src/org/loklak/harvester/TwitterScraper.java.
- Twitter API documentation – https://dev.twitter.com/docs.
- Monitoring XHR requests in Chrome web browser – https://www.codexworld.com/how-to/monitor-ajax-requests-google-chrome/.
- React dev tools (Chrome extension) on GitHub – https://github.com/facebook/react-devtools.
- A blog post explaining how to utilise XHR requests to scrape – https://www.codementor.io/codementorteam/how-to-scrape-an-ajax-website-using-python-qw8fuitvi.