Loklak Scraper JS implements scrapers for social media websites so that they can be reused on other platforms, such as Android or a native Java project. This keeps a single source for each scraper, which makes it easier to update the scrapers when the websites change. This blog explains how Loklak Wok Android, a peer for Loklak Server on the Android platform, uses the Twitter JS scraper to scrape tweets.
LiquidCore is a library for Android that can run standard NodeJS modules. However, the Twitter scraper can’t be used with it directly, because of the following problems:
- The scraper is implemented with 3rd party NodeJS libraries, like cheerio and request-promise-native, and LiquidCore doesn’t support 3rd party libraries.
- The scrapers are written in ES6, and as of now LiquidCore uses NodeJS 6.10.2, which doesn’t support ES6 completely.
So, if the 3rd party NodeJS libraries can be included in our scraper code and the ES6 code can be converted to ES5, LiquidCore can easily execute the Twitter scraper.
The 3rd party NodeJS libraries can be bundled into the Twitter scraper using Webpack, and the ES6 code can be transpiled to ES5 using Babel.
The required dependencies can be installed using:
$ npm install --save-dev webpack
$ npm install --save-dev babel-core babel-loader babel-preset-es2015
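To make the ES6-to-ES5 step more concrete, here is a minimal sketch of the kind of rewrite a transpiler performs. The ES5 half is written by hand for illustration and is not the exact output Babel emits; the names Es6Labeler and Es5Labeler are hypothetical and do not come from the scraper code.

```javascript
// ES6 source: class syntax, arrow function and template literal.
class Es6Labeler {
  constructor(prefix) {
    this.prefix = prefix;
  }
  label(names) {
    return names.map((name) => `${this.prefix}:${name}`);
  }
}

// Hand-written ES5 equivalent: constructor function, prototype method,
// ordinary function expression and string concatenation.
function Es5Labeler(prefix) {
  this.prefix = prefix;
}
Es5Labeler.prototype.label = function (names) {
  var self = this; // arrow functions capture `this`; ES5 needs an explicit reference
  return names.map(function (name) {
    return self.prefix + ":" + name;
  });
};

const es6Result = new Es6Labeler("twitter").label(["a", "b"]);
const es5Result = new Es5Labeler("twitter").label(["a", "b"]);
console.log(es6Result, es5Result);
```

Both versions behave the same; only the syntax differs, which is why the transpiled scraper can run on an older runtime.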
Bundling and Transpiling
Webpack bundles the code based on the configuration provided in webpack.config.js, present in the root directory of the project.
var fs = require('fs');

// Builds one webpack entry per file in the scrapers directory.
function listScrapers() {
  var src = "./scrapers/";
  var files = {};
  fs.readdirSync(src).forEach(function(data) {
    // "twitter.js" -> "twitter"
    var entryName = data.substr(0, data.indexOf("."));
    files[entryName] = src + data;
  });
  return files;
}

module.exports = {
  entry: listScrapers(),
  target: "node",
  module: {
    loaders: [
      {
        loader: "babel-loader",
        test: /\.js$/, // transpile every file ending in .js
        query: {
          presets: ["es2015"],
        }
      },
    ]
  },
  output: {
    path: __dirname + '/build',
    filename: '[name].js',
    libraryTarget: 'var',
    library: '[name]',
  }
};
Now let’s break down the config file. The function listScrapers returns a plain JavaScript object with the name of each scraper as key and the relative location of the scraper as value, e.g.:
{
  twitter: "./scrapers/twitter.js",
  github: "./scrapers/github.js"
  // same goes for other scrapers
}
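The mapping can be tried in isolation with a small sketch. The helper buildEntries below is hypothetical: it applies the same naming rule as listScrapers but takes the file names as an argument instead of reading the directory, and it additionally skips files that do not end in .js, which the original function does not do.

```javascript
const path = require("path");

// Hypothetical helper mirroring listScrapers without touching the file system.
function buildEntries(fileNames, srcDir) {
  const entries = {};
  fileNames
    .filter((fileName) => path.extname(fileName) === ".js") // assumption: ignore non-JS files
    .forEach((fileName) => {
      // Same rule as the config: everything before the first "." is the entry name.
      const entryName = fileName.slice(0, fileName.indexOf("."));
      entries[entryName] = srcDir + fileName;
    });
  return entries;
}

const entries = buildEntries(["twitter.js", "github.js", "README.md"], "./scrapers/");
console.log(entries);
```

Each key of the returned object later becomes the name of one bundle.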
The parameters in module.exports are set as described in the webpack documentation for multiple entry points and for using the generated output externally:
- entry: Since a bundle file is required for each scraper, we provide the object returned by the listScrapers function. The multiple entry points generate multiple bundled files.
- target: As the bundled files are to be used on the NodeJS platform, “node” is set here.
- module: With webpack the code can be transpiled directly while bundling, so the end users don’t need to run separate commands for transpiling. module contains the Babel configuration for transpiling.
- output: The options here customize the files that webpack generates:
  - path: Location where the bundled files are kept after compilation. “__dirname” is the directory containing webpack.config.js, i.e. the root directory of the project, so the bundles are written to its build subdirectory.
  - filename: Name of the bundled file. “[name]” here refers to the key of the object provided in entry, i.e. a key of the object returned from listScrapers. For example, for the Twitter scraper the bundled file will be named “twitter.js”.
  - libraryTarget: By default the functions or methods inside a bundled file can’t be used externally, i.e. they can’t be imported. With “var”, the export of the entry point is assigned to a variable, so the functions in the bundled module can be accessed.
  - library: The name of the library, i.e. the name of that variable. With “[name]”, the Twitter bundle is exposed as the variable twitter.
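The effect of libraryTarget “var” can be illustrated with a rough sketch. The string fakeBundle below is a hand-written stand-in, not real webpack output, and its stubbed getTweets simply returns a fixed tweet; it only imitates the shape "var twitter = ..." that lets code appended to the bundle call new twitter(). Node's vm module is used here just to evaluate the two pieces together in one scope.

```javascript
const vm = require("vm");

// Stand-in for a bundle built with libraryTarget "var" and library "twitter".
// Assumption: the real bundle exposes a constructor with a getTweets(query, callback) method.
const fakeBundle = `
var twitter = (function () {
  function Twitter() {}
  Twitter.prototype.getTweets = function (query, callback) {
    callback([{ text: "stub tweet about " + query }]);
  };
  return Twitter;
})();
`;

// Code appended after the bundle can use the variable directly.
const appendedCode = `
var tweets;
new twitter().getTweets("loklak", function (data) { tweets = data; });
`;

const sandbox = {};
vm.runInNewContext(fakeBundle + appendedCode, sandbox);
console.log(typeof sandbox.twitter, sandbox.tweets[0].text);
```

Without the library variable, the appended code would have no name through which to reach the scraper.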
Now, time to do the compilation work:
$ ./node_modules/.bin/webpack
The bundled files can be found in the build directory. However, the generated bundled files are large, around 77,000 lines, and such large files are not encouraged for production purposes. So, the “-p” flag is used to generate bundled files for production, which are around 400 lines.
$ ./node_modules/.bin/webpack -p
Using LiquidCore to execute bundled files
The generated bundled file can be copied to the raw directory in res (the resources directory in Android). Events are emitted from the Activity/Fragment, and in response to those events the scraping function is invoked in the bundled JS file present in the raw directory. The reverse direction is also possible.
So, we handle some events in our JS file and send some events to the Android Activity/Fragment. The event handling and event emitting code in the JS file:
var query = "";
LiquidCore.on("queryEvent", function(msg) {
  query = msg.query;
});
LiquidCore.on("fetchTweets", function() {
  var twitterScraper = new twitter();
  twitterScraper.getTweets(query, function(data) {
    LiquidCore.emit("getTweets", {"query": query, "statuses": data});
  });
});
LiquidCore.emit('start');
First, a “start” event is emitted from the JS file, which is consumed in the getScrapedTweet method of TweetHarvestingFragment using startEventListener.
EventListener startEventListener = (service, event, payload) -> {
    JSONObject jsonObject = new JSONObject();
    try {
        jsonObject.put("query", query);
        service.emit(LC_QUERY_EVENT, jsonObject); // value of LC_QUERY_EVENT is "queryEvent"
    } catch (JSONException e) {
        Log.e(LOG_TAG, e.toString());
    }
    service.emit(LC_FETCH_TWEETS_EVENT); // value of LC_FETCH_TWEETS_EVENT is "fetchTweets"
};
The startEventListener then emits “queryEvent” with a JSONObject that contains the query to search tweets for. This event is consumed in the JS file by:
var query = "";
LiquidCore.on("queryEvent", function(msg) {
  query = msg.query;
});
After “queryEvent”, the “fetchTweets” event is emitted from the fragment, which is handled in the JS file by:
LiquidCore.on("fetchTweets", function() {
  var twitterScraper = new twitter(); // scraping object is created
  twitterScraper.getTweets(query, function(data) { // function that scrapes twitter
    LiquidCore.emit("getTweets", {"query": query, "statuses": data});
  });
});
Once the scraped data is obtained, it is sent back to the fragment by emitting the “getTweets” event from the JS file; the payload {"query": query, "statuses": data} contains the scraped data. This event is consumed in Android by getTweetsEventListener.
EventListener getTweetsEventListener = (service, event, payload) -> { // payload contains scraped data
    Push push = mGson.fromJson(payload.toString(), Push.class);
    emitter.onNext(push);
};
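The order of the four events described above can be followed in a small simulation that runs in plain NodeJS. This is only a sketch: createBridge is a hypothetical stand-in for the LiquidCore bridge built on Node's EventEmitter, the "host" side plays the role of the fragment, and the scrape function passed in is a stub rather than the real Twitter scraper.

```javascript
const EventEmitter = require("events");

// Hypothetical bridge: what one side emits is delivered to the other side's listeners.
function createBridge() {
  const toJs = new EventEmitter();   // host -> JS
  const toHost = new EventEmitter(); // JS -> host
  return {
    js: { on: (name, fn) => toJs.on(name, fn), emit: (name, data) => toHost.emit(name, data) },
    host: { on: (name, fn) => toHost.on(name, fn), emit: (name, data) => toJs.emit(name, data) },
  };
}

function runHandshake(searchQuery, scrape) {
  const bridge = createBridge();
  const log = [];
  let received = null;

  // Host side: mirrors startEventListener and getTweetsEventListener.
  bridge.host.on("start", () => {
    log.push("start");
    bridge.host.emit("queryEvent", { query: searchQuery });
    bridge.host.emit("fetchTweets");
  });
  bridge.host.on("getTweets", (payload) => {
    log.push("getTweets");
    received = payload;
  });

  // JS side: mirrors the handlers in the bundled file.
  let query = "";
  bridge.js.on("queryEvent", (msg) => {
    log.push("queryEvent");
    query = msg.query;
  });
  bridge.js.on("fetchTweets", () => {
    log.push("fetchTweets");
    scrape(query, (data) => bridge.js.emit("getTweets", { query: query, statuses: data }));
  });
  bridge.js.emit("start");

  return { log: log, received: received };
}

// Stub scraper that answers immediately with one fake tweet.
const outcome = runHandshake("loklak", (q, done) => done([{ text: "stub tweet for " + q }]));
console.log(outcome.log.join(" -> "));
```

The host listeners are registered before the JS side emits “start”; otherwise the first event would be lost in this simulation.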
LiquidCore creates a NodeJS instance to execute the bundled JS file. The NodeJS instance is called a MicroService in LiquidCore terminology. For all this event handling to work, the NodeJS instance is created inside the method with a ServiceStartListener in which all the EventListeners are added.
MicroService.ServiceStartListener serviceStartListener = (service -> {
    service.addEventListener(LC_START_EVENT, startEventListener);
    service.addEventListener(LC_GET_TWEETS_EVENT, getTweetsEventListener);
});
URI uri = URI.create("android.resource://org.loklak.android.wok/raw/twitter"); // Note .js is not used
MicroService microService = new MicroService(getActivity(), uri, serviceStartListener);
microService.start();