The Query-Server can be used to search for a keyword or phrase on a search engine (Google, Yahoo, Bing, Ask, DuckDuckGo or Yandex) and get the results as JSON or XML. The tool also stores the searched query strings in a MongoDB database for analytical purposes. (The search engine scraper is based on the scraper at fossasia/searss.)
In this blog, we will talk about how to install the Query-Server and implement a search engine of your own choice as an enhancement.
How to clone the repository
Sign up / Login to GitHub and head over to the Query-Server repository. Then follow these steps.
1. Go ahead and fork the repository
https://github.com/fossasia/query-server
2. Star the repository
3. Get the clone of the forked version on your local machine using
git clone https://github.com/<username>/query-server.git
4. Add an upstream remote to synchronize your repository using
git remote add upstream https://github.com/fossasia/query-server.git
Getting Started
Setting up the Query-Server application basically consists of the following steps:
1. Installing Node.js dependencies
npm install -g bower
bower install
2. Installing Python dependencies (Python 2.7 and 3.4+)
pip install -r requirements.txt
3. Setting up MongoDB server
sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 7F0CEB10
echo "deb http://repo.mongodb.org/apt/ubuntu "$(lsb_release -sc)"/mongodb-org/3.0 multiverse" | sudo tee /etc/apt/sources.list.d/mongodb-org-3.0.list
sudo apt-get update
sudo apt-get install -y mongodb-org
sudo service mongod start
4. Now, run the query server:
python app/server.py
Go to http://localhost:7001/
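Once the server is running, you can quickly verify the API from Python. This is a minimal sketch, assuming the requests library is installed and the server is listening on its default port 7001 (the route and parameters correspond to the server.py excerpt shown later in this post):

import requests

# Ask the Google scraper for 10 results for 'fossasia' in JSON format
resp = requests.get('http://localhost:7001/api/v1/search/google',
                    params={'query': 'fossasia', 'num': 10, 'format': 'json'})
print(resp.status_code)  # 200 if the scrape succeeded
print(resp.json())       # each entry should carry a title and a link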
How to contribute:
Add a search engine of your own choice
You can add a search engine of your choice apart from the ones that already exist in the application.
Just add or edit four files and you are ready to go.
To add a search engine (say, Exalead):
1. Add exalead.py in the app/scrapers directory:

from __future__ import print_function
from generalized import Scraper


class Exalead(Scraper):  # Exalead class inheriting Scraper
    """Scraper class for Exalead"""

    def __init__(self):
        self.url = 'https://www.exalead.com/search/web/results/'
        self.defaultStart = 0
        self.startKey = 'start_index'

    def parseResponse(self, soup):
        """Parses the response and returns the list of results

        Returns: urls (list)
            [{'title': 'Title1', 'link': 'url1'}, {'title': 'Title2', 'link': 'url2'}, ...]
        """
        urls = []
        for a in soup.findAll('a', {'class': 'title'}):  # Scrape data from the corresponding tag
            url_entry = {'title': a.getText(), 'link': a.get('href')}
            urls.append(url_entry)
        return urls
Here, scraping the data depends on the tag/class from which we can find the link and the title of each result webpage.
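To see what parseResponse does in isolation, here is a small, self-contained sketch. The HTML fragment is made up and only mimics the structure of an Exalead results page, where each result link is an anchor with the class title:

from bs4 import BeautifulSoup

# Made-up fragment mimicking Exalead's result markup
html = '''
<a class="title" href="https://fossasia.org">FOSSASIA</a>
<a class="title" href="https://github.com/fossasia">FOSSASIA on GitHub</a>
'''
soup = BeautifulSoup(html, 'html.parser')

urls = []
for a in soup.findAll('a', {'class': 'title'}):  # same lookup as in parseResponse
    urls.append({'title': a.getText(), 'link': a.get('href')})

print(urls)
# [{'title': 'FOSSASIA', 'link': 'https://fossasia.org'},
#  {'title': 'FOSSASIA on GitHub', 'link': 'https://github.com/fossasia'}]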
2. Edit generalized.py in the app/scrapers directory:
from __future__ import print_function
import json
import sys
from google import Google
from duckduckgo import Duckduckgo
from bing import Bing
from yahoo import Yahoo
from ask import Ask
from yandex import Yandex
from exalead import Exalead # import exalead.py
scrapers = {
    'g': Google(),
    'b': Bing(),
    'y': Yahoo(),
    'd': Duckduckgo(),
    'a': Ask(),
    'yd': Yandex(),
    't': Exalead()  # Add Exalead to scrapers with the key 't'
}
From the scrapers dictionary, we can see which search engines the project supports and the key each one is registered under.
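For illustration only, the dictionary makes it easy to dispatch a query to the right scraper by its key. Note that the search method name below is an assumption for this sketch; the actual interface exposed by the base Scraper class lives in generalized.py:

# Hypothetical dispatch - assumes the base Scraper class exposes a
# search(query, num_results) method; check generalized.py for the real interface.
scraper = scrapers['t']                    # the key we registered for Exalead
results = scraper.search('fossasia', 10)   # hypothetical call
print(results)                             # expected: list of title/link entries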
3. Edit server.py in the app directory:

@app.route('/api/v1/search/<search_engine>', methods=['GET'])
def search(search_engine):
    try:
        num = request.args.get('num') or 10
        count = int(num)

        qformat = request.args.get('format') or 'json'
        if qformat not in ('json', 'xml'):
            abort(400, 'Not Found - undefined format')

        engine = search_engine
        if engine not in ('google', 'bing', 'duckduckgo', 'yahoo', 'ask', 'yandex', 'exalead'):  # Add exalead to the tuple
            err = [404, 'Incorrect search engine', qformat]
            return bad_request(err)

        query = request.args.get('query')
        if not query:
            err = [400, 'Not Found - missing query', qformat]
            return bad_request(err)
This checks whether the requested search engine is supported by the project.
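The effect of these checks can be observed by hitting the endpoint once the server is running. A short sketch, assuming the requests library is available and the server is on its default port:

import requests

base = 'http://localhost:7001/api/v1/search/'

# Supported engine with a query -> results from the Exalead scraper
print(requests.get(base + 'exalead', params={'query': 'fossasia'}).text)

# Unsupported engine -> handled by bad_request with a 404 'Incorrect search engine' payload
print(requests.get(base + 'altavista', params={'query': 'fossasia'}).text)

# Missing query -> handled by bad_request with a 400 'Not Found - missing query' payload
print(requests.get(base + 'exalead').text)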
4. Edit index.html in the app/templates directory:
<button type="submit" value="ask" class="btn btn-lg search btn-outline"><img src="{{ url_for('static', filename='images/ask_icon.ico') }}" width="30px" alt="Ask Icon"> Ask</button>
<button type="submit" value="yandex" class="btn btn-lg search btn-outline"><img src="{{ url_for('static', filename='images/yandex_icon.png') }}" width="30px" alt="Yandex Icon"> Yandex</button>
<button type="submit" value="exalead" class="btn btn-lg search btn-outline"><img src="{{ url_for('static', filename='images/exalead_icon.png') }}" width="30px" alt="Exalead Icon"> Exalead</button> # Add button for exalead
The data is scraped using the anchor tag with a specific class name.
For example, searching for fossasia using Exalead:
https://www.exalead.com/search/web/results/?q=fossasia&start_index=1
Here, after inspecting the links with your browser's developer tools, you will find that the anchor with the class name title holds both the link and the title of each result. So, scrape the data using the title-classed anchor tag.
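For reference, that example URL is essentially the request the scraper ends up making: generalized.py combines self.url from exalead.py with the query and the start_index key we defined. A rough, hypothetical reconstruction of that request:

import requests

# Hypothetical reconstruction of the request behind the example URL above;
# the exact request-building code lives in generalized.py.
url = 'https://www.exalead.com/search/web/results/'
payload = {'q': 'fossasia', 'start_index': 1}
response = requests.get(url, params=payload)
# response.text is the HTML that gets parsed by parseResponse via BeautifulSoup.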
Similarly, you can add other search engines as well.