I was creating samples for Open Event Android and Open Event Webapp when the idea of web scraping through scripts struck me.
I chose Python for web scraping. But what more is there to Python than its user-friendliness, easy syntax and speed? It’s the wide variety of open-source libraries that come along with it!
One of them is the BeautifulSoup library for extracting data from HTML and XML files. It provides ways to search and sort through webpages, find the specific elements that you need and extract them into objects of your preference.
Apart from BeautifulSoup, another module that is easily neglected but of great use is urllib2. This module opens the URLs that are fed to it. (urllib2 exists only in Python 2; in Python 3 the same functionality lives in urllib.request.)
Hence, with a combination of the above two Python modules, I was able to scrape the main sample data such as speakers, sessions, tracks, sponsors and so on for PyCon 2017 (a sample that I created recently) more quickly and efficiently.
I will now talk about these modules, and how I use them in my web scraping scripts, in greater detail.
Web Scraping with BeautifulSoup
Start by installing the BeautifulSoup module in your Python environment. I find it easiest to install it with pip. Run the following command:-
pip install beautifulsoup4
Next up, you need to import the module into your script by adding the following line:-
from bs4 import BeautifulSoup
Before BeautifulSoup can parse a webpage, we first have to open the URL of the page that is being scraped. We use urllib2 to do that.
import urllib2
site=urllib2.urlopen("https://us.pycon.org/2017/sponsors/")
Next we use BeautifulSoup to get the contents of the webpage in the following way:-
website=BeautifulSoup(site, "html.parser")
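The two steps above (open the URL, then hand the response to BeautifulSoup) can be sketched for Python 3 as follows. This is only a sketch under assumptions: it uses urllib.request in place of urllib2, picks Python's built-in "html.parser", and the helper names parse_html and fetch_and_parse as well as the sample markup are hypothetical, not part of my original scripts.

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup


def parse_html(markup):
    """Parse an HTML string (or bytes) with Python's built-in parser."""
    return BeautifulSoup(markup, "html.parser")


def fetch_and_parse(url, timeout=10):
    """Download a page and return it as a BeautifulSoup object."""
    # urlopen returns a file-like response; the with-block closes it for us.
    with urlopen(url, timeout=timeout) as response:
        return parse_html(response.read())


# Offline demonstration on a hypothetical snippet, so no network is needed.
sample = "<html><head><title>Sponsors</title></head><body></body></html>"
soup = parse_html(sample)
print(soup.title.string)
```

For a live page, something like fetch_and_parse("https://us.pycon.org/2017/sponsors/") would take the place of the inline sample. Naming the parser explicitly keeps the result the same across machines that have different parsers installed.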
Suppose that I next want to obtain the data inside all the <div> tags in the webpage that have class="sponsor-level".
I will use BeautifulSoup's ‘find_all’ method to do so:-
divs=website.find_all("div",{'class':'sponsor-level'})
Here, divs will be a list-like ResultSet containing every <div> tag of class="sponsor-level", each as a Tag object that still holds its nested HTML. Now I can do whatever I want with this list: parse it further, or store its contents in any kind of data type, without any difficulty.
BeautifulSoup also helps in accessing the child tags or elements of HTML or XML tags using the ‘dot’ (.) operator. The following example will make it clearer:-
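To make the result of find_all concrete, here is a small sketch that runs on a hypothetical inline snippet rather than the real PyCon page. The markup and the variable names same_divs and level_names are assumptions for illustration only.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two sponsor levels plus one unrelated <div>.
html = """
<div class="sponsor-level"><h2>Gold</h2></div>
<div class="sponsor-level"><h2>Silver</h2></div>
<div class="footer"><h2>Contact</h2></div>
"""
soup = BeautifulSoup(html, "html.parser")

# Filter by tag name and by class, using an attribute dictionary.
divs = soup.find_all("div", {"class": "sponsor-level"})

# The class_ keyword is an equivalent spelling of the same filter.
same_divs = soup.find_all("div", class_="sponsor-level")

# Every item is a Tag, so it can be searched or navigated further.
level_names = [div.h2.get_text(strip=True) for div in divs]
print(len(divs), level_names)
```

Only the two matching tags come back; the footer <div> is skipped because its class differs.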
tag = soup.b
Here, ‘tag’ will contain the first <b> element found inside the parent element that is stored in the ‘soup’ variable. To get every <b> element, use find_all("b") instead.
To access the attributes of HTML or XML tags, treat the tag like a dictionary and index it by the attribute name.
For example, to access the ‘href’ attribute of the first <a> tag in the webpage:-
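A short sketch on a hypothetical snippet shows what dot access returns in practice: it gives the first matching descendant only, whereas find_all gives all of them. The markup and variable names here are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with two <b> elements inside one paragraph.
html = "<p>First <b>bold one</b> and <b>bold two</b></p>"
soup = BeautifulSoup(html, "html.parser")

first_b = soup.b            # only the first <b> in the document
all_b = soup.find_all("b")  # every <b> in the document
nested = soup.p.b           # dot access can be chained: first <b> inside the first <p>
missing = soup.table        # a tag that does not exist yields None

print(first_b.string)
print([b.string for b in all_b])
print(missing)
```

Because a missing tag yields None, chaining dots on a page whose layout may change is worth guarding with a check before going deeper.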
soup.find_all("a")[0]['href']
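The sketch below expands on attribute access using a hypothetical snippet. It shows indexing a tag like a dictionary, the safer get method for attributes that may be absent, and filtering for tags that actually carry an href. The URLs and variable names are made up for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one link with an href, one anchor without.
html = (
    '<a href="https://example.com/a" class="ext link">A</a>'
    '<a name="anchor-only">B</a>'
)
soup = BeautifulSoup(html, "html.parser")
links = soup.find_all("a")

first_href = links[0]["href"]       # dictionary-style access
classes = links[0]["class"]         # multi-valued attributes come back as a list
second_href = links[1].get("href")  # get() returns None instead of raising KeyError

# href=True keeps only the <a> tags that have an href attribute at all.
hrefs = [a["href"] for a in soup.find_all("a", href=True)]

print(first_href, classes, second_href, hrefs)
```

Indexing with square brackets raises a KeyError when the attribute is missing, so get is the gentler choice when scraping pages with inconsistent markup.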
A sample script…
The following is a script that I used to obtain the sponsors from the PyCon 2017 site in a JSON-ready format (a list of dictionaries).
import urllib2
from bs4 import BeautifulSoup
# Download and parse the sponsors page
site=urllib2.urlopen("https://us.pycon.org/2017/sponsors/")
s=BeautifulSoup(site, "html.parser")
# Each sponsorship level sits in its own <div class="sponsor-level">
divs=s.find_all("div",{'class':'sponsor-level'})
spons=[]
for level in divs:
    # The level name is the first <h2> inside the level's div
    levelname=level.find_all("h2")[0].string
    level_sponsors=level.find_all("div",{'class':'sponsor'})
    for level_sponsor in level_sponsors:
        anchor=level_sponsor.find_all("a")
        # Build one dictionary per sponsor
        dictionary={}
        dictionary['id']=""
        dictionary['name']=anchor[1].string
        dictionary['level']=""
        dictionary['sponsor_Type']=levelname
        dictionary['url']=anchor[0]['href']
        dictionary['description']=level_sponsor.find_all("p")[1].string
        spons.append(dictionary)
print spons
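For readers on Python 3, here is a hedged sketch of the same idea that runs offline. It is not the script I ran: the function name extract_sponsors is hypothetical, and the sample markup is an assumption that only mimics the structure the script relies on (two <a> tags and two <p> tags per sponsor), not the real PyCon page. It also serializes the result with json.dumps so that the output is actual JSON.

```python
import json

from bs4 import BeautifulSoup

# Hypothetical markup imitating the structure the script above expects.
SAMPLE_HTML = """
<div class="sponsor-level">
  <h2>Gold</h2>
  <div class="sponsor">
    <a href="https://example.com/acme"><img src="acme.png"></a>
    <p><a href="https://example.com/acme">Acme Corp</a></p>
    <p>Acme builds hypothetical widgets.</p>
  </div>
</div>
<div class="sponsor-level">
  <h2>Silver</h2>
  <div class="sponsor">
    <a href="https://example.com/globex"><img src="globex.png"></a>
    <p><a href="https://example.com/globex">Globex</a></p>
    <p>Globex is a made-up company.</p>
  </div>
</div>
"""


def extract_sponsors(markup):
    """Return one dictionary per sponsor found in the markup."""
    soup = BeautifulSoup(markup, "html.parser")
    sponsors = []
    for level in soup.find_all("div", {"class": "sponsor-level"}):
        heading = level.find("h2")
        level_name = heading.get_text(strip=True) if heading else ""
        for sponsor in level.find_all("div", {"class": "sponsor"}):
            anchors = sponsor.find_all("a")
            paragraphs = sponsor.find_all("p")
            # Skip entries that do not match the assumed layout.
            if len(anchors) < 2 or len(paragraphs) < 2:
                continue
            sponsors.append({
                "id": "",
                "name": anchors[1].get_text(strip=True),
                "level": "",
                "sponsor_Type": level_name,
                "url": anchors[0].get("href", ""),
                "description": paragraphs[1].get_text(strip=True),
            })
    return sponsors


sponsors = extract_sponsors(SAMPLE_HTML)
print(json.dumps(sponsors, indent=2))
```

The length checks are a design choice of this sketch: indexing anchor[1] directly, as the original script does, raises an IndexError as soon as one sponsor block deviates from the expected layout.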
BeautifulSoup has a lot of other functionality that can make your life easier. It is described in the official BeautifulSoup documentation.