Data Scraping with Selenium

There are still websites without any API. Scraping data from such sites can be very time-consuming and manual. I created samples for the Open Event App generator, and one of them was for All Hands Hawaii 2016. That site didn’t have any API to enable easy data scraping.

How do we find out if a website is using an API or not?

Using Google Chrome, go to View → Developer → Developer Tools. Under Network → XHR, reload the page and look through the listed requests for an API endpoint; this usually takes a bit of trial and error. (XHR stands for XMLHttpRequest.)
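Once a candidate URL shows up in the XHR list, a quick way to confirm it is an API endpoint is to check whether it answers with JSON. The following is a minimal sketch of that check using only the Python standard library; the helper names is_json_response and probe are made up for illustration, and the URL in the usage note is a placeholder.

```python
import json
import urllib.request


def is_json_response(content_type, body):
    """Return True if the response declares JSON or its body parses as JSON."""
    if "json" in content_type.lower():
        return True
    try:
        json.loads(body)
    except ValueError:  # covers JSONDecodeError and undecodable bytes
        return False
    return True


def probe(url):
    """Fetch url and report whether it looks like a JSON API endpoint."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        content_type = resp.headers.get("Content-Type", "")
        return is_json_response(content_type, resp.read())
```

Calling probe("https://example.com/api/sessions") (a hypothetical URL) would return True only if the server answers with JSON. An ordinary HTML page returns False, which suggests the data has to be scraped from the rendered page instead.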

However, what if the site does not use an API? How would you scrape the data in that case? Would you click on every hyperlink and visit every page, copying and pasting the data by hand? Could someone do that manual job for you? Or better, could “something” do that job for you? Yes, it’s Selenium.

Selenium Web Browser Automation

Selenium is a tool that automates browsing the web. Although it is primarily used for web testing, nothing restricts its use to that purpose.

Let’s get started with the basics of Selenium:

INSTALLATION:
  1. Run the following command: pip install selenium (Quick Tip: It is advised to use virtualenv)
  2. Selenium requires a driver to control the browser, and each browser has its own driver. Choose the driver that matches your browser. Some common drivers are listed below (Source):

Chrome: https://sites.google.com/a/chromium.org/chromedriver/downloads

Firefox: https://github.com/mozilla/geckodriver/releases

Safari: https://webkit.org/blog/6900/webdriver-support-in-safari-10/

BASIC FUNCTIONALITIES OF SELENIUM:

Visit a page (using get()):

 driver.get(urlofpage)
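As a rough end-to-end illustration of get(), here is a minimal Python sketch. It assumes Chrome and a matching chromedriver are available on the machine; open_page and main are hypothetical names, and https://example.com is a placeholder URL.

```python
def open_page(driver, url):
    """Load url in the browser controlled by driver and return the page title."""
    driver.get(url)  # blocks until the page's load event has fired
    return driver.title


def main():
    # Imported here so open_page can be reused with any WebDriver-like object.
    from selenium import webdriver

    driver = webdriver.Chrome()  # assumption: Chrome and its driver are installed
    try:
        print(open_page(driver, "https://example.com"))
    finally:
        driver.quit()  # always close the browser, even if the visit fails
```

Calling main() opens a browser window, prints the page title, and closes the browser again. Wrapping the work in try/finally avoids leaving stray browser processes behind when a page fails to load.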

Locating elements on the current webpage (the snippets below use the Java bindings; the Python bindings expose the same strategies through driver.find_element(By.ID, ...) and driver.find_elements(...)):

  1. BY ID:
    WebElement element = driver.findElement(By.id("ui_elementid"));
  2. BY CLASS NAME:
    List<WebElement> cheeses = driver.findElements(By.className("cheese"));
  3. BY TAG NAME:
    WebElement tag = driver.findElement(By.tagName("tag_name"));
  4. BY CSS:
    WebElement cs = driver.findElement(By.cssSelector("#ui_elementid"));
  5. BY LINK TEXT:
    WebElement cheese = driver.findElement(By.linkText("blog"));

    This matches a link whose visible text is "blog", for example <a href="urlofpage?q=blog">blog</a>.

  6. BY XPATH:
    List<WebElement> xp = driver.findElements(By.xpath("//input"));
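For readers who installed Selenium through pip, the same locator strategies are available in the Python bindings. The sketch below is one possible way to use them, assuming Selenium 4 and a working Chrome driver; link_targets and main are hypothetical names, and https://example.com is a placeholder URL.

```python
def link_targets(driver, by, value):
    """Return (text, href) pairs for every element matched by the locator."""
    elements = driver.find_elements(by, value)
    return [(el.text, el.get_attribute("href")) for el in elements]


def main():
    # Imported here so link_targets works with any WebDriver-like object.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()  # assumption: Chrome and its driver are installed
    try:
        driver.get("https://example.com")
        # find_element returns the first match and raises if nothing matches
        heading = driver.find_element(By.TAG_NAME, "h1")
        print(heading.text)
        # find_elements returns a (possibly empty) list of matches
        print(link_targets(driver, By.CSS_SELECTOR, "a"))
        print(link_targets(driver, By.XPATH, "//a[@href]"))
    finally:
        driver.quit()
```

The other strategies follow the same pattern with By.ID, By.CLASS_NAME and By.LINK_TEXT. For scraping, find_elements is often the safer choice because an empty list is easier to handle than an exception on pages that lack the element.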

 

We can also use JavaScript along with Selenium. This might be a helpful thread for the same.
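In the Python bindings, JavaScript is run through driver.execute_script(), which returns whatever the script returns. A small sketch of one common scraping use, scrolling so that lazily loaded content appears, might look like this; scroll_to_bottom is a hypothetical helper name.

```python
def scroll_to_bottom(driver):
    """Scroll the current page to the bottom and return the document height."""
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    # A value returned by the script is passed back to Python.
    return driver.execute_script("return document.body.scrollHeight;")
```

Calling this in a loop until the returned height stops growing is one way to load pages that fetch more items as the user scrolls.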

Another really good tutorial on web scraping with Selenium and Python is http://www.marinamele.com/selenium-tutorial-web-scraping-with-selenium-and-python