Web scraping can most succinctly be described as "creating an API where there is none". It is mainly used to harvest data from the web that cannot easily be downloaded manually or that is not offered for direct download. Scraped data can be used for a variety of purposes, like online price comparison, detecting changes in web page content, real-time data integration and web mashups.


Scripting languages are well suited for web scraping because they provide an interactive interpreter, which helps a lot while developing the scraper. Being able to try out new XPath expressions and get instant feedback saves a huge amount of time during development. Scripting languages like Python, Ruby and Perl are popular choices for web scraping, and they also have a large number of libraries that help with fetching and extracting data.

Most of the popular libraries, like Mechanize, have been ported to most of the popular scripting languages. We have chosen Ruby here since it offers a solid set of libraries, comprehensive documentation and a huge user base.

Considerations before Web Scraping

  1. Is the content being scraped copyright protected?
  2. Does the website allow for web scraping?
  3. Are there any data protection policies applied by the servers?

The first two questions can usually be answered by consulting the robots.txt file, reading the website's terms and conditions, or contacting the website administrators.
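As a quick illustration of the robots.txt check, here is a minimal sketch using only the Ruby standard library. Real robots.txt files can be more involved (per-agent groups, wildcards, Allow rules), so treat this as a starting point rather than a full parser:

```ruby
# Return true if `path` is disallowed for all user agents ("*")
# according to the given robots.txt text. Deliberately simplified.
def disallowed?(robots_txt, path)
  applies_to_us = false
  robots_txt.each_line do |line|
    line = line.strip
    case line
    when /\AUser-agent:\s*(.+)\z/i
      applies_to_us = ($1.strip == "*")
    when /\ADisallow:\s*(\S+)/i
      return true if applies_to_us && path.start_with?($1)
    end
  end
  false
end

robots = "User-agent: *\nDisallow: /private/\n"
puts disallowed?(robots, "/private/data")   # prints "true"
puts disallowed?(robots, "/wiki/Ruby")      # prints "false"
```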

In this article, we will scrape all the reference links and the Further Reading text from the Wikipedia page on the Ruby language, using the Mechanize and Nokogiri gems.

Mechanize will primarily be used to fetch the pages, and Nokogiri will be used to find the specific elements to extract from them. Mechanize can do a lot of cool things, like submitting forms and following links in a page, but for the purpose of this tutorial it will only be used to fetch the HTML markup. The URL of the page to be scraped will be passed as a command line argument to the script.

Tools used

First things first, you will need the following Ruby version and Ruby gems to be installed on your machine.

  1. Ruby >= 1.9.3
  2. Mechanize gem.
  3. Nokogiri gem.

Installing Ruby Language

If you are using Windows, you can download a binary installation file from the official Ruby website and install it.

If you are using Ubuntu, then the following command will install Ruby 1.9.3 on your machine.

$ sudo apt-get install ruby1.9.3

Installing Mechanize gem

$ sudo gem install mechanize

Installing Nokogiri gem

$ sudo gem install nokogiri

That's it. That is all you need to start scraping (most of) the web.

Code

The following section shows the code used in this tutorial. We have named our file 'scraper.rb', since Ruby uses the ".rb" extension.

scraper.rb

require "mechanize"
require "nokogiri"

url = ARGV[0]
fp = File.new("wikilinks.txt", "w")

agent = Mechanize.new { |a| a.user_agent_alias = "Mac Safari" }

html = agent.get(url).body

html_doc = Nokogiri::HTML(html)

fp.write("References\n\n")

list = html_doc.xpath("//ol[@class='references']")
list.each { |i| fp.write(i.text + "\n") }

fp.write("Further Reading\n\n")

list = html_doc.xpath("//span[@class='citation']")
list.each { |i| fp.write(i.text + "\n") }

fp.close

Script usage

To run the script, navigate to the directory where you have stored the source file (scraper.rb) and execute the following command.

$ ruby scraper.rb "http://en.wikipedia.org/wiki/Ruby_(programming_language)" 

Code Description

Let's look at the explanation for the code line by line.

  • The first two lines load the required libraries (mechanize and nokogiri) into our script.

  • Next, we assign the URL passed on the command line to the url variable.

  • We then open a new file to which we write the scraped content. The file is named 'wikilinks.txt' and is opened in write mode.

  • Before we can fetch the HTML of the page, we need to create a new Mechanize object and identify ourselves as a common user agent; the user_agent_alias setting takes care of the latter. The next line fetches the HTML source of the page.

  • The source HTML of the page is stored in the html variable at this point. All we need to do now is extract the required contents from it. To do this, we create a new Nokogiri HTML document.

  • At this stage, we are all set to extract the data from the page.

  • There are a couple of ways to go about this with Nokogiri:

    • Using XPath
    • Using CSS selectors

      We will be using XPath here, since the elements we want are easy to pinpoint with it.

  • If you look at the HTML source of the page (right-click on the list and select 'Inspect Element' in your browser), you can see that the references are all defined in an ordered list with the class 'references'. This makes the links easy to extract, since we can uniquely identify the data.

  • The extracted data is stored in the list variable. We then cycle through it and write its contents to the output file.

  • To get the citations on the page, we identify the data by its class, 'citation', through XPath. We then follow the same procedure we used for writing the References section to the file.

Web Scraping Considerations

Here are a few considerations to keep in mind when you are scraping data.

Check if there is already an API in place. Always.

Wikipedia, for example, already has an API through which you can get all the required data. An API is a much cleaner and faster way to access data, and the website's servers will also have an easier time. We chose Wikipedia for this tutorial simply because it has well-structured HTML content and all of its data is already in the public domain.
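As a sketch, Wikipedia's MediaWiki API can return a page's rendered content as JSON via its action=parse endpoint (see the MediaWiki API documentation for the full parameter list). The standard library is enough to build such a request URL:

```ruby
require "uri"

# Build a MediaWiki API request for the same page the scraper targets.
# action=parse returns the rendered page; format=json asks for JSON output.
params = {
  "action" => "parse",
  "page"   => "Ruby_(programming_language)",
  "format" => "json"
}
query   = URI.encode_www_form(params)
api_url = "https://en.wikipedia.org/w/api.php?#{query}"

puts api_url
# You could then fetch api_url with Mechanize or Net::HTTP and parse the JSON.
```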

Make sure you are not consuming the server resources greedily.

Serving pages on the web takes resources, and it is easy for a server to get bogged down if it receives too many requests. Make sure that you limit the number of requests you send per unit of time.
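One simple way to enforce such a limit is to sleep between requests. Here is a minimal client-side rate limiter; the one-second interval is an arbitrary example, so pick a rate the site can tolerate:

```ruby
# Block before each request until at least `min_interval` seconds
# have passed since the previous one.
class Throttle
  def initialize(min_interval)
    @min_interval = min_interval
    @last = nil
  end

  def wait
    if @last
      gap = Time.now - @last
      sleep(@min_interval - gap) if gap < @min_interval
    end
    @last = Time.now
  end
end

throttle = Throttle.new(1.0)  # at most one request per second
3.times do |i|
  throttle.wait
  puts "fetching page #{i}"   # agent.get(...) would go here
end
```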

Always make sure that you follow the TOS of the site.

Websites usually have a set of rules that you have to follow when you use their data. Some websites may limit how long you can store the data on your machine or how you can use it. Be aware of their TOS and use the data accordingly.

There we go. That was a small and quick introduction to the world of Web Scraping and the Mechanize and Nokogiri gems.

