Update Jan 22: Check out the next post in this series: Advanced web scraping with Mechanize.
Scraping the web with Ruby is easier than you might think. Let’s start with a simple example: I want to get a nicely formatted JSON array of objects representing all the showings at my local independent cinema.
First we need a way to download the HTML page that has all the listings on it. Ruby comes with an HTTP client, Net::HTTP, and it also comes with a nice wrapper around it, open-uri. So the first thing we do is grab the HTML from the remote server.
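As a minimal sketch of this download step, assuming a current Ruby (since Ruby 3.0, open-uri is called through URI.open rather than a bare open): the URL and the fetch_html helper name below are hypothetical placeholders, not taken from the original script.

```ruby
require 'open-uri'

# Hypothetical listings URL; substitute the page you want to scrape.
LISTINGS_URL = 'http://example.com/whats-on'

# Download a page and return its body as a String.
def fetch_html(url)
  URI.open(url, &:read)
end

# Usage (performs a real HTTP request, so it is left commented out here):
# html = fetch_html(LISTINGS_URL)
```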
Great, so we’ve got the page that we want to scrape. Now we need to extract some information from it. The best tool for this job is Nokogiri, so we create a new Nokogiri document from the HTML we just downloaded.
Nokogiri is great because it allows us to query the HTML using CSS selectors, which, in my opinion, is much simpler than using XPath.
OK, now we’ve got a document that we can query for the cinema listings. Each individual listing’s HTML structure is something like the following.
Processing the HTML
Each showing has the class .showing, so we can select all the showings on the page and loop over them, processing each one in turn.
Let’s break down the code above and see what each part is doing.
First we get the showing’s unique id, which is helpfully exposed as part of the HTML id attribute in the markup. Using square brackets allows us to access attributes of the element, so using the HTML above as an example, the return value of showing['id'] would be "event_7557". We’re only interested in the integer id, so we split the resulting string on the underscore with .split('_'), then take the last element of that array and convert it to an integer.
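The id extraction can be tried in plain Ruby without any HTML. This is only a sketch: the showing_id helper name is hypothetical, the string stands in for the value of showing['id'], and using .last and .to_i for the final two steps is an assumption based on the description.

```ruby
# Extract the integer id from an HTML id attribute such as "event_7557".
def showing_id(html_id)
  # Split on the underscore, keep the last piece, convert it to an Integer.
  html_id.split('_').last.to_i
end

puts showing_id('event_7557')
```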
Next we find all the tags for a showing by using the .css method, which returns an array-like set of matching elements. We then map over these elements, taking the text content of each and stripping any excess whitespace. For the HTML above this would return ["comedy", "dvd", "film"].
The code to get the title is a bit more involved because the title element in the HTML contains some extra spans with a prefix and a suffix. First we get the title element using .at_css, which returns the first matching element. Then we loop over the children of the title element and remove any spans. Finally, with the spans gone, we get the text of the title element and strip out any excess whitespace.
Getting the date and time of a showing is a bit involved because a showing can be on multiple days, and sometimes there is also pricing information in the same element. We map the dates that we find to DateTime.parse so that the result is an array of Ruby DateTime objects.
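To illustrate the mapping to DateTime.parse, here is a small sketch. It assumes the dates arrive as one date per line of text, which may not match the real markup; the parse_showing_dates helper and the sample dates are hypothetical, and the pricing information mentioned above is not handled.

```ruby
require 'date'

# Turn the text of a dates element into an array of DateTime objects.
# Assumes one date per line; blank lines are skipped.
def parse_showing_dates(text)
  text.lines.map(&:strip).reject(&:empty?).map { |line| DateTime.parse(line) }
end

dates = parse_showing_dates("25 Jan 2013 20:00\n26 Jan 2013 17:30\n")
puts dates.map(&:iso8601)
```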
Getting the description is quite straightforward; the only real processing we have to do is remove the [more...] text.
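One way to do this clean-up, sketched here with String#gsub and a plain string pattern (which is matched literally, so the brackets need no escaping). The clean_description helper and the sample text are hypothetical, and the original script may have used a different method.

```ruby
# Remove the "[more...]" link text and any surrounding whitespace.
def clean_description(text)
  text.gsub('[more...]', '').strip
end

puts clean_description("  A comedy about a cinema. [more...]\n")
```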
With all the bits of the showing that we want in variables, we can now push a hash representing the showing into our array of showings.
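A sketch of this step with made-up values: the key names, the showing_hash helper, and the sample data are all assumptions for illustration, since the exact shape of the hash depends on what you want in the final JSON.

```ruby
# Build the hash for a single showing. The key names are assumptions.
def showing_hash(id, title, tags, dates, description)
  { id: id, title: title, tags: tags, dates: dates, description: description }
end

showings = []

# Inside the loop over showings, each processed showing is pushed on.
showings.push(
  showing_hash(7557, 'Example Film', %w[comedy dvd film],
               ['2013-01-25T20:00:00+00:00'], 'A comedy about a cinema.')
)

puts showings.length
```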
Now that we’ve processed each showing and have an array of showings, we can convert the result to JSON.
The script prints out the JSON-encoded version of the showings, so when running it the output can be redirected to a file or piped into another program for further processing.
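The conversion can be sketched with Ruby’s standard json library. JSON.pretty_generate is used here to get the nicely formatted output; the showings_to_json helper and the sample showing are hypothetical.

```ruby
require 'json'

# Encode an array of showing hashes as indented, human-readable JSON.
def showings_to_json(showings)
  JSON.pretty_generate(showings)
end

showings = [{ id: 7557, title: 'Example Film', tags: %w[comedy dvd film] }]
puts showings_to_json(showings)
```

Because the result goes to stdout, something like ruby scraper.rb > showings.json would save it to a file (the output filename is just an example).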
Putting it all together
With all the pieces in place we can now put the full version of the script together.
If you save the above into a file called scraper.rb and run it with ruby scraper.rb, then you should see the JSON representation of the events printed to stdout. It will look something like the following.
And that’s it! This is just a basic example of scraping. Things get a bit more complicated if the site you’re scraping requires you to log in first. For those cases I recommend looking into Mechanize, which builds on top of Nokogiri.
Hopefully this introduction to scraping has given you some ideas for data that you want to turn into a more structured format using the scraping techniques described above.