I wanted a quick way to run some XPath selectors against a web page today. Nokogiri comes with a command line tool that you can pass a URL, and it will drop you into an IRB session. This allows you to play around with some Ruby code to explore a webpage before scraping it.
This is useful, but I wanted to use it with Pry. It turns out that adding support for Pry is relatively easy, but I couldn’t find any clear top to bottom instructions, so I’ve documented the process below.
First install Nokogiri and Pry:
gem install nokogiri pry
Then add the following code to the Nokogiri CLI's init file:
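A sketch of what that file can look like, assuming the default init file location `~/.nokogirirc` and that the CLI evaluates it before starting its console (the exact hook may vary between Nokogiri versions):

```ruby
# ~/.nokogirirc -- loaded by the `nokogiri` command line tool on startup.
# Starting Pry here (and exiting afterwards) replaces the default IRB session.
require 'pry'
Pry.start
exit
```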
That’s it! Now the nokogiri command line tool will drop you into a Pry REPL instead of IRB. This is perfect for testing your CSS and XPath selectors when you’re writing a scraper.
If you’ve followed along my previous two blog posts, Web Scraping with Ruby and Advanced web scraping with Mechanize then you’ll now have the knowledge needed to write a basic web scraper for getting structured data from the web.
The next logical step is to actually run these scrapers regularly so you can get information that’s constantly up-to-date. This is where the excellent morph.io from the talented folks at OpenAustralia comes into play.
Morph.io bills itself as “A Heroku for Scrapers”. You can choose to either run your scrapers manually, or have them run automatically every day. Then you can use the morph.io API to extract the data for use in your application as JSON or CSV, or you can download an SQLite database containing the scraped data.
Morph.io fills the gap that Scraperwiki Classic left. Morph.io scrapers are hosted on GitHub, which means you can fork them and fix them if they break in the future.
Creating a scraper
We’ll use the code from the Pitchfork Scraper in my previous post to demonstrate how easy it is to get your scraper running on morph.io.
You can sign into morph.io with a GitHub account. Once signed in you can create a scraper. Currently morph.io supports scrapers written in Ruby, PHP, Python or Perl. Choose a language and give your scraper a name; I’m calling mine pitchfork_scraper. Then press the “Create Scraper” button to create a new GitHub repository containing skeleton code for a scraper in your chosen language.
Clone the repository that was created in the previous step, in my case I can use the following:
git clone https://github.com/chrismytton/pitchfork_scraper
The repository will contain a README.md and a skeleton scraper.rb.
Morph.io expects two things from your scraper. First, the scraper repository should contain a scraper.rb file for Ruby scrapers 1. Second, the scraper itself should write to an SQLite database file called data.sqlite. Our scraper currently writes JSON to STDOUT, so we need to make a small change so it writes to a database instead.
First add the code from the previous post into scraper.rb, then change it to use the scraperwiki gem to write to the SQLite database. This uses the ScraperWiki.save_sqlite method to save each review in the database. The first argument is the list of fields that in combination should be considered unique. In this case we’re using the artist and album, since it’s unlikely that an artist would release two albums with the same name.
You’ll need to install the Ruby scraperwiki gem in addition to the other dependencies to run this code locally.
gem install scraperwiki
Then you can run this code on your local machine with the following:
ruby scraper.rb
This will create a new file in the current directory called data.sqlite, which will contain the scraped data.
Running the scraper on morph.io
Now you’ve made the changes to your scraper you can run the code on morph.io. First commit your changes with git commit, then git push them to the scraper’s GitHub repository.
You can then run the scraper and the results should be added to the corresponding sqlite database on morph.io. It should look something like the following:
As you can see, the data is now available to authorized users as JSON or CSV, or you can download the SQLite database and use it locally.
The code for the scraper is available on GitHub. You can see the output from the scraper at morph.io/chrismytton/pitchfork_scraper. Note that you’ll need to sign in with GitHub in order to access and manipulate the data over the API.
This article should give you enough background to start hosting your scrapers on morph.io. In my opinion it’s an awesome service that takes the hassle out of running and maintaining scrapers and leaves you to concentrate on the unique parts of your application.
Go forth and get structured data out of the web!
scraper.php for PHP or scraper.pl for Perl ↩
In my last post I gave a basic introduction to web scraping with Ruby and Nokogiri. At the end of that post I mentioned that for more “advanced” scraping Mechanize was worth looking into.
This post explains how to do some more advanced web scraping using Mechanize, which builds on top of Nokogiri’s excellent HTML processing support.
Scraping Pitchfork reviews
Mechanize provides an out-of-the-box scraping solution that can handle filling in forms, following links and respecting a site’s robots.txt file. Here I’ll show you how it can be used to scrape the latest reviews from Pitchfork 1.
Reviews are spread across multiple pages, so we can’t simply fetch a single page and parse it with Nokogiri. This is where Mechanize can help with its ability to click on links and follow them to other pages.
First we’ll need to install Mechanize and its dependencies from Rubygems.
$ gem install mechanize
With Mechanize installed we can now start writing our scraper. Create a file called scraper.rb and add the following require statements, which specify the dependencies we need for this script. The json library is part of Ruby’s standard library, so there’s no need to install it separately.
Now we can start using Mechanize. The first thing we need to do is create a new instance of Mechanize (agent) and then use it to fetch a remote webpage (page).
Find links to reviews
Now we can use the page object to find links to reviews. Mechanize provides a .links_with method which, as the name suggests, finds links with the given attributes. Here we look for links which match a regular expression.
This returns an array of links, but we only want links to reviews, not pagination. To remove unwanted links we can call .reject on the array of links and reject any which look like pagination links.
For the purposes of demonstration—and so we don’t completely hammer Pitchfork’s servers—we’ll just take the first four review links.
Process each review
We now have a list of Mechanize links which we want to map to the reviews that they link to. Since they’re in an array we can call .map on it and return a hash from each iteration.
The page object has a .search method which delegates to Nokogiri’s .search method. This means that we can use a CSS selector as an argument to .search and it will return an array of matching elements.
Here we first get the review metadata using the CSS selector #main .review-meta .info and then search inside the review_meta element for the various bits of information that we need.
Now that we’ve got an array of review hashes, we can output the reviews in JSON format.
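The output step is a one-liner along these lines (a placeholder array stands in for the scraped reviews so the snippet runs on its own):

```ruby
require 'json'

# `reviews` is the array of hashes built from the review pages
reviews = [{ 'artist' => 'Example Artist', 'album' => 'Example Album' }]
puts JSON.pretty_generate(reviews)
```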
All together now
Here’s the whole script:
Put this code in a file called scraper.rb and run it with the following.
$ ruby scraper.rb
And it should output something like this:
If you want, you can save this JSON to a file by redirecting standard out to a file.
$ ruby scraper.rb > reviews.json
This only scratches the surface of Mechanize. One thing I haven’t even touched on is its ability to fill in and submit forms. If you’re interested in learning more then I recommend you look at the Mechanize guide and Mechanize examples.
A lot of people commented that my previous post should have just used Mechanize from the off. While I agree that Mechanize is a great tool, for simple tasks like the one I presented it seemed to me, at the time, like a bit of overkill.
However on reflection the fact that Mechanize handles fetching the remote webpage and respects robots.txt files makes me think that, even for non-advanced scraping tasks, Mechanize will often be the best tool for the job.
Update Jan 22: Check out the next post in this series: Advanced web scraping with Mechanize.
Scraping the web with Ruby is easier than you might think. Let’s start with a simple example: I want to get a nicely formatted JSON array of objects representing all the showings at my local independent cinema.
First we need a way to download the HTML page that has all the listings on it. Ruby comes with an HTTP client, Net::HTTP, and it also comes with a nice wrapper around it, open-uri 1. So the first thing we do is grab the HTML from the remote server.
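In code that’s just the following (using example.com here rather than the cinema’s real URL; requires network access):

```ruby
require 'open-uri'

# URI.open fetches the page over HTTP; .read gives us the HTML as a string
html = URI.open('http://example.com/').read
```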
Great, so we’ve got the page that we want to scrape; now we need to extract some information from it. The best tool for this job is Nokogiri, so we create a new Nokogiri instance using the HTML we just fetched. Nokogiri is great because it allows us to query the HTML using CSS selectors, which, in my opinion, is much simpler than using XPath.
Ok, now we’ve got a document that we can query for the cinema listings. Each individual listing’s html structure is something like the following.
Processing the html
Each showing has the class .showing, so we can select all the showings on the page and loop over them, processing each one in turn.
Let’s break down the code above and see what each part is doing.
First we get the showing’s unique id, which is helpfully exposed as part of the HTML id attribute in the markup. Using square brackets allows us to access attributes of the element, so using the HTML above as an example the return value of showing['id'] would be "event_7557". We’re only interested in the integer id, so we split the resulting string on the underscore with .split('_'), then take the last element from that array and convert it to an integer with .to_i.
Here we find all the tags for a showing by using the .css method, which returns an array of matching elements. We then map these elements, take the text content of each, and strip any excess whitespace. For the HTML above this would return ["comedy", "dvd", "film"].
The code to get the title is a bit more involved because the title element in the HTML contains some extra spans with a prefix and a suffix. First we get the title element using .at_css, which returns a single matching element. Then we loop over the children of the title element and remove any spans. Finally, with the spans gone, we get the text of the title element and strip out any excess whitespace.
This is the code for getting the date and time of a showing. It’s a bit involved because a showing can be on multiple days, and sometimes there is also pricing information in the same element. We’re mapping the dates that we find through DateTime.parse so that the result is an array of Ruby DateTime objects.
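The mapping itself looks like this (with literal date strings standing in for the scraped text):

```ruby
require 'date'

raw_dates = ['2014-01-25 20:00', '2014-01-26 15:00']  # as extracted from the page
dates = raw_dates.map { |d| DateTime.parse(d) }
```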
Getting the description is quite straightforward; the only real processing we have to do is remove the [more...] text.
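A minimal sketch, using String#sub (my choice of method for illustration):

```ruby
raw = 'A heartwarming tale of a scraper and its author. [more...]'
description = raw.sub('[more...]', '').strip
```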
With all the bits of the showing that we want in variables we can now push a hash representing the showing into our array of showings.
Now that we’ve processed each showing and have an array of showings, we can convert the result to JSON.
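The conversion is a single call (with a placeholder array standing in for the scraped showings):

```ruby
require 'json'

showings = [{ id: 7557, title: 'The Big Film', tags: ['comedy', 'dvd', 'film'] }]
puts showings.to_json
```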
This prints out the JSON-encoded version of the showings; when running the script the output can be redirected to a file, or piped into another program for further processing.
Putting it all together
With all the pieces in place we can now put the full version of the script together.
If you save the above into a file called scraper.rb and run it with ruby scraper.rb then you should see the JSON representation of the events printed to STDOUT. It will look something like the following.
And that’s it! This is just a basic example of scraping. Things get a bit more complicated if the site you’re scraping requires you to log in first; for those instances I recommend looking into Mechanize, which builds on top of Nokogiri.
Hopefully this introduction to scraping has given you some ideas for data that you want to turn into a more structured format using the scraping techniques described above.
My minimal banana bread recipe. This is very much at the bread end of the bread/cake spectrum; it’s not very sweet other than the sweetness from the banana. If you want more sweetness you can add some sugar (soft brown if possible, up to 100g) after the butter. The recipe can be used as a base and added to: for example, after mixing in the flour you could add a handful of walnuts or raisins, some cinnamon, vanilla essence, honey, peanut butter, anything that might go with banana really!
- 2 eggs
- 100g butter
- 2 ripe bananas
- 200g self-raising flour
- Preheat the oven to 180C (160C fan).
- Grease a loaf tin roughly 21x9x7cm and line with baking parchment.
- Put the eggs into a large bowl and whisk them.
- Melt the butter then combine with the eggs and mix thoroughly.
- Mash the bananas and fold them into the butter and egg mixture.
- Add the flour and mix until just combined. If the mixture is too loose then add more flour.
- Spoon into the tin and bake for about 40 minutes until a skewer inserted into the middle comes out clean.
- Cool in the tin for 10 minutes before turning out on to a rack to cool completely.
A couple of weeks ago we brewed our first ever batch of beer. Rather than start with something simple like an extract brew, I decided to create my own all-grain American IPA recipe. The beer is ready to be bottled now, then it will need another 2 weeks in the bottles for conditioning. By mid-January we will finally be able to taste it, hopefully it will at least resemble drinkable beer.