In Ruby, #find_all and #select are different (for Hashes)

In Ruby, Hash#select returns a Hash whereas Hash#find_all returns an Array.

This is because Ruby’s Hash class defines its own #select method, but inherits its #find_all method from the Enumerable module.

# select returns a Hash
{ foo: 1, bar: 2 }.select { |key, value| value.even? }
# => { :bar => 2 }

# find_all returns an Array
{ foo: 1, bar: 2 }.find_all { |key, value| value.even? }
# => [[:bar, 2]]
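
You can confirm this by asking Ruby where each method is defined:

# select is defined on Hash itself
{}.method(:select).owner
# => Hash

# find_all comes from Enumerable
{}.method(:find_all).owner
# => Enumerable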

For more details see this StackOverflow answer.

Using nokogiri with pry

I wanted a quick way to run some XPath selectors against a web page today. Nokogiri comes with a command line tool that you can pass a URL to, and it will drop you into an IRB session. This allows you to play around with some Ruby code to explore a webpage before scraping it.

nokogiri http://example.com

This is useful, but I wanted to use it with Pry. It turns out that adding support for Pry is relatively easy, but I couldn’t find any clear top to bottom instructions, so I’ve documented the process below.

First install Nokogiri and Pry:

gem install nokogiri pry

Then add the following code to ~/.nokogirirc:

require 'pry'
Nokogiri::CLI.console = Pry

That’s it! Now when you use the nokogiri command line tool it will drop you into a Pry REPL. This is perfect for testing your CSS and XPath selectors when you’re writing a scraper.
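
For example, a session might look something like this (the nokogiri tool stores the parsed document in @doc):

nokogiri http://example.com
[1] pry(main)> @doc.at_css('h1').text
=> "Example Domain"
[2] pry(main)> @doc.css('p').size
=> 2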

Web scraping with morph.io

If you’ve followed along with my previous two blog posts, Web Scraping with Ruby and Advanced web scraping with Mechanize, then you’ll now have the knowledge needed to write a basic web scraper for getting structured data from the web.

The next logical step is to actually run these scrapers regularly so you can get information that’s constantly up-to-date. This is where the excellent morph.io from the talented folks at OpenAustralia comes into play.

Morph.io bills itself as “A Heroku for Scrapers”. You can choose to either run your scrapers manually, or have them run automatically for you every day. Then you can use the morph.io API to extract the data for use in your application as JSON or CSV, or you can download a sqlite database containing the scraped data.
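
To sketch what an API call might look like (this is illustrative; the exact URL and your API key are shown on each scraper’s morph.io page):

curl "https://api.morph.io/chrismytton/pitchfork_scraper/data.json?key=YOUR_API_KEY&query=select%20*%20from%20data"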

Morph.io fills the gap that ScraperWiki Classic left. Morph.io scrapers are hosted on GitHub, which means you can fork them and fix them if they break in the future.

Creating a scraper

We’ll use the code from the Pitchfork Scraper in my previous post to demonstrate how easy it is to get your scraper running on morph.io.

You can sign into morph.io with a GitHub account. Once signed in you can then create a scraper. Currently morph.io supports scrapers written in Ruby, PHP, Python or Perl. Choose a language and give your scraper a name (I’m calling mine pitchfork_scraper), then press the “Create Scraper” button to create a new GitHub repository containing skeleton code for a scraper in your chosen language.

Clone the repository that was created in the previous step. In my case I can use the following:

git clone https://github.com/chrismytton/pitchfork_scraper

The repository will contain a README.md and a scraper.rb file.

Morph.io expects two things from your scraper. First, the scraper repository should contain a scraper.rb file for Ruby scrapers 1; second, the scraper itself should write to a sqlite3 database file called data.sqlite. To meet the second requirement we need to make a small change to our scraper so it writes to a database rather than printing JSON to STDOUT.

First add the code from the previous post into scraper.rb, then you can change the code to use the scraperwiki gem to write to the sqlite database.

diff --git a/scraper.rb b/scraper.rb
index 2d2baaa..f8b14d6 100644
--- a/scraper.rb
+++ b/scraper.rb
@@ -1,6 +1,8 @@
 require 'mechanize'
 require 'date'
-require 'json'
+require 'scraperwiki'
+
+ScraperWiki.config = { db: 'data.sqlite', default_table_name: 'data' }

 agent = Mechanize.new
 page = agent.get("http://pitchfork.com/reviews/albums/")
@@ -34,4 +36,6 @@ reviews = review_links.map do |link|
   }
 end

-puts JSON.pretty_generate(reviews)
+reviews.each do |review|
+  ScraperWiki.save_sqlite([:artist, :album], review)
+end

This uses the ScraperWiki.save_sqlite method to save the review in the database. The first argument is the list of fields that in combination should be considered unique. In this case we’re using the artist and album, since it’s unlikely that an artist would release two albums with the same name.
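
A handy consequence of this is that running the scraper repeatedly won’t create duplicate rows; a record with an existing artist and album combination is updated in place. As a minimal sketch:

# First run inserts the row...
ScraperWiki.save_sqlite([:artist, :album], { artist: 'Viet Cong', album: 'Viet Cong', score: 8.5 })

# ...a later run with the same unique keys updates that row rather than adding a new one.
ScraperWiki.save_sqlite([:artist, :album], { artist: 'Viet Cong', album: 'Viet Cong', score: 8.6 })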

You’ll need to install the Ruby scraperwiki gem in addition to the other dependencies to run this code locally.

gem install scraperwiki

Then you can run this code on your local machine with the following:

ruby scraper.rb

This will create a new file in the current directory called data.sqlite which will contain the scraped data.
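
If you want to check what was written, you can query the database with the sqlite3 command line tool (the table is called data, as configured in ScraperWiki.config above):

sqlite3 data.sqlite "SELECT artist, album, score FROM data LIMIT 3;"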

Running the scraper on morph.io

Now that you’ve made the changes to your scraper you can run the code on morph.io. First commit your changes using git, then git push the changes to the scraper’s GitHub repository.

You can then run the scraper and the results should be added to the corresponding sqlite database on morph.io. It should look something like the following:

[Screenshot of morph.io output]

As you can see the data is now available to authorized users as JSON or CSV, or you can download the sqlite database and use that locally.

The code for the scraper is available on GitHub. You can see the output from the scraper at morph.io/chrismytton/pitchfork_scraper. Note that you’ll need to sign in with GitHub in order to access and manipulate the data over the API.

This article should give you enough background to start hosting your scrapers on morph.io. In my opinion it’s an awesome service that takes the hassle out of running and maintaining scrapers and leaves you to concentrate on the unique parts of your application.

Go forth and get structured data out of the web!

  1. Alternatively scraper.py for Python, scraper.php for PHP or scraper.pl for Perl 

Advanced web scraping with Mechanize

In my last post I gave a basic introduction to web scraping with Ruby and Nokogiri. At the end of that post I mentioned that for more “advanced” scraping Mechanize was worth looking into.

This post explains how to do some more advanced web scraping using Mechanize, which builds on top of Nokogiri’s excellent HTML processing support.

Scraping Pitchfork reviews

Mechanize provides an out-of-the-box scraping solution that can handle filling in forms, following links and respecting a site’s robots.txt file. Here I’ll show you how it can be used to scrape the latest reviews from Pitchfork 1.

Reviews are spread across multiple pages, so we can’t simply fetch a single page and parse it with Nokogiri. This is where Mechanize can help with its ability to click on links and follow them to other pages.
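
To give a flavour of that before we dive in (a sketch; assuming the page has a pagination link whose text is “Next”):

# given a Mechanize page object, clicking a link fetches and returns the next page
next_page = page.link_with(text: 'Next').click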

Setup

First we’ll need to install Mechanize and its dependencies from Rubygems.

$ gem install mechanize

With Mechanize installed we can now start writing our scraper. Create a file called scraper.rb and add the following require statements. These specify the dependencies we need for this script. date and json are part of Ruby’s standard library, so there’s no need to install them separately.

require 'mechanize'
require 'date'
require 'json'

Now we can start using Mechanize. The first thing we need to do is create a new instance of Mechanize (agent) and then use it to fetch a remote webpage (page).

agent = Mechanize.new
page = agent.get("http://pitchfork.com/reviews/albums/")

Now we can use the page object to find links to reviews. Mechanize provides a .links_with method which, as the name suggests, finds links with the given attributes. Here we look for links which match a regular expression.

This returns an array of links, but we only want links to reviews, not pagination. To remove unwanted links we can call .reject on the array of links and reject any which look like pagination links.

review_links = page.links_with(href: %r{^/reviews/albums/\w+})

review_links = review_links.reject do |link|
  parent_classes = link.node.parent['class'].split
  parent_classes.any? { |p| %w[next-container page-number].include?(p) }
end

For the purposes of demonstration, and so we don’t completely hammer Pitchfork’s servers, we’ll just take the first four review links.

review_links = review_links[0...4]

Process each review

We now have a list of Mechanize links which we want to map to the reviews that they link to. Since they’re in an array we can call .map on it and return a hash from each iteration.

The Mechanize page object has a .search method which delegates to Nokogiri’s .search method. This means that we can use a CSS selector as an argument to .search and it will return an array of matching elements.

Here we first get the review metadata using the CSS selector #main .review-meta .info and then search inside the review_meta element for the various bits of information that we need.

reviews = review_links.map do |link|
  review = link.click
  review_meta = review.search('#main .review-meta .info')
  artist = review_meta.search('h1')[0].text
  album = review_meta.search('h2')[0].text
  label, year = review_meta.search('h3')[0].text.split(';').map(&:strip)
  reviewer = review_meta.search('h4 address')[0].text
  review_date = Date.parse(review_meta.search('.pub-date')[0].text)
  score = review_meta.search('.score').text.to_f
  {
    artist: artist,
    album: album,
    label: label,
    year: year,
    reviewer: reviewer,
    review_date: review_date,
    score: score
  }
end

Now that we’ve got an array of review hashes, we can output the reviews in JSON format.

puts JSON.pretty_generate(reviews)

All together now

Here’s the whole script:

require 'mechanize'
require 'date'
require 'json'

agent = Mechanize.new
page = agent.get("http://pitchfork.com/reviews/albums/")

review_links = page.links_with(href: %r{^/reviews/albums/\w+})

review_links = review_links.reject do |link|
  parent_classes = link.node.parent['class'].split
  parent_classes.any? { |p| %w[next-container page-number].include?(p) }
end

review_links = review_links[0...4]

reviews = review_links.map do |link|
  review = link.click
  review_meta = review.search('#main .review-meta .info')
  artist = review_meta.search('h1')[0].text
  album = review_meta.search('h2')[0].text
  label, year = review_meta.search('h3')[0].text.split(';').map(&:strip)
  reviewer = review_meta.search('h4 address')[0].text
  review_date = Date.parse(review_meta.search('.pub-date')[0].text)
  score = review_meta.search('.score').text.to_f
  {
    artist: artist,
    album: album,
    label: label,
    year: year,
    reviewer: reviewer,
    review_date: review_date,
    score: score
  }
end

puts JSON.pretty_generate(reviews)

Put this code in a file called scraper.rb and run it with the following.

$ ruby scraper.rb

And it should output something like this:

[
  {
    "artist": "Viet Cong",
    "album": "Viet Cong",
    "label": "Jagjaguwar",
    "year": "2015",
    "reviewer": "Ian Cohen",
    "review_date": "2015-01-22",
    "score": 8.5
  },
  {
    "artist": "Lupe Fiasco",
    "album": "Tetsuo & Youth",
    "label": "Atlantic / 1st and 15th",
    "year": "2015",
    "reviewer": "Jayson Greene",
    "review_date": "2015-01-22",
    "score": 7.2
  },
  {
    "artist": "The Go-Betweens",
    "album": "G Stands for Go-Betweens: Volume 1, 1978-1984",
    "label": "Domino",
    "year": "2015",
    "reviewer": "Douglas Wolk",
    "review_date": "2015-01-22",
    "score": 8.2
  },
  {
    "artist": "The Sidekicks",
    "album": "Runners in the Nerved World",
    "label": "Epitaph",
    "year": "2015",
    "reviewer": "Ian Cohen",
    "review_date": "2015-01-22",
    "score": 7.4
  }
]

If you want, you can save this JSON to a file by redirecting standard out to a file.

$ ruby scraper.rb > reviews.json

Conclusion

This only scratches the surface of Mechanize. One thing I haven’t even touched on is its ability to fill in and submit forms. If you’re interested in learning more then I recommend you look at the Mechanize guide and Mechanize examples.
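
To give a taste of the form support, here’s a minimal sketch. The URL, form and field names are made up for illustration, so adapt them to the form you’re actually targeting:

require 'mechanize'

agent = Mechanize.new
page = agent.get('https://example.com/login') # hypothetical login page

form = page.forms.first      # or locate it with page.form_with(action: '/sessions')
form['username'] = 'alice'   # field names are assumptions about the form's markup
form['password'] = 'secret'

result = form.submit         # submits the form and returns the resulting page
puts result.title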

A lot of people commented that my previous post should have just used Mechanize from the off. While I agree that Mechanize is a great tool, for simple tasks like the one I presented it seemed to me at the time like a bit of overkill.

However on reflection the fact that Mechanize handles fetching the remote webpage and respects robots.txt files makes me think that, even for non-advanced scraping tasks, Mechanize will often be the best tool for the job.

  1. You should always scrape responsibly. Check out the Is scraping legal? blog post from ScraperWiki for more discussion on the subject. 

Web Scraping with Ruby

Update Jan 22: Check out the next post in this series: Advanced web scraping with Mechanize.

Scraping the web with Ruby is easier than you might think. Let’s start with a simple example: I want to get a nicely formatted JSON array of objects representing all the showings for my local independent cinema.

First we need a way to download the html page that has all the listings on it. Ruby comes with an http client, Net::HTTP, and it also comes with a nice wrapper around it, open-uri 1. So the first thing we do is grab the html from the remote server.

require 'open-uri'

url = 'http://www.cubecinema.com/programme'
html = open(url) # on Ruby 2.7+ use URI.open(url) instead

Great, so we’ve got the page that we want to scrape, now we need to extract some information from it. The best tool for this job is Nokogiri. So we create a new Nokogiri instance using the html we just scraped.

require 'nokogiri'

doc = Nokogiri::HTML(html)

Nokogiri is great because it allows us to query the html using CSS selectors, which, in my opinion, is much simpler than using XPath.
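
For example, these two queries find the same links (the verbosity of the class check is part of the reason CSS selectors are nicer here):

# CSS selector: short and readable
doc.css('.showing h1 a')

# A roughly equivalent XPath expression
doc.xpath('//*[contains(concat(" ", normalize-space(@class), " "), " showing ")]//h1//a')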

Ok, now we’ve got a document that we can query for the cinema listings. Each individual listing’s html structure is something like the following.

<div class="showing" id="event_7557">
  <a href="/programme/event/live-stand-up-monty-python-and-the-holy-grail,7557/">
    <img src="/media/diary/thumbnails/montypython2_1.png.500x300_q85_background-%23FFFFFF_crop-smart.jpg" alt="Picture for event Live stand up + Monty Python and the Holy Grail">
  </a>
  <span class="tags"><a href="/programme/view/comedy/" class="tag_comedy">comedy</a> <a href="/programme/view/dvd/" class="tag_dvd">dvd</a> <a href="/programme/view/film/" class="tag_film">film</a> </span>
  <h1>
    <a href="/programme/event/live-stand-up-monty-python-and-the-holy-grail,7557/">
      <span class="pre_title">Comedy Combo presents</span>
      Live stand up + Monty Python and the Holy Grail
      <span class="post_title">Rare screening from 35mm!</span>
    </a>
  </h1>
  <div class="event_details">
    <p class="start_and_pricing">
      Sat 20 December | 19:30
      <br>
    </p>
    <p class="copy">Brave (and not so brave) Knights of the Round Table! Gain shelter from the vicious chicken of Bristol as we gather to bear witness to this 100% factually accurate retelling ... [<a class="more" href="/programme/event/live-stand-up-monty-python-and-the-holy-grail,7557/">more...</a>]</p>
  </div>
</div>

Processing the html

Each showing has the class .showing, so we can select all the showings on the page and loop over them, processing each one in turn.

showings = []
doc.css('.showing').each do |showing|
  showing_id = showing['id'].split('_').last.to_i
  tags = showing.css('.tags a').map { |tag| tag.text.strip }
  title_el = showing.at_css('h1 a')
  title_el.children.each { |c| c.remove if c.name == 'span' }
  title = title_el.text.strip
  dates = showing.at_css('.start_and_pricing').inner_html.strip
  dates = dates.split('<br>').map(&:strip).map { |d| DateTime.parse(d) }
  description = showing.at_css('.copy').text.gsub('[more...]', '').strip
  showings.push(
    id: showing_id,
    title: title,
    tags: tags,
    dates: dates,
    description: description
  )
end

Let’s break down the code above and see what each part is doing.

showing_id = showing['id'].split('_').last.to_i

First we get the showing’s unique id, which is helpfully exposed as part of the html id attribute in the markup. Using square brackets allows us to access attributes of the element, so using the html above as an example the return value of showing['id'] would be "event_7557". We’re only interested in the integer id, so we split the resulting string on the underscore, .split('_') and then take the last element from that array and convert it to an integer, .last.to_i.

tags = showing.css('.tags a').map { |tag| tag.text.strip }

Here we find all the tags for a showing by using the .css method, which returns an array of matching elements. We then map these elements and take the text content and strip it of any excess whitespace. For the html above this would return ["comedy", "dvd", "film"].

title_el = showing.at_css('h1 a')
title_el.children.each { |c| c.remove if c.name == 'span' }
title = title_el.text.strip

The code to get the title is a bit more involved because the title element in the html contains some extra spans with a prefix and a suffix. First we get the title element using .at_css, which returns a single matching element. Then we loop over the children of the title element and remove any spans. Finally, with the spans gone, we get the text of the title element and strip out any excess whitespace. For the markup above this leaves “Live stand up + Monty Python and the Holy Grail”.

dates = showing.at_css('.start_and_pricing').inner_html.strip
dates = dates.split('<br>').map(&:strip).map { |d| DateTime.parse(d) }

This is the code for getting the date and time of a showing. It’s a bit involved because a showing can be on multiple days, and sometimes there is also pricing information in the same element. We split the element’s inner html on its <br> tags, then map each resulting date string through DateTime.parse so that the result is an array of Ruby DateTime objects.

description = showing.at_css('.copy').text.gsub('[more...]', '').strip

Getting the description is quite straightforward; the only real processing we have to do is remove the [more...] text using .gsub.

showings.push(
    id: showing_id,
    title: title,
    tags: tags,
    dates: dates,
    description: description
  )

With all the bits of the showing that we want in variables we can now push a hash representing the showing into our array of showings.

Output JSON

Now that we’ve processed each showing and have an array of showings, we can convert the result to JSON.

require 'json'

puts JSON.pretty_generate(showings)

This prints out the JSON encoded version of the showings. When running the script the output can be redirected to a file, or piped into another program for further processing.
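
For example, to save the listings to a file:

ruby scraper.rb > showings.json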

Putting it all together

With all the pieces in place we can now put the full version of the script together.

require 'open-uri'
require 'nokogiri'
require 'json'
require 'date' # needed for DateTime.parse below

url = 'http://www.cubecinema.com/programme'
html = open(url) # on Ruby 2.7+ use URI.open(url) instead

doc = Nokogiri::HTML(html)
showings = []
doc.css('.showing').each do |showing|
  showing_id = showing['id'].split('_').last.to_i
  tags = showing.css('.tags a').map { |tag| tag.text.strip }
  title_el = showing.at_css('h1 a')
  title_el.children.each { |c| c.remove if c.name == 'span' }
  title = title_el.text.strip
  dates = showing.at_css('.start_and_pricing').inner_html.strip
  dates = dates.split('<br>').map(&:strip).map { |d| DateTime.parse(d) }
  description = showing.at_css('.copy').text.gsub('[more...]', '').strip
  showings.push(
    id: showing_id,
    title: title,
    tags: tags,
    dates: dates,
    description: description
  )
end

puts JSON.pretty_generate(showings)

If you save the above into a file called scraper.rb and run it with ruby scraper.rb then you should see the JSON representation of the events printed to stdout. It will look something like the following.

[
  {
    "id": 7686,
    "title": "Harry Dean Stanton - Partly Fiction",
    "tags": [
      "dcp",
      "film",
      "ttt"
    ],
    "dates": [
      "2015-01-19T20:00:00+00:00",
      "2015-01-20T20:00:00+00:00"
    ],
    "description": "A mesmerizing, impressionistic portrait of the iconic actor in his intimate moments, with film clips from some of his 250 films and his own heart-breaking renditions of American folk songs. ..."
  },
  {
    "id": 7519,
    "title": "Bang the Bore Audiovisual Spectacle: VA AA LR + Stephen Cornford + Seth Cooke",
    "tags": [
      "music"
    ],
    "dates": [
      "2015-01-21T20:00:00+00:00"
    ],
    "description": "An evening of hacked TVs, 4 screen cinematic drone and electroacoustics. VAAALR: Vasco Alves, Adam Asnan and Louie Rice create spectacles using distress flares, C02 and junk electronics. Stephen Cornford: ..."
  }
]

And that’s it! This is just a basic example of scraping. Things get a bit more complicated if the site you’re scraping requires you to log in first; for those instances I recommend looking into mechanize, which builds on top of Nokogiri.

Hopefully this introduction to scraping has given you some ideas for data that you want to turn into a more structured format using the scraping techniques described above.

  1. While good for basic tasks like this, open-uri has some issues which mean you may want to look elsewhere for an http client to use in production. 

Banana Bread

My minimal banana bread recipe. This is very much at the bread end of the bread/cake spectrum; it’s not very sweet other than the sweetness from the banana. If you want more sweetness then you can add some sugar, soft brown if possible, up to 100g, added after the butter. The recipe can be used as a base and then added to. For example, after mixing in the flour you could add a handful of walnuts or raisins, some cinnamon, vanilla essence, honey, peanut butter etc. Anything that might go with banana really!

Ingredients

  • 2 eggs
  • 100g butter
  • 2 ripe bananas
  • 200g self-raising flour

Method

  1. Preheat the oven to 180C (160C fan).
  2. Grease a loaf tin roughly 21x9x7cm and line with baking parchment.
  3. Put the eggs into a large bowl and whisk them.
  4. Melt the butter then combine with the eggs and mix thoroughly.
  5. Mash the bananas and fold them into the butter and egg mixture.
  6. Add the flour and mix until just combined. If the mixture is too loose then add more flour.
  7. Spoon into the tin and bake for about 40 minutes until a skewer inserted into the middle comes out clean.
  8. Cool in the tin for 10 minutes before turning out on to a rack to cool completely.

Beer

A couple of weeks ago we brewed our first ever batch of beer. Rather than start with something simple like an extract brew, I decided to create my own all-grain American IPA recipe. The beer is ready to be bottled now, then it will need another 2 weeks in the bottles for conditioning. By mid-January we will finally be able to taste it, hopefully it will at least resemble drinkable beer.
