Rails Notes (6): Scraping with Nokogiri

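Bootswatch improved the look of the app, but there is still very little data in it, so from here on I work on adding more.
Adding records one at a time in rails console is not practical, so I scrape Wikipedia pages with Nokogiri, write the results to a CSV file, and load them in bulk.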

Nokogiri

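Nokogiri seems to be the usual choice for scraping web pages in Ruby.
I want to scrape ranking information from a Wikipedia page, but fetching the page on every run is slow, so I first download it locally with the following command.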

$ wget https://en.wikipedia.org/wiki/Billboard_Year-End_Hot_100_singles_of_2014 -O 2014.html

To check that Nokogiri works, I start by extracting the page title.

require 'nokogiri'

# Open the saved page and parse it; the block form closes the file automatically
doc = File.open("2014.html") { |f| Nokogiri::HTML(f) }

puts doc.title

$ ruby nokogiri_test.rb
Billboard Year-End Hot 100 singles of 2014 - Wikipedia, the free encyclopedia

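That confirms Nokogiri works, so next I look at how to pull out the ranking information.
To understand the HTML structure, I right-click in Chrome and choose "Inspect Element" to open the developer tools.
After clicking the magnifying-glass icon and selecting "Happy" at No. 1, the HTML structure appears at the bottom of the screen. The table I want appears to be table.wikitable.sortable, so I try extracting it.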

Nokogiri supports both XPath and CSS selectors. CSS selectors are reportedly simpler to write, so I use them here.

require 'nokogiri'

# Open the saved page and parse it; the block form closes the file automatically
doc = File.open("2014.html") { |f| Nokogiri::HTML(f) }

doc.css("table.wikitable.sortable tr").each do |node|
  puts node
end

$ ruby nokogiri_test.rb
<tr>
<th scope="col" style="background:#dde;">
<center>№</center>
</th>
<th scope="col" style="background:#dde;">Title</th>
<th scope="col" style="background:#dde;">Artist(s)</th>
</tr>
<tr>
<th scope="row">1</th>
<td>"<a href="/wiki/Happy_(Pharrell_Williams_song)" title="Happy (Pharrell Williams song)">Happy</a>"</td>
<td><a href="/wiki/Pharrell_Williams" title="Pharrell Williams">Pharrell Williams</a></td>
</tr>
  ...(omitted)...
<tr>
<th scope="row">100</th>
<td>"<a href="/wiki/Adore_You" title="Adore You">Adore You</a>"</td>
<td><a href="/wiki/Miley_Cyrus" title="Miley Cyrus">Miley Cyrus</a></td>
</tr>

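The rows are being extracted as expected, so what remains is to pull out the ranking, title, and artist name from each row and write them to a CSV file. The first tr element is the header row, so it is skipped.
Wikipedia has rankings going back to 1951, but I am not very interested in the older ones, so I extract a round 25 years' worth starting from 1990.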
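The script below expects 1990.html through 2014.html to already exist locally. Here is a rough sketch of fetching them in Ruby instead of running wget 25 times. It assumes the other years' URLs differ from the 2014 URL only in the trailing year, which the article does not confirm. The method names and the User-Agent string are hypothetical.

```ruby
require 'open-uri'

# Assumed URL pattern: only the 2014 URL appears in the article; other years
# are assumed to differ only in the trailing year.
BASE_URL = "https://en.wikipedia.org/wiki/Billboard_Year-End_Hot_100_singles_of_"

# Build the page URL for one year (hypothetical helper)
def page_url(year)
  "#{BASE_URL}#{year}"
end

# Save each year's page as "<dir>/<year>.html", skipping files that already exist
def download_pages(years, dir: ".")
  years.each do |year|
    path = File.join(dir, "#{year}.html")
    next if File.exist?(path)

    # The User-Agent value is a placeholder for illustration
    html = URI.open(page_url(year), "User-Agent" => "rails-memo-scraper-example", &:read)
    File.write(path, html)
    sleep 1 # pause between requests to go easy on the server
  end
end

# Usage (performs network access):
# download_pages(1990..2014)
```

Skipping existing files means the sketch can be rerun after a failure without downloading everything again.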

require 'nokogiri'
require 'csv'

CSV.open("seed_songs.csv", "w") do |csv|
  (1990..2014).each do |year|
    # Parse the locally saved page for this year
    doc = File.open("#{year}.html") { |f| Nokogiri::HTML(f) }

    doc.css("table.wikitable tr").each_with_index do |node, index|
      next if index == 0 # the first row is the header
      ranking = node.css("th").inner_text
      # Strip the double quotes that surround the title
      title   = node.css("td:eq(1)").inner_text.gsub(/^"/, "").gsub(/"$/, "")
      artist  = node.css("td:eq(2)").inner_text
      csv << [title, artist, ranking, year]
    end
  end
end

$ ruby nokogiri_test.rb
$ wc -l seed_songs.csv
2500 seed_songs.csv
$ head -n 5 seed_songs.csv
Hold On,Wilson Phillips,1,1990
It Must Have Been Love,Roxette,2,1990
Nothing Compares 2 U,Sinéad O'Connor,3,1990
Poison,Bell Biv DeVoe,4,1990
Vogue,Madonna,5,1990
$ tail -n 5 seed_songs.csv
Studio,ScHoolboy Q featuring BJ the Chicago Kid,96,2014
0 to 100 / The Catch Up,Drake,97,2014
I Don't Dance,Lee Brice,98,2014
Somethin' Bad,Miranda Lambert and Carrie Underwood,99,2014
Adore You,Miley Cyrus,100,2014

All 2,500 rows were extracted as expected, so in the next step I add this data to the DB.
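A total line count does not by itself show that every year contributed exactly 100 rows. Here is a small sketch of a per-year check, assuming the column order title, artist, ranking, year written by the script above. The method names are hypothetical.

```ruby
require 'csv'

# Count rows per year; assumes the year is the fourth column
def rows_per_year(path)
  counts = Hash.new(0)
  CSV.foreach(path) { |row| counts[row[3]] += 1 }
  counts
end

# Return only the years whose row count differs from the expected number
def incomplete_years(path, expected: 100)
  rows_per_year(path).reject { |_year, count| count == expected }
end

# Usage:
# p incomplete_years("seed_songs.csv")
```

An empty hash from incomplete_years would mean that every year in the file has the expected number of rows.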