How do I parse an HTML table with Nokogiri?

Disclaimer: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me), citing the original URL and author information. Original Stack Overflow question: http://stackoverflow.com/questions/2062051/

Date: 2020-08-29 01:50:40  Source: igfitidea

Tags: html, ruby, nokogiri, mechanize, html-table

Asked by Radek

I installed Ruby and Mechanize. It seems to me that Nokogiri can do what I want, but I do not know how to go about it.

What about this table? It is just part of the HTML of a vBulletin forum site. I tried to keep the HTML structure but delete some text and tag attributes. I want to get some details per thread like: Title, Author, Date, Time, Replies, and Views.

Please note that there are a few tables in the HTML document. I am after one particular table, the one containing <tbody id="threadbits_forum_251">. The id will always be the same (I hope). Can I use the tbody and its id in the code?

<table >
  <tbody>
    <tr>  <!-- table header --> </tr>
  </tbody>
  <!-- show threads -->
  <tbody id="threadbits_forum_251">
    <tr>
      <td></td>
      <td></td>
      <td>
        <div>
          <a href="showthread.php?t=230708" >Vb4 Gold Released</a>
        </div>
        <div>
          <span><a>Paul M</a></span>
        </div>
      </td>
      <td>
        06 Jan 2010 <span class="time">23:35</span><br />
        by <a href="member.php?find=lastposter&amp;t=230708">shane943</a>
      </td>
      <td><a href="#">24</a></td>
      <td>1,320</td>
    </tr>

  </tbody>
</table>

Answered by Wayne Conrad

#!/usr/bin/ruby1.8

require 'nokogiri'
require 'pp'

html = <<-EOS
  (The HTML from the question goes here)
EOS

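doc = Nokogiri::HTML(html)

# Select only the rows of the tbody with the wanted id,
# so the other tables in the document are ignored.
rows = doc.xpath('//table/tbody[@id="threadbits_forum_251"]/tr')

# Build one hash per thread row. Each XPath is relative to the row.
details = rows.collect do |row|
  detail = {}
  [
    [:title, 'td[3]/div[1]/a/text()'],
    [:name, 'td[3]/div[2]/span/a/text()'],
    [:date, 'td[4]/text()'],
    [:time, 'td[4]/span/text()'],
    [:number, 'td[5]/a/text()'],
    [:views, 'td[6]/text()'],
  ].each do |name, xpath|
    # at_xpath returns the first match or nil; nil.to_s is "",
    # so a missing cell yields an empty string instead of an error.
    detail[name] = row.at_xpath(xpath).to_s.strip
  end
  detail
end
pp details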

# => [{:time=>"23:35",
# =>   :title=>"Vb4 Gold Released",
# =>   :number=>"24",
# =>   :date=>"06 Jan 2010",
# =>   :views=>"1,320",
# =>   :name=>"Paul M"}]