bash: How can I extract data from HTML table cells using sed, awk, or grep?

Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/19392114/

How can I extract data from HTML table cells using sed, awk, or grep?

Tags: html, regex, bash, sed, awk

Asked by Corey Stadnyk

I have a cURL Bash script that goes to a website and posts data, then returns that to a text file. The text file comes back all in HTML and I can't figure out how to extract the information I need from it. Here is the HTML from Info.txt:

<table cellspacing="1" cellpadding="0" border="0">
<tr><td><img src="/themes/img/status/green.gif" width="12" height="12" border="0"/></td><td><font class="small"><i>October 15, 2013 @ 1:34pm (PST)</i></font></td></tr>
<tr><td><font class="small">MF:&nbsp;&nbsp;</font></td><td><font class="small">PSVBHP9001230079779201</font></td></tr>
<tr><td><font class="small">SN:&nbsp;&nbsp;</font></td><td><font class="small">1354716309166</font></td></tr>
<tr><td><font class="small">ID:&nbsp;&nbsp;</font></td><td><font class="small">800.10</font></td></tr>
</table>

I need to extract these 3 values:

  • PSVBHP9001230079779201
  • 1354716309166
  • 800.10

I have tried this using grep, but have not had much success. I can't seem to figure out how to extract just the values I want. I have tried multiple sed and awk commands as well, but the closest I have come is with this grep command:

$ grep -o '[^ ]*.PSV[^ ]*' Info.txt
<tr><td><font>PSVBHP9001230079779201</font></td></tr>

Answered by Todd A. Jacobs

Parse HTML, Don't Grep It

Sometimes you can get away with grepping HTML if:

  1. you know the input format will remain consistent, and
  2. your data is very regular.

Your corpus doesn't seem to fit these criteria, so use an HTML or XML parser instead for best results.

Use Nokogiri

Ruby's Nokogiri gem and XPath selectors make quick work of this. For example:

require 'nokogiri'
doc = Nokogiri::HTML(File.read '/tmp/info.txt')
doc.xpath('//td[2]').map(&:content).reject { |e| e.include? ':' }
#=> ["PSVBHP9001230079779201", "1354716309166", "800.10"]

This will select the second cell from each row and discard any results with a colon. If you aren't sure that the field you want will always be in the second cell, then your corpus will also match properly with this alternative:

doc.xpath('//td').map(&:content).reject { |e| e.empty? or e.include? ':' }
#=> ["PSVBHP9001230079779201", "1354716309166", "800.10"]

You can certainly adjust the selectors to match any changes to your corpus, or store the results in a variable so you can refine the results after the parser returns candidate fields. The sky's the limit, but this should be enough to get you started.

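Because the surrounding script is Bash, the same query can also be run as a shell one-liner. This is only a sketch, assuming the nokogiri gem is installed and the HTML is saved to /tmp/info.txt as above:

# assumes the nokogiri gem is installed: gem install nokogiri
$ ruby -rnokogiri -e 'puts Nokogiri::HTML(File.read "/tmp/info.txt").xpath("//td[2]").map(&:content).reject { |e| e.include? ":" }'
PSVBHP9001230079779201
1354716309166
800.10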

Answered by Ed Morton

$ awk -F'[<>]' '/<tr><td><font/{print $15}' file
PSVBHP9001230079779201
1354716309166
800.10
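
For context, the -F'[<>]' option makes awk start a new field at every < or >, so on the matching rows the label (MF:, SN:, ID:) lands in field 7 and the value in field 15. A quick way to inspect that field layout for the first matching row, assuming the same Info.txt, is:

$ awk -F'[<>]' '/<tr><td><font/{for (i=1; i<=NF; i++) printf "%2d: %s\n", i, $i; exit}' Info.txt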

Answered by Todd A. Jacobs

Use the XML2 Suite

While parsing HTML is the canonically-correct solution, you certainly have other options. One of those options is to convert the HTML into a flat format that can be filtered or split with the tools of your choice. PYX notation and the intuitive but undocumented format used by the xml2 tools are two ways to represent an HTML document in a line-oriented format. For this use case, I recommend the latter.

An Example of Flattened HTML

Given your posted corpus, the following will work with the html2 utility from the xml2 package:

$ html2 < /tmp/info.txt | fgrep /td/ | egrep -v '[:@]' | cut -d= -f2
PSVBHP9001230079779201
1354716309166
800.10

This works by:

  1. transforming the HTML into a line-oriented representation (sketched below),
  2. selecting table cells with a fixed-string grep,
  3. removing attributes and lines containing a colon with an extended regular expression, and
  4. selecting the node value with cut.
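
For reference, the line-oriented representation produced in step 1 looks roughly like this for the posted corpus (abridged, and the exact element paths depend on how the HTML parser wraps the fragment, so treat them as illustrative):

$ html2 < /tmp/info.txt
/html/body/table/@cellspacing=1
/html/body/table/tr/td/img/@src=/themes/img/status/green.gif
/html/body/table/tr/td/font/@class=small
/html/body/table/tr/td/font/i=October 15, 2013 @ 1:34pm (PST)
/html/body/table/tr/td/font=MF:
/html/body/table/tr/td/font=PSVBHP9001230079779201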

Flattening HTML is obviously a bit of a hack, and the recipe may require additional filtering to fit your real corpus. On the other hand, it works well from the command line and doesn't require any deep knowledge of the document type definition, document object model, or XPath. It also leverages your knowledge of core utilities like sed, grep, awk, cut, and so on.

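For instance, the grep/cut chain could be replaced with a single awk filter over the flattened output. This is only a sketch under the same assumptions about html2's output as above:

# assumes html2 emits one path=value line per node, as sketched earlier
$ html2 < /tmp/info.txt | awk -F= '$1 ~ /td\/font$/ && $2 !~ /[:@]/ { print $2 }'
PSVBHP9001230079779201
1354716309166
800.10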

Your mileage may vary.
