Original question: http://stackoverflow.com/questions/6854586/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors on StackOverflow (not to the maintainer of this page).
How to extract data from html table in shell script?
Asked by Marko
I am trying to create a Bash script that extracts the data from an HTML table. Below is an example of the table I need to extract data from:
<table border=1>
<tr>
<td><b>Component</b></td>
<td><b>Status</b></td>
<td><b>Time / Error</b></td>
</tr>
<tr><td>SAVE_DOCUMENT</td><td>OK</td><td>0.406 s</td></tr>
<tr><td>GET_DOCUMENT</td><td>OK</td><td>0.332 s</td></tr>
<tr><td>DVK_SEND</td><td>OK</td><td>0.001 s</td></tr>
<tr><td>DVK_RECEIVE</td><td>OK</td><td>0.001 s</td></tr>
<tr><td>GET_USER_INFO</td><td>OK</td><td>0.143 s</td></tr>
<tr><td>NOTIFICATIONS</td><td>OK</td><td>0.001 s</td></tr>
<tr><td>ERROR_LOG</td><td>OK</td><td>0.001 s</td></tr>
<tr><td>SUMMARY_STATUS</td><td>OK</td><td>0.888 s</td></tr>
</table>
And I want the BASH script to output it like so:
SAVE_DOCUMENT OK 0.475 s
GET_DOCUMENT OK 0.345 s
DVK_SEND OK 0.002 s
DVK_RECEIVE OK 0.001 s
GET_USER_INFO OK 4.465 s
NOTIFICATIONS OK 0.001 s
ERROR_LOG OK 0.002 s
SUMMARY_STATUS OK 5.294 s
How to do it?
So far I have tried using sed, but I don't know how to use it very well. I excluded the table header (Component, Status, Time / Error) with grep, using grep "<tr><td>", so only lines starting with <tr><td> are selected for the next parsing step (sed).
This is what I used: sed 's@<\([^<>][^<>]*\)>\([^<>]*\)</\1>@\2@g'. But then the <tr> tags still remain, and it also won't separate the strings. In other words, the result of this script is:
<tr>SAVE_DOCUMENTOK0.406 s</tr>
The full command of the script I'm working on is:
cat $FILENAME | grep "<tr><td>" | sed 's@<\([^<>][^<>]*\)>\([^<>]*\)</\1>@\2@g'
Answered by Zsolt Botykai
Go with (g)awk, it's capable :-). Here is a solution, but please note: it only works with the exact HTML table format you posted.
awk -F "</*td>|</*tr>" '/<\/*t[rd]>.*[A-Z][A-Z]/ {print $3, $5, $7}' FILE
Here you can see it in action: https://ideone.com/zGfLe
Some explanation:
-F sets the input field separator to a regexp (any opening or closing tr or td tag),
then the pattern limits processing to lines that match those tags AND contain at least two consecutive uppercase letters,
then the action prints the needed fields.
HTH
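To see which fields hold the cell text, it can help to dump every field that the separator regexp produces for one sample row. The sketch below is only an illustration: the helper name show_fields is made up here, and the row is copied from the question.

```shell
#!/bin/sh
# show_fields is a hypothetical helper, not part of the original answer.
# It prints each field awk sees, numbered, using the answer's separator regexp.
show_fields() {
  awk -F '</*td>|</*tr>' '{ for (i = 1; i <= NF; i++) printf "%d=[%s]\n", i, $i }'
}

# One data row copied from the question's table.
row='<tr><td>SAVE_DOCUMENT</td><td>OK</td><td>0.406 s</td></tr>'
printf '%s\n' "$row" | show_fields
```

Because every tag acts as a separator, the three text cells land in fields 3, 5, and 7, with empty fields in between wherever two tags are adjacent.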
Answered by Emiliano Poggi
You can use xpath from bash (XML::XPath Perl module) to accomplish that task very easily:
xpath -e '//tr[position()>1]' test_input1.xml 2> /dev/null | sed -e 's/<\/*tr>//g' -e 's/<td>//g' -e 's/<\/td>/ /g'
Answered by kenorb
You may use the html2text command and format the columns via column, e.g.:
$ html2text table.html | column -ts'|'
Component Status Time / Error
SAVE_DOCUMENT OK 0.406 s
GET_DOCUMENT OK 0.332 s
DVK_SEND OK 0.001 s
DVK_RECEIVE OK 0.001 s
GET_USER_INFO OK 0.143 s
NOTIFICATIONS OK 0.001 s
ERROR_LOG OK 0.001 s
SUMMARY_STATUS OK 0.888 s
then parse it further from there (e.g. with cut, awk, or ex).
In case you'd like to sort it first, you can use ex; see the example here or here.
Answered by mu is too short
There are a lot of ways of doing this but here's one:
grep '^<tr><td>' < $FILENAME \
| sed \
-e 's:<tr>::g' \
-e 's:</tr>::g' \
-e 's:</td>::g' \
-e 's:<td>: :g' \
| cut -c2-
You could use more sed(1) (-e 's:^ ::') instead of the cut -c2- to remove the leading space, but cut(1) doesn't get as much love as it deserves. The backslashes are just there for formatting; you can remove them to get a one-liner, or leave them in and make sure that each one is immediately followed by a newline.
The basic strategy is to slowly pull the HTML apart piece by piece rather than trying to do it all at once with a single incomprehensible pile of regex syntax.
Parsing HTML with a shell pipeline isn't the best idea ever, but you can do it if the HTML is known to come in a very specific format. If there will be variation, then you'd be better off with a real HTML parser in Perl, Ruby, Python, or even C.
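As a sketch of the piece-by-piece strategy, the pipeline can be wrapped in a function and tried on a small inline sample. The function name table_rows and the shortened sample are chosen here for illustration, and this variant uses the extra sed expression mentioned above instead of cut -c2-. It still assumes that each data row sits on a single line starting with <tr><td>.

```shell
#!/bin/sh
# table_rows is a hypothetical name; it reads HTML on stdin.
table_rows() {
  # 1. keep only the single-line data rows
  # 2. drop <tr>, </tr>, and </td>
  # 3. turn each <td> into a space, then strip the leading space
  grep '^<tr><td>' |
    sed -e 's:<tr>::g' -e 's:</tr>::g' -e 's:</td>::g' \
        -e 's:<td>: :g' -e 's:^ ::'
}

# Shortened sample in the same layout as the question's table.
table_rows <<'EOF'
<table border=1>
<tr>
<td><b>Component</b></td>
<td><b>Status</b></td>
</tr>
<tr><td>SAVE_DOCUMENT</td><td>OK</td><td>0.406 s</td></tr>
<tr><td>GET_DOCUMENT</td><td>OK</td><td>0.332 s</td></tr>
</table>
EOF
```

Keeping each substitution as its own -e expression makes it easy to comment one out and inspect the intermediate text when the input format changes.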
Answered by mklement0
A solution based on the multi-platform web-scraping CLI xidel and XQuery:
xidel -s --xquery 'for $tr in //tr[position()>1] return join($tr/td, " ")' file
With the sample input, this yields:
SAVE_DOCUMENT OK 0.406 s
GET_DOCUMENT OK 0.332 s
DVK_SEND OK 0.001 s
DVK_RECEIVE OK 0.001 s
GET_USER_INFO OK 0.143 s
NOTIFICATIONS OK 0.001 s
ERROR_LOG OK 0.001 s
SUMMARY_STATUS OK 0.888 s
Explanation:
The XQuery query for $tr in //tr[position()>1] return join($tr/td, " ") processes the tr elements starting with the 2nd one (position()>1, to skip the header row) in a loop, and joins the values of the child td elements ($tr/td) with a single space as the separator.
-s makes xidel silent (suppresses output of status information).
While html2text is convenient for display of the extracted data, providing machine-parseable output is unfortunately non-trivial:
html2text file | awk -F' *\|' 'NR>2 {gsub(/^\||.\b/, ""); $1=$1; print}'
The Awk command removes the hidden \b-based (backspace-based) sequences that html2text outputs by default, parses the lines into fields by |, and then outputs them with a space as the separator (a space is Awk's default output field separator; to change it to a tab, for instance, use -v OFS='\t').
Note: Using -nobs to suppress backspace sequences at the source is not an option, because you then won't be able to distinguish between the hidden-by-default _ instances used for padding and actual _ characters in the data.
Note: Given that html2text seemingly invariably uses | as the column separator, the above will only work robustly if there are no | instances in the data being extracted.
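How Awk re-joins fields with a different output separator can be tried in isolation. The sketch below feeds awk a hand-written, pipe-separated stand-in (an assumption for illustration, not real html2text output, and without any backspace sequences) and assigns $1 to itself so that awk rebuilds the record with OFS. The function name pipe_to_tsv is made up here.

```shell
#!/bin/sh
# pipe_to_tsv is a hypothetical helper; it reads pipe-separated lines on stdin.
pipe_to_tsv() {
  # [|] matches a literal pipe; the surrounding spaces are part of the separator.
  # Assigning $1 to itself forces awk to rebuild the record using OFS (a tab).
  awk -F ' *[|] *' -v OFS='\t' '{ $1 = $1; print }'
}

# Hand-written stand-in rows, not actual html2text output.
printf '%s\n' 'SAVE_DOCUMENT | OK | 0.406 s' 'GET_DOCUMENT | OK | 0.332 s' | pipe_to_tsv
```

A tab separator keeps the space inside values such as "0.406 s" from being mistaken for a column boundary by later tools.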
Answered by kenorb
You can parse the file using the Ex editor (part of Vim) by removing HTML tags, e.g.:
$ ex -s +'%s/<[^>]\+>/ /g' +'v/0/d' +'wq! /dev/stdout' table.html
SAVE_DOCUMENT OK 0.406 s
GET_DOCUMENT OK 0.332 s
DVK_SEND OK 0.001 s
DVK_RECEIVE OK 0.001 s
GET_USER_INFO OK 0.143 s
NOTIFICATIONS OK 0.001 s
ERROR_LOG OK 0.001 s
SUMMARY_STATUS OK 0.888 s
Here is a shorter version that prints the whole file without HTML tags:
$ ex +'%s/<[^>]\+>/ /g|%p' -scq! table.html
Explanation:
%s/<[^>]\+>/ /g - Substitutes a space for every HTML tag.
v/0/d - Deletes all lines without 0.
wq! /dev/stdout - Writes the buffer to the standard output and quits the editor.
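For comparison, the same two editing steps can be sketched with sed instead of ex. This is a swapped-in tool, not part of the original answer. The function name strip_tags is hypothetical, and the whitespace clean-up at the end is an addition so that the columns end up separated by single spaces.

```shell
#!/bin/sh
# strip_tags is a hypothetical helper; it reads HTML on stdin.
strip_tags() {
  # s/<[^>][^>]*>/ /g  replaces every HTML tag with a space (like %s/<[^>]\+>/ /g)
  # /0/!d              deletes lines that do not contain 0   (like v/0/d)
  # the last three expressions squeeze runs of spaces and trim both ends
  sed -e 's/<[^>][^>]*>/ /g' -e '/0/!d' \
      -e 's/  */ /g' -e 's/^ //' -e 's/ $//'
}

# Shortened sample in the same layout as the question's table.
strip_tags <<'EOF'
<table border=1>
<tr>
<td><b>Time / Error</b></td>
</tr>
<tr><td>ERROR_LOG</td><td>OK</td><td>0.001 s</td></tr>
</table>
EOF
```

Like the ex version, this keeps any line that happens to contain a 0, so it relies on every data row having a 0 somewhere in it and on no other line having one.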

