Parse HTML using shell
Note: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me): Stack Overflow
Original: http://stackoverflow.com/questions/25358698/
Asked by Lenny
I have an HTML file with a lot of data; the part I am interested in is:
<tr valign=top>
<td><b>Total</b></td>
<td align=right><b>54</b></td>
<td align=right><b>1</b></td>
<td align=right>0 (0/0)</td>
<td align=right><b>0</b></td>
</tr>
I am trying to use awk, and my current command is:
awk -F "</*b>|</td>" '/<[b]>.*[0-9]/ {print $1, $2, $3}' "index.html"
But what I want to get is:
54
1
0
0
Right now I am getting:
'<td align=right> 54'
'<td align=right> 1'
'<td align=right> 0'
Any suggestions?
Answer by hek2mgl
awk is not an HTML parser. Use XPath or even XSLT for that. xmllint is a command-line tool that can execute XPath queries, and xsltproc can be used to perform XSL transformations. Both tools belong to the package libxml2-utils.
You can also use a programming language that is able to parse HTML.
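As a rough sketch of the XPath route, assuming xmllint is installed: the snippet below wraps the row from the question in a <table> and reads each right-aligned cell as a string. The names sample_html and extract_cells are made up for illustration, and the XPath expressions are one possible choice, not the answer author's.

```shell
#!/bin/sh
# Illustrative sample: the row from the question wrapped in a <table>.
sample_html='<table>
<tr valign=top>
<td><b>Total</b></td>
<td align=right><b>54</b></td>
<td align=right><b>1</b></td>
<td align=right>0 (0/0)</td>
<td align=right><b>0</b></td>
</tr>
</table>'

# Hypothetical helper: print the text of every <td align=right> cell
# read from standard input, one per line.
extract_cells() {
    html=$(cat)
    # --html selects the forgiving HTML parser; "-" reads from stdin.
    n=$(printf '%s\n' "$html" |
        xmllint --html --xpath 'count(//td[@align="right"])' - 2>/dev/null)
    n=${n:-0}
    i=1
    while [ "$i" -le "$n" ]; do
        # string() yields the text content of the i-th matching cell.
        printf '%s\n' "$html" |
            xmllint --html --xpath "string((//td[@align='right'])[$i])" - 2>/dev/null
        i=$((i + 1))
    done
}

printf '%s\n' "$sample_html" | extract_cells
```

Re-parsing the input once per cell is wasteful, but it keeps the output format independent of how a given xmllint version prints node sets.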
Answer by konsolebox
awk -F '[<>]' '/<td / { gsub(/<b>/, ""); sub(/ .*/, "", $3); print $3 } ' file
Output:
54
1
0
0
Another:
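awk -F '[<>]' '
# Start at the "Total" cell, then read the <td ...> lines that follow it.
/<td><b>Total<\/b><\/td>/ {
    while (getline > 0 && /<td /) {
        # Drop <b> so the value lands in $3, then cut everything after the first space.
        gsub(/<b>/, ""); sub(/ .*/, "", $3)
        print $3
    }
    exit
}' file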
Answer by jm666
You really should use a real HTML parser for this job, like:
perl -Mojo -0777 -nlE 'say [split(/\s/, $_->all_text)]->[0] for x($_)->find("td[align=right]")->each'
prints:
54
1
0
0
For this you need Perl with the Mojolicious package installed, which is easy to install with:
curl -L get.mojolicio.us | sh
Answer by Ed Morton
$ awk -F'<td[^>]*>(<b>)?|(</?b>)?</td>' '$2~/[0-9]/{print $2+0}' file
54
1
0
0
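To see why the value ends up in the second field, here is a small sketch that prints each awk field for one row, using the same separator regex. The helper name show_fields is made up for illustration. The opening tags and the closing tags both act as separators, so the cell text is all that remains in $2; adding 0 to it then turns "0 (0/0)" into the number 0.

```shell
#!/bin/sh
# Hypothetical helper: show how the separator regex splits a single line.
show_fields() {
    printf '%s\n' "$1" |
        awk -F'<td[^>]*>(<b>)?|(</?b>)?</td>' '
            { for (i = 1; i <= NF; i++) printf "$%d=[%s]\n", i, $i }'
}

show_fields '<td align=right><b>54</b></td>'
# prints:
# $1=[]
# $2=[54]
# $3=[]
```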
Answer by kenorb
BSD/GNU grep / ripgrep
For simple extraction, you can use grep, for example:
Your example using grep:
$ egrep -o "[0-9][^<]\?\+" file.html
54
1
0 (0/0)
0
and using ripgrep:
$ rg -o ">([^>]+)<" -r '$1' <file.html | tail -n +2
54
1
0 (0/0)
0
Extracting the outer HTML of H1:
$ curl -s http://example.com/ | egrep -o '<h1>.*</h1>'
<h1>Example Domain</h1>
Other examples:
Extracting the body:
$ curl -s http://example.com/ | xargs | egrep -o '<body>.*</body>'
<body> <div> <h1>Example Domain</h1> ...
Instead of xargs you can also use tr '\n' ' '.
For multiple tags, see: Text between two tags.
If you're dealing with large datasets, consider using ripgrep, which has similar syntax and is typically much faster (it is written in Rust).
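For the row in the question, a variant of the grep approach that relies only on -o and -E (supported by both GNU and BSD grep) might look like the sketch below. The pattern and the names sample_html and cell_values are illustrative assumptions, not part of the original answer.

```shell
#!/bin/sh
# Illustrative sample: the row from the question.
sample_html='<tr valign=top>
<td><b>Total</b></td>
<td align=right><b>54</b></td>
<td align=right><b>1</b></td>
<td align=right>0 (0/0)</td>
<td align=right><b>0</b></td>
</tr>'

# Hypothetical helper: print the text of every cell whose content starts
# with a digit. -o prints only the matched part, -E enables extended
# regular expressions; tr then removes the leading ">" of each match.
cell_values() {
    grep -oE '>[0-9][^<]*' | tr -d '>'
}

printf '%s\n' "$sample_html" | cell_values
```

Like every regex-based approach here, this depends on each cell sitting on a single line and on the text never containing a literal ">".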
Answer by greyfade
I was recently pointed to pup, which, in the limited testing I've done, is much more forgiving of invalid HTML and tag soup.
cat <<'EOF' | pup -c 'td + td text{}'
<table>
<tr valign=top>
<td><b>Total</b></td>
<td align=right><b>54</b></td>
<td align=right><b>1</b></td>
<td align=right>0 (0/0)</td>
<td align=right><b>0</b></td>
</tr>
</table>
EOF
Prints:
54
1
0 (0/0)
0
Answer by kenorb
HTML-XML-utils
You may use html-xml-utils for parsing well-formed HTML/XML files. The package includes many command-line tools to extract or modify the data. For example:
$ curl -s http://example.com/ | hxselect title
<title>Example Domain</title>
Here is the example with the provided data:
$ hxselect -c -s "\n" "td[align=right]" <file.html
<b>54</b>
<b>1</b>
0 (0/0)
<b>0</b>
Here is the final example, which strips out the <b> tags:
$ hxselect -c -s "\n" "td[align=right]" <file.html | sed "s/<[^>]\+>//g"
54
1
0 (0/0)
0
For more examples, check html-xml-utils.
Answer by kenorb
ex/vim
For more advanced parsing, you may use in-place editors such as ex/vi, where you can jump between matching HTML tags, select or delete inner/outer tags, and edit the content in place.
Here is the command:
$ ex +"%s/^[^>].*>\([^<]\+\)<.*/\1/g" +"g/[a-zA-Z]/d" +%p -scq! file.html
54
1
0 (0/0)
0
This is how the command works:
Use the ex in-place editor to substitute on all lines (%) with: ex +"%s/pattern/replace/g".
The substitution pattern consists of 3 parts:
- Select from the beginning of the line up to > (^[^>].*>) for removal, right before the 2nd part.
- Select our main part up to < (\([^<]\+\)).
- Select everything else from < onward for removal (<.*).
- Replace the whole matching line with \1, which refers to the pattern inside the brackets (\(\)).
After the substitution, remove any lines that still contain letters by using global: g/[a-zA-Z]/d.
Finally, print the current buffer on the screen with +%p.
Then silently (-s) quit without saving (-c "q!"), or save into the file (-c "wq").
Once tested, to edit the file in place, change -scq! to -scwq.
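The same two steps (keep only the captured text, then drop lines that still contain letters) can be sketched with sed, which reads from a pipe and never touches the file. This is an equivalent written for illustration, not part of the original answer; sample_html and strip_to_values are made-up names, and [^<][^<]* stands in for the GNU-specific \+.

```shell
#!/bin/sh
# Illustrative sample: the row from the question.
sample_html='<tr valign=top>
<td><b>Total</b></td>
<td align=right><b>54</b></td>
<td align=right><b>1</b></td>
<td align=right>0 (0/0)</td>
<td align=right><b>0</b></td>
</tr>'

# Hypothetical helper mirroring the ex command:
#   1st expression: replace the whole line with the text captured between
#                   a ">" and the next "<" (the \1 back-reference).
#   2nd expression: delete every line that still contains a letter, which
#                   removes untouched tag lines and the "Total" label.
strip_to_values() {
    sed -e 's/^[^>].*>\([^<][^<]*\)<.*/\1/' -e '/[a-zA-Z]/d'
}

printf '%s\n' "$sample_html" | strip_to_values
```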
Here is another simple example, which removes the style tag from the header and prints the parsed output:
$ curl -s http://example.com/ | ex -s +'/<style.*/norm nvatd' +%p -cq! /dev/stdin
However, it's not advised to use regex for parsing your HTML, so for a long-term approach you should use an appropriate language (such as Python, Perl or PHP DOM).
See also: