Parse HTML using shell

Note: this page reproduces a popular Stack Overflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/25358698/

Date: 2020-09-08 21:38:01. Source: igfitidea.

Tags: bash, shell, parsing, awk

Asked by Lenny

I have an HTML file with lots of data. This is the part I am interested in:

<tr valign=top>
<td><b>Total</b></td>
<td align=right><b>54</b></td>
<td align=right><b>1</b></td>
<td align=right>0 (0/0)</td>
<td align=right><b>0</b></td>
</tr>

I am trying to use awk; my current command is:

awk -F "</*b>|</td>" '/<[b]>.*[0-9]/ {print $1, $2, $3 }' "index.html"

but what I want is to have:

54
1
0
0

Right now I am getting:

'<td align=right> 54'
'<td align=right> 1'
'<td align=right> 0'

Any suggestions?

Answered by hek2mgl

awk is not an HTML parser. Use XPath or even XSLT for that. xmllint is a command-line tool that can execute XPath queries, and xsltproc can be used to perform XSL transformations. On Debian-based systems, xmllint ships in the libxml2-utils package and xsltproc in the xsltproc package.

You can also use a programming language that is able to parse HTML.

Answered by konsolebox

awk  -F '[<>]' '/<td / { gsub(/<b>/, ""); sub(/ .*/, "", $3); print $3 } ' file

Output:

54
1
0
0

Another:

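awk  -F '[<>]' '
# Start at the "Total" row, then read the <td ...> cells that follow it.
/<td><b>Total<\/b><\/td>/ {
    while (getline > 0 && /<td /) {
        # Drop "<b>" so the cell text is field 3, then cut at the first space.
        gsub(/<b>/, ""); sub(/ .*/, "", $3)
        print $3
    }
    exit
}' file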

Answered by jm666

You really should use a real HTML parser for this job, for example:

perl -Mojo -0777 -nlE 'say [split(/\s/, $_->all_text)]->[0] for x($_)->find("td[align=right]")->each'

prints:

54
1
0
0

For this you need Perl with the Mojolicious package installed.

(it is easy to install with:)

curl -L get.mojolicio.us | sh

Answered by Ed Morton

$ awk -F'<td[^>]*>(<b>)?|(</?b>)?</td>' '$2~/[0-9]/{print $2+0}' file
54
1
0
0
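
The one-liner above relies on a regular-expression field separator. As a rough sketch of why it works (the function name extract_numbers and the inline sample are illustrative assumptions, not part of the original answer): the separator swallows the opening <td ...> with an optional <b>, as well as the optional </b> with the closing </td>, so the cell text lands in field 2, and adding 0 reduces a value such as "0 (0/0)" to its leading number.

```shell
#!/bin/sh
# Sketch: extract the numeric table cells with a regex field separator.
# extract_numbers is a hypothetical helper name; it reads HTML on stdin.
extract_numbers() {
    # FS matches "<td ...>" (optionally followed by "<b>") or an optional
    # "<b>"/"</b>" followed by "</td>", so on a cell line $1 is empty
    # and $2 holds the cell text.
    awk -F'<td[^>]*>(<b>)?|(</?b>)?</td>' '
        $2 ~ /[0-9]/ { print $2 + 0 }   # "+ 0" keeps only the leading number
    '
}

# Usage with the row from the question:
extract_numbers <<'EOF'
<tr valign=top>
<td><b>Total</b></td>
<td align=right><b>54</b></td>
<td align=right><b>1</b></td>
<td align=right>0 (0/0)</td>
<td align=right><b>0</b></td>
</tr>
EOF
```

Rows without a digit in field 2, such as the "Total" label, are skipped by the pattern in front of the action.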

Answered by kenorb

BSD/GNU grep/ripgrep

For simple extracting, you can use grep, for example:

  • Your example using grep:

    $ grep -Eo "[0-9][^<]*" file.html
    54
    1
    0 (0/0)
    0
    

    and using ripgrep:

    $ rg -o ">([^>]+)<" -r '$1' <file.html | tail -n +2
    54
    1
    0 (0/0)
    0
    
  • Extracting outer html of H1:

    $ curl -s http://example.com/ | egrep -o '<h1>.*</h1>'
    <h1>Example Domain</h1>
    

Other examples:

  • Extracting the body:

    $ curl -s http://example.com/ | xargs | egrep -o '<body>.*</body>'
    <body> <div> <h1>Example Domain</h1> ...
    

    Instead of xargs you can also use tr '\n' ' '.

  • For multiple tags, see: Text between two tags.
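
As a small sketch of the line-joining trick above: grep -o matches within a single line, so a multi-line element has to be flattened first. The function name extract_body and the inline sample page are assumptions for illustration. Note that, unlike xargs, tr does not interpret quotes or collapse whitespace.

```shell
#!/bin/sh
# extract_body is a hypothetical helper: it prints the <body>...</body>
# element of the HTML read from stdin, even if it spans several lines.
extract_body() {
    # grep -o works line by line, so join all lines into one first.
    tr '\n' ' ' | grep -o '<body>.*</body>'
}

# Usage with a small inline page:
extract_body <<'EOF'
<html>
<body>
<h1>Example Domain</h1>
</body>
</html>
EOF
```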

If you're dealing with large datasets, consider using ripgrep, which has a similar syntax but is much faster (it is written in Rust).

Answered by greyfade

I was recently pointed to pup, which in the limited testing I've done, is much more forgiving with invalid HTML and tag soup.

cat <<'EOF' | pup -c 'td + td text{}'
<table>
<tr valign=top>
<td><b>Total</b></td>
<td align=right><b>54</b></td>
<td align=right><b>1</b></td>
<td align=right>0 (0/0)</td>
<td align=right><b>0</b></td>
</tr>
</table>
EOF

Prints:

54
1
0 (0/0)
0

Answered by kenorb

HTML-XML-utils

You may use html-xml-utils for parsing well-formed HTML/XML files. The package includes many command-line tools to extract or modify the data. For example:

$ curl -s http://example.com/ | hxselect title
<title>Example Domain</title>


Here is the example with provided data:

$ hxselect -c -s "\n" "td[align=right]" <file.html
<b>54</b>
<b>1</b>
0 (0/0)
<b>0</b>

Here is the final example, which strips out the <b> tags:

$ hxselect -c -s "\n" "td[align=right]" <file.html | sed "s/<[^>]\+>//g"
54
1
0 (0/0)
0


For more examples, see the html-xml-utils documentation.

Answered by kenorb

ex/vim

For more advanced parsing, you may use in-place editors such as ex/vi, where you can jump between matching HTML tags, select or delete inner/outer tags, and edit the content in place.

Here is the command:

$ ex +"%s/^[^>].*>\([^<]\+\)<.*/\1/g" +"g/[a-zA-Z]/d" +%p -scq! file.html
54
1
0 (0/0)
0

This is how the command works:

  • Use the ex in-place editor to substitute on all lines (%) with: ex +"%s/pattern/replace/g".

    The substitution pattern consists of 3 parts:

    • Select from the beginning of the line up to > (^[^>].*>) for removal, right before the 2nd part.
    • Select our main part up to < (\([^<]\+\)).
    • Select everything else from < onward for removal (<.*).

    We replace the whole matching line with \1, which refers to the pattern inside the escaped brackets (\(\)).
  • After the substitution, we remove any lines that contain letters by using global: g/[a-zA-Z]/d.

  • Finally, print the current buffer on the screen with +%p.
  • Then silently (-s) quit without saving (-c "q!"), or save into the file (-c "wq").

Once tested, to modify the file in place, change -scq! to -scwq.
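
The substitute-then-delete steps described above are not specific to ex. A rough equivalent as a sed stream filter might look like this (a sketch only: cell_text is a hypothetical function name, and the pattern is the same line-oriented regex, so it shares the same fragility on real-world HTML):

```shell
#!/bin/sh
# cell_text is a hypothetical helper mirroring the two ex steps:
#   1. replace each matching line with the text between ">" and "<"
#   2. delete every line that still contains a letter
cell_text() {
    sed -e 's/^[^>].*>\([^<]\{1,\}\)<.*/\1/' \
        -e '/[a-zA-Z]/d'
}

# Usage with the row from the question:
cell_text <<'EOF'
<tr valign=top>
<td><b>Total</b></td>
<td align=right><b>54</b></td>
<td align=right><b>1</b></td>
<td align=right>0 (0/0)</td>
<td align=right><b>0</b></td>
</tr>
EOF
```

The POSIX interval \{1,\} is used instead of \+ so the expression does not depend on GNU extensions. Unlike the ex version, this reads from stdin and never touches the file.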

Here is another simple example which removes style tag from the header and prints the parsed output:

$ curl -s http://example.com/ | ex -s +'/<style.*/norm nvatd' +%p -cq! /dev/stdin

However, it's not advised to use regex for parsing your HTML, so for a long-term approach you should use an appropriate language (such as Python, Perl or PHP DOM).
