bash: Parsing HTML using shell

Question

Asked by Lenny

I have an HTML file with lots of data, and this is the part I am interested in:

<tr valign=top>
<td><b>Total</b></td>
<td align=right><b>54</b></td>
<td align=right><b>1</b></td>
<td align=right>0 (0/0)</td>
<td align=right><b>0</b></td>
</tr>

I tried to use awk; my current command is:

awk -F "</*b>|</td>" '/<[b]>.*[0-9]/ {print $1, $2, $3 }' "index.html"

What I want is only the values, without the surrounding markup, but right now I am getting:

'<td align=right> 54'
'<td align=right> 1'
'<td align=right> 0'

Any suggestions?

Answer 1

Answered by hek2mgl

awk is not an HTML parser. Use XPath or even XSLT for that. xmllint is a command-line tool that can execute XPath queries, and xsltproc can be used to perform XSL transformations. On Debian-based systems, xmllint is provided by the libxml2-utils package and xsltproc by the xsltproc package.

You can also use a programming language that is able to parse HTML.
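As a rough illustration of the XPath approach, the sketch below wraps xmllint in a small shell function. It assumes xmllint is installed. The function name first_right_cell and the file name sample.html are made up for this example and are not part of the original answer. The --html option selects libxml2's lenient HTML parser, and the XPath string() function returns the text of the first matching node.

```shell
# sample.html is a hypothetical file holding the table row from the question.
cat > sample.html <<'EOF'
<table>
<tr valign=top>
<td><b>Total</b></td>
<td align=right><b>54</b></td>
<td align=right><b>1</b></td>
<td align=right>0 (0/0)</td>
<td align=right><b>0</b></td>
</tr>
</table>
EOF

# Hypothetical helper: print the text of the first right-aligned cell.
# Parser warnings about the loose markup are discarded via 2>/dev/null.
first_right_cell() {
    xmllint --html --xpath 'string(//td[@align="right"][1])' "$1" 2>/dev/null
}
```

Calling first_right_cell sample.html prints the text content of the first td whose align attribute is "right". Changing the index in the XPath expression selects one of the other cells.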

Answer 2

Answered by konsolebox

awk -F '[<>]' '/<td / { gsub(/<b>/, ""); sub(/ .*/, "", $3); print $3 }' file

Output:

54
1
0
0

Another:

awk -F '[<>]' '
/<td><b>Total<\/b><\/td>/ {
    while (getline > 0 && /<td /) {
        gsub(/<b>/, ""); sub(/ .*/, "", $3)
        print $3
    }
    exit
}' file

Answer 3

Answered by jm666

You really should use a real HTML parser for this job, like:

perl -Mojo -0777 -nlE 'say [split(/\s/, $_->all_text)]->[0] for x($_)->find("td[align=right]")->each'

This prints the first whitespace-separated word of the text in each right-aligned cell.

For this you need Perl with the Mojolicious package installed.

(It is easy to install with:)

curl -L get.mojolicio.us | sh

Answer 4

Answered by Ed Morton

$ awk -F'<td[^>]*>(<b>)?|(</?b>)?</td>' '$2~/[0-9]/{print $2+0}' file
54
1
0
0

Answer 5

Answered by kenorb

BSD/GNU `grep`/`ripgrep`

For simple extraction, you can use grep.

Your example using grep:

$ grep -Eo '[0-9][^<]*' file.html
54
1
0 (0/0)
0

and using ripgrep:

$ rg -o ">([^>]+)<" -r '$1' <file.html | tail -n +2
54
1
0 (0/0)
0

Extracting outer html of H1:

$ curl -s http://example.com/ | grep -Eo '<h1>.*</h1>'
<h1>Example Domain</h1>

Other examples:

Extracting the body:

$ curl -s http://example.com/ | xargs | grep -Eo '<body>.*</body>'
<body> <div> <h1>Example Domain</h1> ...

Instead of xargs, you can also use tr '\n' ' '.

For multiple tags, see: Text between two tags.
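The tr variant mentioned above can be sketched as follows. This is only an illustration: page.html and the function name flatten_body are hypothetical, and the sample page is a stand-in rather than the real content of example.com. Unlike xargs, tr does not interpret quotes or backslashes in its input, which may matter for pages that contain them.

```shell
# page.html is a hypothetical stand-in for a downloaded page.
cat > page.html <<'EOF'
<html>
<body>
<div>
<h1>Example Domain</h1>
</div>
</body>
</html>
EOF

# Join all lines with spaces so the pattern can span the original line
# breaks, then print only the matching part (-o) using extended regex
# syntax (-E).
flatten_body() {
    tr '\n' ' ' < "$1" | grep -Eo '<body>.*</body>'
}

flatten_body page.html
```

Because .* is greedy, this works only when the document contains a single body element, which is the normal case.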

If you're dealing with large datasets, consider using ripgrep, which has a similar syntax and is typically much faster (it is written in Rust).

Answer 6

Answered by greyfade

I was recently pointed to pup, which, in the limited testing I've done, is much more forgiving of invalid HTML and tag soup.

cat <<'EOF' | pup -c 'td + td text{}'
<table>
<tr valign=top>
<td><b>Total</b></td>
<td align=right><b>54</b></td>
<td align=right><b>1</b></td>
<td align=right>0 (0/0)</td>
<td align=right><b>0</b></td>
</tr>
</table>
EOF

Prints:

54
1
0 (0/0)
0

Answer 7

Answered by kenorb

`HTML-XML-utils`

You may use html-xml-utils for parsing well-formed HTML/XML files. The package includes many command-line tools to extract or modify the data. For example:

$ curl -s http://example.com/ | hxselect title
<title>Example Domain</title>

Here is an example with the provided data:

$ hxselect -c -s "\n" "td[align=right]" <file.html
<b>54</b>
<b>1</b>
0 (0/0)
<b>0</b>

Here is the final example, which also strips out the <b> tags:

$ hxselect -c -s "\n" "td[align=right]" <file.html | sed "s/<[^>]\+>//g"
54
1
0 (0/0)
0

For more examples, see the html-xml-utils documentation.

Answer 8

Answered by kenorb

`ex`/`vim`

For more advanced parsing, you may use in-place editors such as ex/vi, where you can jump between matching HTML tags, select or delete inner/outer tags, and edit the content in place.

Here is the command:

$ ex +"%s/^[^>].*>\([^<]\+\)<.*/\1/g" +"g/[a-zA-Z]/d" +%p -scq! file.html
54
1
0 (0/0)
0

This is how the command works:

Use the ex editor to substitute on all lines (%): ex +"%s/pattern/replace/g".
The substitution pattern consists of 3 parts:
- Select from the beginning of the line up to > (^[^>].*>) for removal, right before the 2nd part.
- Select the main part up to < (\([^<]\+\)).
- Select everything from < onward for removal (<.*).
The whole matching line is replaced with \1, which refers to the pattern inside the escaped parentheses (\(\)).
After the substitution, remove any lines that contain letters by using global: g/[a-zA-Z]/d.
Finally, print the current buffer to the screen with +%p.
Then silently (-s) quit without saving (-c "q!"), or save into the file (-c "wq").
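The same two steps (substitute, then delete lines containing letters) can also be sketched with sed, which may be easier to try out non-interactively. This is an assumed equivalent, not part of the original answer: the file name sample.html and the function name extract_values are hypothetical, and [^<][^<]* is used in place of [^<]\+ so the expression stays within POSIX basic regular expressions.

```shell
# sample.html is a hypothetical file holding the row from the question.
cat > sample.html <<'EOF'
<tr valign=top>
<td><b>Total</b></td>
<td align=right><b>54</b></td>
<td align=right><b>1</b></td>
<td align=right>0 (0/0)</td>
<td align=right><b>0</b></td>
</tr>
EOF

# Step 1: replace each matching line with the text captured between > and <.
# Step 2: delete every line that still contains a letter (leftover tags, "Total").
extract_values() {
    sed -e 's/^[^>].*>\([^<][^<]*\)<.*/\1/' -e '/[a-zA-Z]/d' "$1"
}

extract_values sample.html
```

For the sample row this prints 54, 1, 0 (0/0) and 0 on separate lines. Like the ex command, it relies on each cell sitting on its own line.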

Once tested, to modify the file in place, change -scq! to -scwq.

Here is another simple example, which removes the style tag from the header and prints the parsed output:

$ curl -s http://example.com/ | ex -s +'/<style.*/norm nvatd' +%p -cq! /dev/stdin

However, it's not advised to use regex for parsing your HTML, so for a long-term approach you should use an appropriate language (such as Python, Perl or PHP DOM).
