Get the content between a pair of HTML tags using Bash

Question

Asked by Joao

I need to get the HTML content between a given pair of tags using a bash script. For example, given the HTML below:

<html>
<head>
</head>
<body>
 text
  <div>
  text2
    <div>
        text3
    </div>
  </div>
</body>
</html>

Using the bash command or script, and given the body tag, we would get:

 text
  <div>
  text2
    <div>
    text3
    </div>
  </div>

Thanks in advance.

Answer 1

Accepted answer by Kent

Plain text processing is a poor fit for parsing HTML/XML. I hope this gives you some ideas:

kent$  xmllint --xpath "//body" f.html 
<body>
 text
  <div>
  text2
    <div>
        text3
    </div>
  </div>
</body>
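
The output above still includes the body tags themselves. As a rough sketch (not part of the original answer), selecting the child nodes with the XPath `//body/node()` should return only what sits between them. The temporary sample file is made up for the demo, and `--html` is added on the assumption that real-world input may not be well-formed XML. The exact whitespace between printed nodes can differ between libxml2 versions.

```shell
#!/usr/bin/env bash
# Sketch: select the children of <body> rather than <body> itself.
file=$(mktemp)   # hypothetical sample file, created only for this demo
cat > "$file" <<'EOF'
<html>
<head>
</head>
<body>
 text
  <div>
  text2
  </div>
</body>
</html>
EOF

inner=
if command -v xmllint >/dev/null 2>&1; then
  # --html switches to libxml2's more forgiving HTML parser;
  # node() matches both text and element children of <body>.
  inner=$(xmllint --html --xpath '//body/node()' "$file" 2>/dev/null)
  printf '%s\n' "$inner"
else
  echo "xmllint not found; it ships with the libxml2 utilities" >&2
fi
rm -f "$file"
```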

Answer 2

Answered by BMW

You can use sed in the shell, so you don't need to install anything else. Note that this also prints the lines containing the opening and closing tags.

tag=body
sed -n "/<$tag>/,/<\/$tag>/p" file
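
If the tag lines themselves should be left out, as in the question's expected output, one possible variation is to delete them inside the matched range. This is only a sketch under strong assumptions: each tag sits alone on its own line, carries no attributes, and is not nested inside a tag of the same name. The `between_tags` helper and the temporary sample file are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: print only the lines strictly between <tag> and </tag>.
# Assumes the tags are on their own lines and have no attributes.
tag=body
file=$(mktemp)   # hypothetical sample file, created only for this demo
cat > "$file" <<'EOF'
<html>
<head>
</head>
<body>
 text
  <div>
  text2
  </div>
</body>
</html>
EOF

between_tags() {
  # $1 = tag name, $2 = input file
  # Within the range, drop the opening and closing tag lines, print the rest.
  sed -n "/<$1>/,/<\/$1>/{ /<$1>/d; /<\/$1>/d; p; }" "$2"
}

result=$(between_tags "$tag" "$file")
printf '%s\n' "$result"
rm -f "$file"
```

For anything beyond such tidy input (attributes, several tags on one line, nested tags of the same name), a real HTML parser is the safer choice.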

Answer 3

Answered by Cromax

Personally, I find the hxselect command (often together with hxclean) from the html-xml-utils package very useful. hxclean repairs (sometimes broken) HTML into well-formed XML, and hxselect lets you use CSS selectors to pick the node(s) you need. With the -c option, hxselect prints only the content of the matched element, without the surrounding tags. All these commands read from stdin and write to stdout. So in your case you would run:

$ hxselect -c body <<HTML
  <html>
  <head>
  </head>
  <body>
    text
    <div>
      text2
      <div>
        text3
      </div>
    </div>
  </body>
  </html>
HTML

to get what you need. Plain and simple.

Answer 4

Answered by mklement0

Another option is to use the multi-platform xidel utility (home page on SourceForge, GitHub repository), which can handle both XML and HTML:

xidel in.html  -e '/html/body/node()' --printed-node-format=html

Answer 5

Answered by Paulo Fidalgo

Setting Bash aside because of its limitations, you can use nokogiri as a command-line utility, as explained here.

Example:

curl -s http://example.com/ | nokogiri -e 'puts $_.search('\''a'\'')'

Answer 6

Answered by Aaron Digulla

Bash is probably the wrong tool for this. Try a Python script using the powerful Beautiful Soup library instead.

It will be more work upfront, but in the long run (here: after one hour) the time saved will make up for the additional effort.
