bash: Extract part of the code and parse HTML in bash

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original: http://stackoverflow.com/questions/41001475/

Time: 2020-09-18 15:30:01  Source: igfitidea

Tags: bash, html-parsing

Asked by Pavol Travnik

I have an external HTML site and I need to extract data from a table on it. However, the site's HTML source is badly formatted everywhere except the table, so I cannot use

xmllint --html --xpath <xpath> <file>

because it does not work properly when the HTML on the site is broken.

My idea was to use curl and delete the code above and below the table. Once the table is extracted, the code is clean and suitable for the xmllint tool (so I can then use XPath). However, deleting everything above the match is challenging in shell, as noted here: "Sed doesn't backtrack: once it's processed a line, it's done." Is there a way to extract only the table's code from the HTML site in bash? Suppose the code has this structure:

<html>
<head>
</head>
<body>
<p>Lorem ipsum ....</p>
  <table class="my-table">
    <tr>
      <th>Company</th>
      <th>Contact</th>
    </tr>
  </table>
<p>... dolor.</p>
</body>
</html>

And I need output like this to parse data properly:

  <table class="my-table">
    <tr>
      <th>Company</th>
      <th>Contact</th>
    </tr>
  </table>

Please do not downvote me for trying to use bash.

Answered by Inian

I will break down my answer, which uses xmllint; it supports an --html flag for parsing HTML files.

First, you can check the sanity of your HTML file by parsing it as shown below, which confirms whether the file conforms to the standards or reports any errors it finds:

$ xmllint --html YourHTML.html
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<head>
</head>
<body>
<p>Lorem ipsum ....</p>
  <table class="my-table">
    <tr>
      <th>Company</th>
      <th>Contact</th>
    </tr>
  </table>
<p>... dolor.</p>
</body>
</html>

where my YourHTML.html file is just the input HTML from your question.

Now for the value extraction part:

Start parsing the file from the root node down to the table node (//html/body/table), running xmllint in HTML parser and interactive shell mode (xmllint --html --shell).

Running the command as-is produces this result:

$ echo "cat //html/body/table" |  xmllint --html --shell YourHTML.html
/ >  -------
<table class="my-table">
    <tr>
      <th>Company</th>
      <th>Contact</th>
    </tr>
  </table>
/ > 

Now removing the shell prompt lines (the special characters) using sed, i.e. sed '/^\/ >/d', produces

$ echo "cat //html/body/table" |  xmllint --html --shell YourHTML.html | sed '/^\/ >/d'
<table class="my-table">
    <tr>
      <th>Company</th>
      <th>Contact</th>
    </tr>
  </table>

which is the output structure you expected. Tested on xmllint: using libxml version 20900.
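
When the document parses cleanly, as the sample does, the same node can probably be selected without the interactive shell by using the --xpath option mentioned in the question. The following is only a sketch under that assumption: table_by_xpath is a hypothetical helper name, the class-based XPath is an illustrative choice, and the question reports that this route fails on the real, broken page.

```shell
#!/usr/bin/env bash
# table_by_xpath is a hypothetical helper, not part of the original answer.
# It reads HTML on stdin ("-") and prints the node matched by the XPath.
# Assumes xmllint is installed and the input parses well enough to match.
table_by_xpath() {
  # Parser warnings go to stderr; discard them so only the node is printed.
  xmllint --html --xpath '//table[@class="my-table"]' - 2>/dev/null
}

# Demo on a shortened version of the sample page, skipped if xmllint is missing.
if command -v xmllint >/dev/null 2>&1; then
  table_by_xpath <<'EOF'
<html>
<body>
<p>Lorem ipsum ....</p>
  <table class="my-table">
    <tr>
      <th>Company</th>
      <th>Contact</th>
    </tr>
  </table>
<p>... dolor.</p>
</body>
</html>
EOF
else
  echo "xmllint not found; skipping demo" >&2
fi
```

Because nothing is read from an interactive prompt here, there are no "/ >" lines to strip afterwards.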

Going one step further, if you want to fetch the values inside the table tag, you can apply a sed command to extract them:

$ echo "cat //html/body/table" |  xmllint --html --shell YourHTML.html | sed '/^\/ >/d' | sed 's/<[^>]*.//g' | xargs
Company Contact
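
The xargs step above joins every value onto one line, so a cell containing spaces can no longer be told apart from its neighbours. If that matters, a variant that keeps one value per line might look like the sketch below. cell_values is a hypothetical helper name, and it assumes each cell sits on its own line and that no tag spans multiple lines.

```shell
#!/usr/bin/env bash
# cell_values is a hypothetical helper, not part of the original answer.
# It reads table markup on stdin, strips every tag, trims leading and
# trailing whitespace, and drops lines that end up empty.
cell_values() {
  sed -e 's/<[^>]*>//g' \
      -e 's/^[[:space:]]*//' \
      -e 's/[[:space:]]*$//' \
      -e '/^$/d'
}

# Demo on the table extracted earlier in this answer.
cell_values <<'EOF'
  <table class="my-table">
    <tr>
      <th>Company</th>
      <th>Contact</th>
    </tr>
  </table>
EOF
```

The output can then be read line by line, for example with a while read loop.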

Answered by experiment.pl

I'm not sure why nobody has mentioned a pure Bash solution, despite its limitations (for example, it expects the opening and closing tags to be on separate lines; nevertheless, you said you've cleaned the .html).

For your purposes, a quick solution would be a one-liner:

sed -n '/<table class="my-table">/,/<\/table>/p'  <file>

Explanation: print everything between the two specified tags, in this case <table class="my-table"> and </table>.
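
To see the one-liner in action, here is a small self-contained sketch that runs it on the sample page from the question. extract_table is a hypothetical wrapper name added only for illustration; the sed expression itself is the one given above.

```shell
#!/usr/bin/env bash
# extract_table is a hypothetical wrapper around the one-liner above.
# It reads HTML on stdin and prints the lines from the opening
# <table class="my-table"> tag through the first following </table> line.
extract_table() {
  # -n suppresses default output; p prints only lines inside the range.
  sed -n '/<table class="my-table">/,/<\/table>/p'
}

# Demo on the sample page from the question.
extract_table <<'EOF'
<html>
<head>
</head>
<body>
<p>Lorem ipsum ....</p>
  <table class="my-table">
    <tr>
      <th>Company</th>
      <th>Contact</th>
    </tr>
  </table>
<p>... dolor.</p>
</body>
</html>
EOF
```

Reading from stdin means the same function can sit at the end of a curl pipeline instead of taking a file name.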

You could also easily put the tag in a variable, e.g. <body> or <p>, and change the output on the fly. But the solution above gives you what you asked for without external tools.
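
A minimal sketch of the tag-variable idea follows. extract_tag is a hypothetical helper name, and it assumes the tag name contains no characters that are special in a sed regular expression and that the opening and closing tags are on different lines.

```shell
#!/usr/bin/env bash
# extract_tag is a hypothetical helper, not part of the original answer.
# Usage: extract_tag TAGNAME < file.html
extract_tag() {
  local tag=$1
  # Require a space or ">" after the name so that "p" does not match "<pre>".
  sed -n "/<${tag}[ >]/,/<\/${tag}>/p"
}

# Demo: print only the <tr> block of the sample table.
extract_tag tr <<'EOF'
<body>
<p>Lorem ipsum ....</p>
  <table class="my-table">
    <tr>
      <th>Company</th>
      <th>Contact</th>
    </tr>
  </table>
<p>... dolor.</p>
</body>
EOF
```

Note that an element opened and closed on a single line, such as the <p> lines in the sample, runs into the limitation mentioned at the top of this answer: sed only starts looking for the closing tag on the next line, so the printed range extends to the next line containing </p>.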