bash: Using grep to get text from inside an HTML tag in a local file

Notice: This page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/3593124/

Date: 2020-09-17 22:32:32  Source: igfitidea

Getting text from inside an HTML tag within a local file with grep

Tags: html, regex, bash, screen-scraping, grep

Asked by LakeMicrobe

Possible Duplicate:
RegEx match open tags except XHTML self-contained tags


Excerpt From Input File


<TD class="clsTDLabelWeb" width="28%">Municipality:&nbsp;</TD>
<TD style="WIDTH: 394px" class="clsTDLabelSm" colSpan="5">
<span id="DInfo1_Municipality">JUPITER</span></TD>

My Regular Expression


(?<=<span id="DInfo1_Municipality">)([^</span>]*)

I have an HTML file saved to disk. I would like to use grep to search through the file and output the contents of a specific span, though I don't know if this is a proper use of grep. When I run grep on the file with the expression read from another file (so I don't mess up escaping any special characters), it doesn't output anything. I have tested the expression in RegExr and it matches "JUPITER", which is exactly what I want returned. Thank you so much for your help!


Desired Output


JUPITER

Answered by Paused until further notice.

Give this a try:


sed -n 's|^<span id="DInfo1_Municipality">\([^<]*\)</span></TD>$|\1|p' file
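
To try the sed approach against the excerpt from the question, here is a minimal sketch that writes the captured group back with `\1`. The file name `sample.html` and the function name `extract_municipality` are placeholders introduced here, not part of the original answer. Because the pattern is anchored with `^` and `$`, it assumes the span line has no leading whitespace and ends with `</TD>`.

```shell
#!/bin/sh
# Recreate the excerpt from the question in a scratch file (hypothetical name).
cat > sample.html <<'EOF'
<TD class="clsTDLabelWeb" width="28%">Municipality:&nbsp;</TD>
<TD style="WIDTH: 394px" class="clsTDLabelSm" colSpan="5">
<span id="DInfo1_Municipality">JUPITER</span></TD>
EOF

# -n suppresses default output; the p flag prints only lines where the
# substitution succeeded, and \1 is the text captured between the tags.
extract_municipality() {
  sed -n 's|^<span id="DInfo1_Municipality">\([^<]*\)</span></TD>$|\1|p' "$1"
}

extract_municipality sample.html
```

If the real file indents the span or puts other markup on the same line, the anchors would need to be relaxed.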

or with GNU grep and your regex:


grep -Po '(?<=<span id="DInfo1_Municipality">)([^</span>]*)' file
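
One caveat worth illustrating: `[^</span>]` is a bracket expression, so it excludes the single characters `<`, `/`, `s`, `p`, `a`, `n`, and `>` rather than the string `</span>`. That happens to work for `JUPITER` but cuts off values that contain any of those characters. The sketch below assumes GNU grep built with PCRE support (`-P`). The sample value `Tampa` and both function names are made up for illustration.

```shell
#!/bin/sh
# Hypothetical input line; "Tampa" contains the lowercase letters a and p.
line='<span id="DInfo1_Municipality">Tampa</span></TD>'

# Regex from the question: the match stops at the first of < / s p a n >.
with_char_class() {
  printf '%s\n' "$1" | grep -Po '(?<=<span id="DInfo1_Municipality">)([^</span>]*)'
}

# Variant that only stops at the next "<".
until_next_tag() {
  printf '%s\n' "$1" | grep -Po '(?<=<span id="DInfo1_Municipality">)[^<]*'
}

with_char_class "$line"   # prints the truncated value
until_next_tag "$line"    # prints the full span text
```

Under these assumptions, `[^<]*` is the safer class for this job, though any regex here still relies on the span containing no nested tags.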

Answered by Paul Creasey

Grep doesn't support that type of regex (lookbehind assertions) by default, and it's a very poor tool for this. For the example given it is workable, but it will break in many situations.


grep -io "<span id=\"DInfo1_Municipality\">.*</span>" file.html | grep -io ">[^<]*" | grep -io '[^>]*'

Something crazy like that; not a good idea.


Answered by ghostdog74

sed -n '/DInfo1_Municipality/s/<\/span.*//p' file | sed 's/.*>//'
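
As a sketch of how this pipeline works: the first sed keeps only lines that mention the id and deletes everything from `</span` onward, and the second strips everything up to the last remaining `>`. The input line is the one from the question; the two function names are placeholders chosen here to label each stage.

```shell
#!/bin/sh
# The span line from the question's excerpt.
line='<span id="DInfo1_Municipality">JUPITER</span></TD>'

# Stage 1: on lines containing the id, drop the closing tag and all that follows.
strip_closing_tag() {
  sed -n '/DInfo1_Municipality/s/<\/span.*//p'
}

# Stage 2: the greedy .* consumes through the last ">", leaving the span text.
strip_leading_markup() {
  sed 's/.*>//'
}

printf '%s\n' "$line" | strip_closing_tag
printf '%s\n' "$line" | strip_closing_tag | strip_leading_markup
```

The first command shows the intermediate result with the opening tag still attached; the second shows the final text.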