bash: Remove all HTML tags from a web page

Question

Asked by David W.

I am doing some Bash shell scripting with curl. If my curl command returns any text, I know I have an error. The text returned by curl is usually HTML. I figured that if I could strip out all of the HTML tags, I could display the resulting text as an error message.

I was thinking of something like this:

sed -E 's/<.*?>//g' <<<$output_text

But I get sed: 1: "s/<.*?>//": RE error: repetition-operator operand invalid

If I replace *? with *, I don't get the error (and I don't get any text either). If I remove the global (g) flag, I get the same error.

This is on Mac OS X.

Answer 1

Answered by Kent

sed doesn't support non-greedy quantifiers such as *?.

Try:

's/<[^>]*>//g'
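
To see why the negated character class helps, here is a minimal sketch comparing it with the greedy pattern. The helper names strip_tags and strip_tags_greedy and the sample error page are made up for illustration; they are not part of the original answer.

```shell
# Hypothetical helper: remove every <...> run that sits on a single line.
strip_tags() {
  sed 's/<[^>]*>//g'
}

# For comparison: the greedy form matches from the first "<" to the LAST ">".
strip_tags_greedy() {
  sed 's/<.*>//g'
}

# Made-up sample standing in for curl's HTML error output.
output_text='<html><body><h1>404 Not Found</h1><p>The page is <b>missing</b>.</p></body></html>'

printf '%s\n' "$output_text" | strip_tags         # prints: 404 Not FoundThe page is missing.
printf '%s\n' "$output_text" | strip_tags_greedy  # prints an empty line
```

The greedy version swallows the whole line because the line starts with "<" and ends with ">", which would explain getting no text at all with <.*>.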

Answer 2

Answered by jm666

Maybe a parser-based Perl solution?

perl -0777 -MHTML::Strip -nlE 'say HTML::Strip->new->parse($_)' file.html

You must install the HTML::Strip module with the cpan HTML::Strip command.

Alternatively, you can use a standard OS X utility called textutil (see its man page):

textutil -convert txt file.html

will produce file.txt with the HTML tags stripped, or

textutil -convert txt -stdin -stdout < file.html | some_command

Another alternative

Some systems have the lynx text-only browser installed. You can use:

lynx -dump file.html #or
lynx -stdin -dump < file.html

But in your case, you can rely only on pure sed or awk solutions... IMHO.

But if you have perl (and only lack the HTML::Strip module), the following is still better than sed:

perl -0777 -pe 's/<.*?>//sg'

because it will also remove a (common) multiline tag like this one:

<a
 href="#"
 class="some"
>link text</a>

Answer 3

Answered by captcha

Code for GNU sed:

sed '/</ {:k s/<[^>]*>//g; /</ {N; bk}}' file

This might fail; you would be better off using an HTML-parsing tool.
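
The sed one-liner can be spelled out one command per line, which makes the loop easier to follow. This is only a sketch: strip_tags_multiline is a hypothetical name, and the newline-separated layout is my own choice, which I would expect to be easier on non-GNU seds as well.

```shell
# Hypothetical helper: same logic as the one-liner, one sed command per line.
#   /</{ ... }    only act on lines that contain "<"
#   :k            loop label
#   s/<[^>]*>//g  delete every complete tag currently in the pattern space
#   /</{ N; bk }  a "<" is left, so a tag is still open: append the next
#                 input line and jump back to :k
strip_tags_multiline() {
  sed '/</{
:k
s/<[^>]*>//g
/</{
N
bk
}
}'
}

printf '<a\n href="#"\n>link text</a>\n' | strip_tags_multiline   # prints: link text
```

One way this can fail: a stray "<" in ordinary text (for example "a < b") keeps pulling in lines until a ">" turns up, and everything in between is deleted.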

Answer 4

Answered by Mohsen Abasi

If you want to remove all HTML tags and also all script tags (and their contents), you can use the following:

sed -i 's/<script>.*<\/script>//g;/<script>/,/<\/script>/{/<script>/!{/<\/script>/!d}};s/<script>.*//g;s/.*<\/script>//g' "$file" && sed -i '/</ {:k s/<[^>]*>//g; /</ {N; bk}}' "$file" && sed -i -r '/^\s*$/d' "$file"
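
To make the three steps easier to follow, here is a sketch that runs them as a pipeline on stdin instead of editing $file in place. The name strip_scripts_and_tags and the sample page are invented for illustration, and [[:space:]] stands in for \s. Like the command above, it only recognises a bare <script> tag without attributes.

```shell
# Hypothetical helper, three sed passes:
#   1. drop <script>...</script> blocks, on one line or spread over several
#   2. strip the remaining tags, joining lines while a tag is still open
#   3. delete lines that are empty or whitespace only
strip_scripts_and_tags() {
  sed 's/<script>.*<\/script>//g;/<script>/,/<\/script>/{/<script>/!{/<\/script>/!d;};};s/<script>.*//;s/.*<\/script>//' |
  sed '/</{
:k
s/<[^>]*>//g
/</{
N
bk
}
}' |
  sed '/^[[:space:]]*$/d'
}

# Made-up sample page, one argument per line.
printf '%s\n' '<p>Hello</p>' '<script>' 'var x = "<b>";' '</script>' '<p>World</p>' |
  strip_scripts_and_tags
# prints:
# Hello
# World
```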
