How to remove XML tags from Unix command line?

Notice: this page reproduces a popular Stack Overflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not to this site). Original question: http://stackoverflow.com/questions/5376024/

Tags: xml, shell, unix, command-line, xml-parsing

Asked by Tarski

I am grepping an XML file, which gives me output like this:

<tag>data</tag>
<tag>more data</tag>
...

Note, this is a flat file, not an XML tree. I want to remove the XML tags and just display the data in between. I'm doing all this from the command line and was wondering if there is a better way than piping it into awk twice...

cat file.xml | awk -F'>' '{print $2}' | awk -F'<' '{print $1}'

Ideally, I would like to do this in one command.

Answered by Johnsyweb

If your file looks just like that, then sed can help you:

sed -e 's/<[^>]*>//g' file.xml

Of course, you should not use regular expressions for parsing XML, because it's hard.
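
As a quick check of the sed command above, here is a minimal sketch that feeds it the two sample lines from the question on standard input. The function name strip_tags is made up for illustration and is not part of the original answer.

```shell
# strip_tags is a hypothetical helper name, not from the original answer.
# It deletes every <...> run from each line of standard input.
strip_tags() {
    sed -e 's/<[^>]*>//g'
}

# Sample input taken from the question.
printf '%s\n' '<tag>data</tag>' '<tag>more data</tag>' | strip_tags
```

This should print data and more data on separate lines. Because sed processes its input one line at a time, a tag that is split across several lines is not matched by this expression.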

Answered by dogbane

Using awk:

awk '{gsub(/<[^>]*>/,"")};1' file.xml
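
In that one-liner the trailing 1 is an always-true pattern with no action, so awk prints each line after gsub has edited it. A minimal sketch of the same substitution with the print written out explicitly (the function name strip_tags_awk is hypothetical, and the sample input comes from the question):

```shell
# strip_tags_awk is a hypothetical helper name used only for this sketch.
# gsub edits the current line ($0) in place; print then emits the result.
strip_tags_awk() {
    awk '{ gsub(/<[^>]*>/, ""); print }' "$@"
}

# With no file argument, awk reads standard input.
printf '%s\n' '<tag>data</tag>' '<tag>more data</tag>' | strip_tags_awk
```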

Answered by Paused until further notice.

Give this a try:

grep -Po '<.*?>\K.*?(?=<.*?>)' inputfile

Explanation:

Using Perl Compatible Regular Expressions (-P) and outputting only the specified matches (-o):

  • <.*?> - Non-greedy match of any characters within angle brackets
  • \K - Don't include the preceding match in the output (reset match start - similar to positive look-behind, but it works with variable-length matches)
  • .*? - Non-greedy match stopping at the next match (this part will be output)
  • (?=<.*?>) - Non-greedy match of any characters within angle brackets and don't include the match in the output (positive look-ahead - works with variable-length matches)
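
To see the pieces listed above working together, here is a small sketch run against the sample lines from the question. It assumes GNU grep built with PCRE support, since -P is not available in every grep. The function name between_tags is invented for this example.

```shell
# Assumption: GNU grep with PCRE support (-P) is installed.
# between_tags is a hypothetical helper name, not from the original answer.
# -o prints only the matched text, which \K trims down to what follows the opening tag.
between_tags() {
    grep -Po '<.*?>\K.*?(?=<.*?>)' "$@"
}

# Sample input taken from the question.
printf '%s\n' '<tag>data</tag>' '<tag>more data</tag>' | between_tags
```

On each sample line, only the text between the opening and closing tag is left after \K and the look-ahead have done their work, so the tags themselves never reach the output.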

Answered by kenorb

Use the html2text command-line tool, which converts HTML into plain text.

Alternatively, you may try the ex way:

ex -s +'%s/<[^>].\{-}>//ge' +%p +q! file.txt

or:

cat file.txt | ex -s +'%s/<[^>].\{-}>//ge' +%p +q! /dev/stdin

Answered by SielaQ

I know this is not a "perlgolf contest", but I used to use this trick.

Set the record separator to < or >, then print only the odd-numbered records:

awk -vRS='<|>' NR%2 file.xml
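
Splitting on < or > makes the records alternate between text and tag names, which is why the odd-numbered records carry the data. A minimal sketch of a variation follows. It assumes an awk that treats a multi-character RS as a regular expression (gawk and mawk do). The extra NF test is an addition of this sketch, not part of the original answer; it drops the empty and newline-only records that would otherwise print as blank lines. The function name odd_records is hypothetical.

```shell
# Assumption: an awk that accepts a regular expression in RS (e.g. gawk, mawk).
# odd_records is a hypothetical helper name used only for this sketch.
# NR % 2 keeps odd-numbered records; NF skips records with no fields.
odd_records() {
    awk -v RS='<|>' 'NR % 2 && NF' "$@"
}

# Sample input taken from the question.
printf '%s\n' '<tag>data</tag>' '<tag>more data</tag>' | odd_records
```

The parity trick relies on every < being paired with a >, as in the flat file described in the question.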