How do I remove XML tags from the Unix command line?

Question

Asked by Tarski

I am grepping an XML File, which gives me output like this:

<tag>data</tag>
<tag>more data</tag>
...

Note, this is a flat file, not an XML tree. I want to remove the XML tags and just display the data in between. I'm doing all this from the command line and was wondering if there is a better way than piping it into awk twice...

cat file.xml | awk -F'>' '{print $2}' | awk -F'<' '{print $1}'

Ideally, I would like to do this in one command

Answer 1

Answered by Johnsyweb

If your file looks just like that, then sed can help you:

sed -e 's/<[^>]*>//g' file.xml

Of course, you should not use regular expressions to parse XML in general, because it's hard.

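As a quick check of the sed substitution above, here is a minimal sketch that feeds it the two sample lines from the question. The function name strip_tags is only illustrative, and the sketch assumes every tag opens and closes on the same line; it does not decode entities such as &amp;.

```shell
# Hypothetical helper: remove anything that looks like <...> from stdin.
strip_tags() {
  # '<[^>]*>' matches a '<', any run of characters other than '>', then '>'.
  # The 'g' flag removes every tag on the line, not just the first one.
  sed -e 's/<[^>]*>//g'
}

# Sample input taken from the question.
printf '%s\n' '<tag>data</tag>' '<tag>more data</tag>' | strip_tags
# prints:
# data
# more data
```

Because the function reads standard input, it can sit at the end of an existing grep pipeline instead of reading file.xml directly.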

Answer 2

Answered by dogbane

Using awk:

awk '{gsub(/<[^>]*>/,"")};1' file.xml

Answer 3

Answered by Paused until further notice.

Give this a try:

grep -Po '<.*?>\K.*?(?=<.*?>)' inputfile

Explanation:

Using Perl Compatible Regular Expressions (-P) and outputting only the specified matches (-o):

<.*?> - Non-greedy match of any characters within angle brackets
\K - Don't include the preceding match in the output (reset match start - similar to positive look-behind, but it works with variable-length matches)
.*? - Non-greedy match stopping at the next match (this part will be output)
(?=<.*?>) - Non-greedy match of any characters within angle brackets and don't include the match in the output (positive look-ahead - works with variable-length matches)

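To see the pieces of that pattern working together, here is a small sketch run against the sample lines from the question. It assumes GNU grep built with PCRE support, since -P is not available in every grep; the function name extract_text is only illustrative.

```shell
# Hypothetical helper: print only the text found between two tags.
extract_text() {
  # <.*?>      consume the opening tag
  # \K         discard what was matched so far from the reported match
  # .*?        the text to print, as short as possible
  # (?=<.*?>)  require a following tag without consuming it
  grep -Po '<.*?>\K.*?(?=<.*?>)'
}

printf '%s\n' '<tag>data</tag>' '<tag>more data</tag>' | extract_text
# prints:
# data
# more data
```

Unlike the substitution-based answers, this prints only text that sits between two tags, so a line without any tags produces no output at all.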

Answer 4

Answered by kenorb

Use the html2text command-line tool, which converts HTML into plain text.

Alternatively, you may try the ex way:

ex -s +'%s/<[^>].\{-}>//ge' +%p +q! file.txt

or:

cat file.txt | ex -s +'%s/<[^>].\{-}>//ge' +%p +q! /dev/stdin

Answer 5

Answered by SielaQ

I know this is not a "perlgolf contest", but I used to use this trick.

Set the record separator to < or >, then print only the odd-numbered records:

awk -vRS='<|>' NR%2 file.xml
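
The record-separator trick can be tried on the sample lines from the question with the sketch below. It assumes an awk that treats a multi-character RS as a regular expression (gawk and mawk do; POSIX leaves this unspecified). The extra NF test is an addition that is not part of the original one-liner: the newline after each closing tag also lands in an odd-numbered record, so without the test the output contains blank lines. The function name text_records is only illustrative.

```shell
# Hypothetical helper: split stdin on '<' or '>' and keep the text records.
text_records() {
  # With RS='<|>' the records alternate: text, tag name, text, tag name, ...
  # NR % 2 keeps the odd-numbered (text) records.
  # NF drops records that contain only whitespace, such as the newline
  # between one closing tag and the next opening tag.
  awk -v RS='<|>' 'NR % 2 && NF'
}

printf '%s\n' '<tag>data</tag>' '<tag>more data</tag>' | text_records
# prints:
# data
# more data
```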
