bash: remove EOLs and spaces between two tags in an XML file

Question

Asked by Tony Morris

I have a lot of XML files (more than 50), each with some lines (more than 30 to 40 per file) that are incorrectly formatted:

For example, this:

<TAG1>
    <TAG_TO_FORMAT>
           a_random_string

    </TAG_TO_FORMAT>
    <AN_OTHER_TAG_TO_FORMAT>
                       an_other_random_string
    </AN_OTHER_TAG_TO_FORMAT>
    <OTHER_TAG>pifpafpouf</OTHER_TAG>

</TAG1>

should be transformed into this:

<TAG1>
    <TAG_TO_FORMAT>a_random_string</TAG_TO_FORMAT>
    <AN_OTHER_TAG_TO_FORMAT>an_other_random_string</AN_OTHER_TAG_TO_FORMAT>
    <OTHER_TAG>pifpafpouf</OTHER_TAG>

</TAG1>

It doesn't matter if the newline before </TAG1> is still present. My key problem is that each <TAG>random_string</TAG> pattern must be on one line (the random_string itself does not contain an EOL).

I couldn't find any tool in bash that lets me do this, so how could I do it in bash? (Or maybe in Python, but I would prefer bash.)

Answer 1

Answered by Magnus

There are command-line tools such as xmllint and tidy that can be used like this:

tidy -xml -iq somefile.xml

In theory xmllint can also do it, but xmllint doesn't work as described for me on OS X (I don't have a Linux instance handy to test on at the moment):

xmllint --noblanks somefile.xml

Answer 2

Answered by Brian Chrisman

Tidy does a reasonable job. Another option is to apply an XSLT transform that calls normalize-space():

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml"/>
<!-- Identity transform: copy every attribute and node unchanged -->
<xsl:template match="@*|node()|/">
    <xsl:copy>
        <xsl:apply-templates select="@*|node()">
            <xsl:sort select="@kname"/>
        </xsl:apply-templates>
    </xsl:copy>
</xsl:template>
<!-- Text nodes: trim and collapse whitespace; "." is the current text node -->
<xsl:template match="text()">
    <xsl:value-of select="normalize-space(.)"/>
</xsl:template>
</xsl:stylesheet>

I'd save that into a file and run it from the command line:

xsltproc normalize-space.xsl file.xml

or in a pipeline:

run_some_command | xsltproc normalize-space.xsl - | xmllint --format -

xmllint --noblanks does not necessarily treat all the whitespace I want removed as 'ignorable'. That is almost certainly technically correct, but it is not what I want.
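For comparison, the same idea (let a parser find the text nodes, then trim them) can be sketched in Python with the standard-library xml.etree.ElementTree module. This is only an illustration and not part of the answer above: the function name strip_leaf_text and the sample document are made up here. It trims only elements without children, so the indentation between sibling tags is kept. As far as I know, ElementTree's default parser also drops comments and the XML declaration when re-serializing, which may or may not matter for your files.

```python
import xml.etree.ElementTree as ET


def strip_leaf_text(xml_source: str) -> str:
    """Trim leading/trailing whitespace from the text of childless elements."""
    root = ET.fromstring(xml_source)
    for elem in root.iter():
        # Only touch leaf elements; whitespace between sibling tags is left alone.
        if len(elem) == 0 and elem.text is not None:
            elem.text = elem.text.strip()
    return ET.tostring(root, encoding="unicode")


# Hypothetical sample modelled on the question's input.
sample = """<TAG1>
    <TAG_TO_FORMAT>
           a_random_string

    </TAG_TO_FORMAT>
    <OTHER_TAG>pifpafpouf</OTHER_TAG>
</TAG1>"""

result = strip_leaf_text(sample)
print(result)
```

Unlike normalize-space(), str.strip() leaves runs of whitespace inside the text untouched, which is enough here because the random_string contains no EOL.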

Answer 3

Answered by Ansgar Wiechers

I'd recommend Perl for this kind of task.

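#!/usr/bin/env perl

use strict;
use warnings;

# Slurp the whole input so a match can span several lines.
my $text = join "", <>;
# Trim whitespace around text that sits directly before a closing tag.
# [^<>] keeps the match inside one element; $1 puts the text back.
$text =~ s/>\s*([^<>\s](?:[^<>]*[^<>\s])?)\s*<\//>$1<\//g;
print $text;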

Call it like this:

perl my.pl < input.xml > output.xml
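Since the question mentions Python as an alternative, here is a rough Python sketch of the same regex idea. The names collapse_tag_text and rewrite_files are invented for this example, and rewriting files in place is an assumption about what you want, so try it on copies first. The pattern only matches text that contains no '<' or '>', so a match cannot run across nested tags.

```python
import re
from pathlib import Path

# '>' + optional whitespace + text without '<' or '>' + optional whitespace + '</'
PATTERN = re.compile(r">\s*([^<>\s](?:[^<>]*[^<>\s])?)\s*</")


def collapse_tag_text(text: str) -> str:
    """Remove whitespace and newlines around text that precedes a closing tag."""
    return PATTERN.sub(r">\1</", text)


def rewrite_files(paths):
    """Apply collapse_tag_text to each file in place (hypothetical helper)."""
    for path in paths:
        path = Path(path)
        path.write_text(collapse_tag_text(path.read_text()))


# Hypothetical sample modelled on the question's input.
sample = (
    "<TAG1>\n"
    "    <TAG_TO_FORMAT>\n"
    "           a_random_string\n"
    "\n"
    "    </TAG_TO_FORMAT>\n"
    "    <OTHER_TAG>pifpafpouf</OTHER_TAG>\n"
    "\n"
    "</TAG1>\n"
)
collapsed = collapse_tag_text(sample)
print(collapsed)
```

For the 50-odd files, something like rewrite_files(Path(".").glob("*.xml")) would process every XML file in the current directory.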

Answer 4

Answered by William

Well, you can do it in sed:

x='TAG_TO_FORMAT'
sed -e '/<'"$x"'>/{:next;/<\/'"$x"'>/!{N;bnext;};s/\n//g;s/>\s*/>/;s/\(\S\)\s*</\1</;}'

When a line contains the opening tag, we go into a loop collecting lines until we find the closing tag. Then we erase all the newlines and clean up the spaces anchored by > on one side and < on the other; the \(\S\) group puts back the last character of the string, which the match would otherwise swallow. Note that \s and \S are GNU sed extensions.
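If the sed one-liner is hard to read, the same collect-then-squeeze loop can be written out in Python. This is just a sketch of the logic described above, not a drop-in replacement: join_tag_lines is a made-up name, and it assumes the tag has no attributes and is not nested inside itself.

```python
def join_tag_lines(lines, tag):
    """Join the lines from <tag> to </tag> and trim the text between them."""
    open_tag, close_tag = f"<{tag}>", f"</{tag}>"
    out, buffer = [], None
    for line in lines:
        if buffer is None and open_tag not in line:
            out.append(line)              # outside the element: pass through
            continue
        buffer = (buffer or []) + [line]  # inside the element: keep collecting
        if close_tag in line:
            joined = "".join(buffer)      # same effect as sed's s/\n//g
            head, _, rest = joined.partition(open_tag)
            body, _, tail = rest.partition(close_tag)
            out.append(f"{head}{open_tag}{body.strip()}{close_tag}{tail}")
            buffer = None
    if buffer is not None:
        out.extend(buffer)                # closing tag never found: emit as-is
    return out


# Hypothetical sample modelled on the question's input.
sample_lines = [
    "<TAG1>",
    "    <TAG_TO_FORMAT>",
    "           a_random_string",
    "",
    "    </TAG_TO_FORMAT>",
    "    <OTHER_TAG>pifpafpouf</OTHER_TAG>",
    "</TAG1>",
]
joined_lines = join_tag_lines(sample_lines, "TAG_TO_FORMAT")
print("\n".join(joined_lines))
```

Like the sed version, this handles one tag name per pass, so it would need to be run once for each tag you want to fix.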
