Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow. Original question: http://stackoverflow.com/questions/6836730/

Time: 2020-10-30 17:30:37  Source: igfitidea

java: remove cdata tag from xml

Tags: java, regex, xslt, xpath, cdata

Asked by SandyBr

XPath is nice for parsing XML files, but it's not working for data inside the CDATA tag:

<![CDATA[ Some Text <p>more text and tags</p>... ]]>

My solution: get the content of the XML first and remove "<![CDATA[" and "]]>".

After that I would run xpath "to reach everything" from the xml file. Is there a better solution? If not, how can I do it with a regular expression?

Answered by Paŭlo Ebermann

The reason for the CDATA tags there is that everything inside them is pure text, nothing which should be interpreted directly as XML. You could write your document fragment in the question alternatively as

 Some Text &lt;p&gt;more text and tags&lt;/p&gt;... 

(with a leading and trailing space).

If you really want to interpret this as XML, extract the text from your document, and submit it to an XML parser again.
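A minimal sketch of that "extract and reparse" idea in Java, using only the JDK's built-in DOM and XPath APIs. The class name, the `<item>` element, and the `<fragment>` wrapper are assumptions made for illustration; the wrapper is only there so that the extracted text has a single root element when it is parsed again. This only works when the CDATA content is itself well-formed XML.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class CdataReparse {

    // Parses an XML string into a DOM document.
    static Document parse(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    // Reads the text selected by textPath (CDATA content comes back as plain text),
    // wraps it in a hypothetical <fragment> root, parses it again, and evaluates
    // innerPath against the reparsed fragment.
    static String queryInsideCdata(String xml, String textPath, String innerPath) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        String text = xpath.evaluate(textPath, parse(xml));
        Document inner = parse("<fragment>" + text + "</fragment>");
        return xpath.evaluate(innerPath, inner);
    }

    public static void main(String[] args) throws Exception {
        String xml = "<root><item><![CDATA[ Some Text <p>more text and tags</p>... ]]></item></root>";
        // Prints the text of the <p> element that was hidden inside the CDATA section.
        System.out.println(queryInsideCdata(xml, "/root/item", "/fragment/p"));
    }
}
```

The original document is never modified here; only the extracted text is parsed a second time, so the rest of the file keeps its CDATA sections.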

Answered by Alberto

I needed to accomplish the same task. I solved it with two XSLT stylesheets.

Just let me stress that this will only work if the CDATA is well-formed XML.

To be complete, let me add to your example xml a root element:

<root>
   <well-formed-content><![CDATA[ Some Text <p>more text and tags</p>]]>
   </well-formed-content>
</root>

Fig 1.- Starting xml



First step

In the first transformation step, I wrapped all text nodes in a newly introduced XML element, old_text:

<?xml version="1.0" encoding="UTF-8" ?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

    <xsl:output method="xml" indent="no" version="1.0"
    encoding="UTF-8" standalone="yes" />

    <xsl:template match="*">
        <xsl:copy>
            <xsl:apply-templates select="*|text()|@*|comment()|processing-instruction()" />
        </xsl:copy>
    </xsl:template>

    <!-- Attribute-nodes and comment-nodes: Pass through without modifying -->
    <xsl:template match="@*|comment()|processing-instruction()">
        <xsl:copy-of select="." />
    </xsl:template>

    <!-- Text-nodes: Wrap them in a new node without escaping it. -->
    <!-- (note precondition: CDATA should be valid xml.           -->
    <xsl:template match="text()">
        <xsl:element name="old_text">
            <xsl:value-of select="." disable-output-escaping="yes" />
        </xsl:element>
    </xsl:template>

</xsl:stylesheet>

Fig 2.- First xslt (wrapping CDATA in "old_text" elements)

If you apply this transformation to the starting xml this is what you get (I'm not reformatting it to avoid confusion about who does what):

<?xml version="1.0" encoding="UTF-8" standalone="yes"?><root><old_text>
    </old_text><well-formed-content><old_text> Some Text <p>more text and tags</p>
    </old_text></well-formed-content><old_text>
</old_text></root>

Fig 3.- Transformed xml (first step)



Second step

You now need to clean up the introduced old_text elements, and re-escape the text that didn't create new nodes:

<?xml version="1.0" encoding="UTF-8" ?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

    <xsl:output method="xml" indent="no" version="1.0"
    encoding="UTF-8" standalone="yes" />

    <!-- Element-nodes: Process nodes and their children -->
    <xsl:template match="*">
        <xsl:copy>
            <xsl:apply-templates select="*|text()|@*|comment()" />
        </xsl:copy>
    </xsl:template>

    <!-- Attribute-nodes and comment-nodes: Pass through without modifying -->
    <xsl:template match="@*|comment()">
        <xsl:copy-of select="." />
    </xsl:template>

    <!--
        'Wrapper'-node: remove the wrapper element but process its children.
        With this matcher, the "old_text" is cleaned, but the originally CDATA
        well-formed nodes surface in the resulting xml.
    -->
    <xsl:template match="old_text">
        <xsl:apply-templates select="*|text()" />
    </xsl:template>

    <!--
        Text-nodes: Text here comes from original CDATA and must be now
        escaped. Note that the previous rule has extracted all the existing
        nodes in the CDATA. -->
    <xsl:template match="text()">
        <xsl:value-of select="." disable-output-escaping="no" />
    </xsl:template>

</xsl:stylesheet>

Fig 4.- 2nd xslt (cleaned-up artificially-introduced elements)



Result

This is the final result, with the nodes that were originally in CDATA expanded in your new XML file:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?><root>
    <well-formed-content> Some Text <p>more text and tags</p>
    </well-formed-content>
</root>

Fig 5.- Final xml
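The two steps can be chained from Java with the JDK's `javax.xml.transform` API. The following is a sketch under assumptions, not the author's own driver code: the class and method names are made up for illustration, and the stylesheets are passed in as strings. The intermediate result is serialized to a string rather than handed over as a DOM tree, because `disable-output-escaping` only takes effect when the output is serialized.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

final class TwoPassCdataExpander {

    // Applies one stylesheet to an XML string and returns the serialized output.
    static String transform(String xml, String xslt) throws TransformerException {
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(xslt)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }

    // First pass: wrap text nodes and write them unescaped (as in Fig 2).
    // Second pass: reparse the serialized output and drop the wrappers (as in Fig 4).
    static String expand(String xml, String firstXslt, String secondXslt)
            throws TransformerException {
        String intermediate = transform(xml, firstXslt);
        return transform(intermediate, secondXslt);
    }
}
```

Calling `TwoPassCdataExpander.expand(xml, firstXslt, secondXslt)` with the starting XML and the two stylesheets should give a result like Fig 5. If the CDATA content is not well-formed, the second call fails with a `TransformerException` when it tries to parse the intermediate string.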



Caveat

If your CDATA contains HTML character entities that XML does not support (see the Wikipedia article on character entities for examples), you need to declare those entities in your intermediate XML. Let me show this with an example:

<root>
    <well-formed-content>
        <![CDATA[ Some Text <p>more text and tags</p>,
        now with a non-breaking-space before the stop:&nbsp;.]]>
    </well-formed-content>
</root>

Fig 6.- Added character entity &nbsp; to the xml in Fig 1

The original xslt from Fig 2 will convert the xml into this:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?><root><old_text>
    </old_text><well-formed-content><old_text>
        Some Text <p>more text and tags</p>,
        now with a non-breaking-space before the stop:&nbsp;.
    </old_text></well-formed-content><old_text>
</old_text></root>

Fig 7.- Result of a first try to convert the xml in Fig 6 (Not well-formed!)

The problem with this file is that it is not well-formed and thus cannot be processed further with an XSLT processor:

The entity "nbsp" was referenced, but not declared.
XML checking finished.

Fig 8.- Result of the well-formedness checking for the xml in Fig 7

This workaround does the trick (the match="/" template adds the &nbsp; entity):

<?xml version="1.0" encoding="UTF-8" ?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

    <xsl:output method="xml" indent="no" version="1.0"
                encoding="UTF-8" standalone="yes" />

    <!-- Add an html entity to the xml character entities declaration. -->
    <xsl:template match="/">
        <xsl:text disable-output-escaping="yes"><![CDATA[<!DOCTYPE root
[
    <!ENTITY nbsp "&#160;">
]>
]]>
        </xsl:text>
        <xsl:apply-templates select="*" />
    </xsl:template>

    <xsl:template match="*">
        <xsl:copy>
            <xsl:apply-templates select="*|text()|@*|comment()|processing-instruction()" />
        </xsl:copy>
    </xsl:template>

    <!-- Attribute-nodes and comment-nodes: Pass through without modifying -->
    <xsl:template match="@*|comment()|processing-instruction()">
        <xsl:copy-of select="." />
    </xsl:template>

    <!-- Text-nodes: Wrap them in a new node without escaping it. -->
    <!-- (note precondition: CDATA should be valid xml.           -->
    <xsl:template match="text()">
        <xsl:element name="old_text">
            <xsl:value-of select="." disable-output-escaping="yes" />
        </xsl:element>
    </xsl:template>

</xsl:stylesheet> 

Fig 9.- The xslt creates the entity declaration

Now, after applying this xslt to the Fig 6 source xml, this is the intermediate xml:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?><!DOCTYPE root
[
    <!ENTITY nbsp "&#160;">
]>

        <root><old_text>
    </old_text><well-formed-content><old_text>
        Some Text <p>more text and tags</p>,
        now with a non-breaking-space before the stop:&nbsp;.
    </old_text></well-formed-content><old_text>
</old_text></root>

Fig 10.- Intermediate xml (xml from Fig 3 plus entity declaration)

You can use the xslt transformation from Fig 4 to produce the final xml:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?><root>
    <well-formed-content>
        Some Text <p>more text and tags</p>,
        now with a non-breaking-space before the stop: .
    </well-formed-content>
</root>

Fig 11.- Final xml with html entities converted to UTF-8



Notes

For these examples I used the XSLT processor built into NetBeans 7.1.2 (com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl, the default JRE XSLT processor).

Disclaimer: I'm not an XML expert. I have the feeling that this should be even easier...

Answered by james.garriss

To strip the CDATA and keep the tags as tags, you could use XSLT.

Given this XML input:

<?xml version="1.0" encoding="ISO-8859-1"?>
<root>
    <child>Here is some text.</child>
    <child><![CDATA[Here is more text <p>with tags</p>.]]></child>
</root>

Using this XSLT:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs"
    version="2.0">

    <xsl:output method="xml" indent="yes"/>
    <xsl:strip-space elements="*"/>

    <xsl:template match="*">
        <xsl:copy>
            <xsl:apply-templates select="*" />
            <xsl:value-of select="text()" disable-output-escaping="yes"/>
        </xsl:copy>
    </xsl:template>

</xsl:stylesheet>

Will return the following XML:

<?xml version="1.0" encoding="UTF-8"?>
<root>
   <child>Here is some text.</child>
   <child>Here is more text <p>with tags</p>.</child>
</root>

(Tested with Saxon HE 9.3.0.5 in oXygen 12.2)

Then you could use XPath to extract the contents of the p element:

/root/child/p
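For illustration, that XPath expression could be evaluated from Java roughly as follows. This is a sketch using the JDK's `javax.xml.xpath` API; the class and method names are hypothetical, and the input is assumed to be the already transformed XML held in a string.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

final class ParagraphExtractor {

    // Returns the string value of the first node matched by /root/child/p,
    // or an empty string when nothing matches.
    static String firstParagraph(String xml) throws XPathExpressionException {
        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate("/root/child/p", new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws XPathExpressionException {
        String transformed = "<root>"
                + "<child>Here is some text.</child>"
                + "<child>Here is more text <p>with tags</p>.</child>"
                + "</root>";
        System.out.println(firstParagraph(transformed)); // prints "with tags"
    }
}
```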

Answered by mukesh.stackOverflow

You can definitely remove the CDATA markers from your XML with a regex.

For example:

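String s = "<sn><![CDATA[poctest]]></sn>";
// Backslashes are doubled because these regexes are written as Java string literals.
s = s.replaceAll("<!\\[CDATA\\[", "");  // remove the opening marker
s = s.replaceAll("\\]\\]>", "");        // remove the closing marker

Result will be:

<sn>poctest</sn>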

Please check if this solves your issue.

Answered by thomas.adamjak

Try this:

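import java.util.regex.Matcher;
import java.util.regex.Pattern;

public static String removeCDATA(String text) {
    StringBuilder result = new StringBuilder();
    // Backslashes are doubled because the regex is written as a Java string literal.
    // Positions not directly preceded by "<![CDATA[" give empty matches;
    // the position right after the marker matches the content.
    Pattern regex = Pattern.compile("(?<!(<!\\[CDATA\\[))|((.*)\\w+\\W)");
    Matcher regexMatcher = regex.matcher(text);
    while (regexMatcher.find()) {
        result.append(regexMatcher.group());
    }
    return result.toString();
}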

When I call this method with your test input <![CDATA[ Some Text <p>more text and tags</p>... ]]>, the method returns Some Text <p>more text and tags</p>

But I think a method without regular expressions will be more reliable. Something like this:

public static String removeCDATA(String text) {
    String s = text.trim();
    if (s.startsWith("<![CDATA[")) {
        s = s.substring(9);  // skip the 9 characters of "<![CDATA["
        int i = s.indexOf("]]>");
        if (i == -1) throw new IllegalStateException("argument starts with <![CDATA[ but cannot find pairing ]]>");
        s = s.substring(0, i);
    }
    return s;
}
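The method above handles a string that starts with a single CDATA section. As a possible extension (my own sketch, not part of the answer), the same index-based idea can be applied to every CDATA section in a larger string. The class and method names are hypothetical, and the result is only well-formed XML if each section's content is well-formed.

```java
final class CdataStripper {
    private static final String OPEN = "<![CDATA[";
    private static final String CLOSE = "]]>";

    // Removes the opening and closing markers of every CDATA section,
    // keeping the section content and all surrounding text unchanged.
    static String stripAll(String xml) {
        StringBuilder out = new StringBuilder(xml.length());
        int pos = 0;
        while (true) {
            int start = xml.indexOf(OPEN, pos);
            if (start == -1) {
                out.append(xml, pos, xml.length());  // no more sections: copy the rest
                return out.toString();
            }
            int end = xml.indexOf(CLOSE, start + OPEN.length());
            if (end == -1) {
                throw new IllegalArgumentException("unterminated CDATA section at index " + start);
            }
            out.append(xml, pos, start);                  // text before the section
            out.append(xml, start + OPEN.length(), end);  // section content without markers
            pos = end + CLOSE.length();
        }
    }

    public static void main(String[] args) {
        System.out.println(stripAll("<sn><![CDATA[poctest]]></sn>")); // prints <sn>poctest</sn>
    }
}
```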