Note: this page reproduces a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and authors, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/223652/

Date: 2020-09-06 12:09:54  Source: igfitidea

Is there a way to escape a CDATA end token in xml?

Tags: xml, escaping, cdata

Asked by Juan Pablo Califano

I was wondering if there is any way to escape a CDATA end token (]]>) within a CDATA section in an xml document. Or, more generally, if there is some escape sequence for using within a CDATA (but if it exists, I guess it'd probably only make sense to escape begin or end tokens, anyway).

Basically, can you have a begin or end token embedded in a CDATA and tell the parser not to interpret it, but to treat it as just another character sequence?

Probably, you should just refactor your xml structure or your code if you find yourself trying to do that, but even though I've been working with xml on a daily basis for the last 3 years or so and I have never had this problem, I was wondering if it was possible. Just out of curiosity.

Edit:

Other than using html encoding...

Answered by S.Lott

You have to break your data into pieces to conceal the ]]>.

Here's the whole thing:

<![CDATA[]]]]><![CDATA[>]]>

The first <![CDATA[]]]]> has the ]]. The second <![CDATA[>]]> has the >.
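One way to check this is to feed the construct to a parser. The sketch below uses Python's standard xml.etree.ElementTree; the wrapper element name r is only an illustrative assumption, not part of the answer.

```python
import xml.etree.ElementTree as ET

# Two adjacent CDATA sections: the first carries "]]", the second carries ">".
doc = "<r><![CDATA[]]]]><![CDATA[>]]></r>"

root = ET.fromstring(doc)
# The parser reports both sections as character data, joined into one string.
text = root.text
print(text)
```

The application reading the document sees a single text value and cannot tell that it was split across two sections.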

Answered by ddaa

Clearly, this question is purely academic. Fortunately, it has a very definite answer.

You cannot escape a CDATA end sequence. Production rule 20 of the XML specification is quite clear:

[20]    CData      ::=      (Char* - (Char* ']]>' Char*))

EDIT: This production rule literally means "A CData section may contain anything you want BUT the sequence ']]>'. No exception.".

EDIT2: The same section also reads:

Within a CDATA section, only the CDEnd string is recognized as markup, so that left angle brackets and ampersands may occur in their literal form; they need not (and cannot) be escaped using "&lt;" and "&amp;". CDATA sections cannot nest.

In other words, it's not possible to use entity reference, markup or any other form of interpreted syntax. The only parsed text inside a CDATA section is ]]>, and it terminates the section.

Hence, it is not possible to escape ]]> within a CDATA section.
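The "need not (and cannot)" part of the quote can be observed directly. This is a minimal sketch using Python's standard xml.etree.ElementTree; the element name r is an assumption made for the example.

```python
import xml.etree.ElementTree as ET

# Inside CDATA, "&lt;" and "&amp;" are plain characters, not entity references.
in_cdata = ET.fromstring("<r><![CDATA[&lt;&amp;]]></r>").text

# Outside CDATA, the same sequences are decoded by the parser.
in_text = ET.fromstring("<r>&lt;&amp;</r>").text

print(repr(in_cdata))
print(repr(in_text))
```

Because nothing inside the section is interpreted, there is no syntax left over that could serve as an escape for ]]>.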

EDIT3: The same section also reads:

2.7 CDATA Sections

[Definition: CDATA sections may occur anywhere character data may occur; they are used to escape blocks of text containing characters which would otherwise be recognized as markup. CDATA sections begin with the string "<![CDATA[" and end with the string "]]>":]

So there may be a CDATA section anywhere character data may occur, including multiple adjacent CDATA sections in place of a single CDATA section. That makes it possible to split the ]]> token and put its two parts in adjacent CDATA sections.

ex:

<![CDATA[Certain tokens like ]]> can be difficult and <invalid>]]> 

should be written as

<![CDATA[Certain tokens like ]]]]><![CDATA[> can be difficult and <valid>]]> 
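Both forms can be run through a parser to see the difference. The following is a sketch with Python's standard xml.etree.ElementTree; wrapping the sections in an element named r is an assumption, since a CDATA section cannot stand alone as a document.

```python
import xml.etree.ElementTree as ET

bad = "<r><![CDATA[Certain tokens like ]]> can be difficult and <invalid>]]></r>"
good = "<r><![CDATA[Certain tokens like ]]]]><![CDATA[> can be difficult and <valid>]]></r>"

# The first "]]>" closes the section early, so the rest is parsed as markup.
try:
    ET.fromstring(bad)
    bad_parsed = True
except ET.ParseError:
    bad_parsed = False

# The split form parses, and the text comes back with "]]>" intact.
good_text = ET.fromstring(good).text
print(bad_parsed)
print(good_text)
```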

Answered by Jason Pyeron

You do not escape the ]]>, but you escape the > after ]] by inserting ]]><![CDATA[ before the >. Think of this just like a \ in a C/Java/PHP/Perl string, but only needed before a > and after a ]].

BTW, S.Lott's answer is the same as this, just worded differently.

Answered by Robert Rossney

S. Lott's answer is right: you don't encode the end tag, you break it across multiple CDATA sections.

How to run across this problem in the real world: using an XML editor to create an XML document that will be fed into a content-management system, try to write an article about CDATA sections. Your ordinary trick of embedding code samples in a CDATA section will fail you here. You can imagine how I learned this.

But under most circumstances, you won't encounter this, and here's why: if you want to store (say) the text of an XML document as the content of an XML element, you'll probably use a DOM method, e.g.:

XmlElement elm = doc.CreateElement("foo");
elm.InnerText = "<![CDATA[Is this a problem?]]>";

And the DOM quite reasonably escapes the < and the >, which means that you haven't inadvertently embedded a CDATA section in your document.
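The same effect can be seen outside .NET. The sketch below is an analogue in Python's standard xml.dom.minidom, not the .NET API from the answer; it only shows that a DOM text node is serialized with its markup characters escaped.

```python
from xml.dom.minidom import Document

doc = Document()
elm = doc.createElement("foo")
doc.appendChild(elm)

# A text node, not a CDATA node: the serializer escapes the markup characters.
elm.appendChild(doc.createTextNode("<![CDATA[Is this a problem?]]>"))

xml_out = doc.documentElement.toxml()
print(xml_out)
```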

Oh, and this is interesting:

XmlDocument doc = new XmlDocument();

XmlElement elm = doc.CreateElement("doc");
doc.AppendChild(elm);

string data = "<![CDATA[This is an embedded CDATA section]]>";
XmlCDataSection cdata = doc.CreateCDataSection(data);
elm.AppendChild(cdata);

This is probably an idiosyncrasy of the .NET DOM, but that doesn't throw an exception. The exception gets thrown here:

Console.Write(doc.OuterXml);

I'd guess that what's happening under the hood is that the XmlDocument is using an XmlWriter to produce its output, and the XmlWriter checks for well-formedness as it writes.
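Python's standard xml.dom.minidom happens to show a similar split between building and writing. This is only an analogy and says nothing about how XmlDocument is implemented: creating the CDATA node succeeds, and the check happens at serialization time.

```python
from xml.dom.minidom import Document

doc = Document()
elm = doc.createElement("doc")
doc.appendChild(elm)

data = "<![CDATA[This is an embedded CDATA section]]>"
# Creating and attaching the node succeeds without any validation.
elm.appendChild(doc.createCDATASection(data))

try:
    doc.toxml()
    error = None
except ValueError as exc:
    # minidom refuses to write a CDATA section whose data contains "]]>".
    error = str(exc)
print(error)
```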

Answered by Thomas Grainger

Simply replace ]]> with ]]]]><![CDATA[>
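As a minimal sketch of that replacement, assuming the input contains only characters that are legal in XML, a wrapper could look like this in Python. The function name wrap_cdata and the element name r are hypothetical.

```python
import xml.etree.ElementTree as ET


def wrap_cdata(text: str) -> str:
    """Wrap text in CDATA, splitting every "]]>" across two sections."""
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"


# Sample payload with two occurrences of the end token.
payload = "if (a[b[0]]>1) { /* ]]> */ }"
doc = "<r>" + wrap_cdata(payload) + "</r>"

# Parsing the result gives back the original payload unchanged.
round_trip = ET.fromstring(doc).text
print(round_trip)
```

No matching decode step is needed on the reading side, because the parser already joins adjacent sections.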

Answered by Shawn Becker

Here's another case in which ]]> needs to be escaped. Suppose we need to save a perfectly valid HTML document inside a CDATA block of an XML document and the HTML source happens to have its own CDATA block. For example:

<htmlSource><![CDATA[ 
    ... html ...
    <script type="text/javascript">
        /* <![CDATA[ */
        -- some working javascript --
        /* ]]> */
    </script>
    ... html ...
]]></htmlSource>

The commented CDATA suffix needs to be changed to:

        /* ]]]]><![CDATA[> */

since an XML parser isn't going to know how to handle JavaScript comment blocks.

Answered by user2194495

In PHP: '<![CDATA['.implode(']]]]><![CDATA[>', explode(']]>', $string)).']]>'

Answered by Alain Tiemblo

A cleaner way in PHP:

   function safeCData($string)
   {
      return '<![CDATA[' . str_replace(']]>', ']]]]><![CDATA[>', $string) . ']]>';
   }

Don't forget to use a multibyte-safe str_replace if required (non latin1 $string):

   function mb_str_replace($search, $replace, $subject, &$count = 0)
   {
      if (!is_array($subject))
      {
         $searches = is_array($search) ? array_values($search) : array ($search);
         $replacements = is_array($replace) ? array_values($replace) : array ($replace);
         $replacements = array_pad($replacements, count($searches), '');
         foreach ($searches as $key => $search)
         {
            $parts = mb_split(preg_quote($search), $subject);
            $count += count($parts) - 1;
            $subject = implode($replacements[$key], $parts);
         }
      }
      else
      {
         foreach ($subject as $key => $value)
         {
            $subject[$key] = mb_str_replace($search, $replace, $value, $count);
         }
      }
      return $subject;
   }

Answered by honzar

I don't think that interrupting CDATA is a good way to go. Here is my alternative...

Use ] as an escape character, followed by the hex value of your character and a semicolon, by analogy with &#xhhhh; => ]<unicode value>;

This way, if you try to record ]]>, your encode function will produce ]005D;]005D;]003E;, which is fine in CDATA.

It's better than escaping by entity name, because those are not decoded every time in your app and you may have different priorities for escaping entities with ampersand vs escaping some other chars/sequences. As a result you have more control over the content of CDATA.
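A rough sketch of such an encode/decode pair follows, in Python. The answer does not say which characters to encode; this version assumes every ] and > is encoded (encoding ] alone would already prevent ]]>), and the function names encode and decode are hypothetical.

```python
import re


def encode(text: str) -> str:
    """Replace every "]" and ">" with "]XXXX;" so "]]>" can never appear."""
    return re.sub(r"[\]>]", lambda m: "]%04X;" % ord(m.group()), text)


def decode(text: str) -> str:
    """Reverse encode(): in encoded text, every "]" starts an escape."""
    return re.sub(r"\]([0-9A-F]{4});", lambda m: chr(int(m.group(1), 16)), text)


encoded = encode("]]>")
print(encoded)
print(decode(encoded))
```

Unlike the split-section approach, this is an application-level convention: every consumer of the document has to know about it and call the decoder after parsing.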

Answered by Chad Kuehn

See this structure:

<![CDATA[
   <![CDATA[
      <div>Hello World</div>
   ]]]]><![CDATA[>
]]>

For the inner CDATA tag(s) you must close with ]]]]><![CDATA[> instead of ]]>. Simple as that.