Is PHP's htmlentities() sufficient for creating XML-safe values?

Note: this page is a translated copy of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep the same license and attribute it to the original authors (not me). Original Stack Overflow question: http://stackoverflow.com/questions/2822774/


Is htmlentities() sufficient for creating xml-safe values?

Tags: php, xml, xml-serialization

Asked by John Himmelman

I'm building an XML file from scratch and need to know whether htmlentities() converts every character that could potentially break an XML file (and possibly UTF-8 data).

The values will come from a Twitter/Flickr feed, so I need to be sure.

Answered by Jon

htmlentities() is not a guaranteed way to build legal XML.

Use htmlspecialchars() instead of htmlentities() if this is all you are worried about. If there are encoding mismatches between the representation of your data and the encoding of your XML document, htmlentities() may work around them or cover them up (and it will bloat your XML in doing so). I believe it's better to get your encodings consistent and just use htmlspecialchars().
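
A minimal sketch of the difference, assuming UTF-8 input and the dom extension; the sample string and the parses_as_xml() helper are made up for illustration. htmlentities() emits HTML named entities such as &eacute;, which an XML parser rejects unless a DTD declares them, while htmlspecialchars() only touches the markup characters:

```php
<?php
// Illustrative input (an assumption, not from the question): "Café & <b>bar</b>" as UTF-8 bytes.
$text = "Caf\xC3\xA9 & <b>bar</b>";

$viaEntities = htmlentities($text, ENT_QUOTES, 'UTF-8');
$viaSpecial  = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');

// Hypothetical helper: true when the string parses as a well-formed XML document.
function parses_as_xml(string $xml): bool {
    $previous = libxml_use_internal_errors(true); // collect errors instead of emitting warnings
    $doc = new DOMDocument();
    $ok = $doc->loadXML($xml);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $ok;
}

echo $viaEntities, "\n"; // contains &eacute;, an HTML-only named entity
echo $viaSpecial, "\n";  // keeps the accented letter as literal UTF-8

var_dump(parses_as_xml("<v>$viaEntities</v>")); // undeclared entity, so parsing fails
var_dump(parses_as_xml("<v>$viaSpecial</v>"));
```

The second string stays readable and parses without any DTD, which is the argument for keeping encodings consistent instead of entity-encoding everything.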

Also, be aware that if you put the return value of htmlspecialchars() inside XML attributes delimited with single quotes, you will need to pass the ENT_QUOTES flag so that any single quotes in your source string are encoded too. I suggest doing this anyway, as it makes your code immune to bugs caused by someone using single quotes for XML attributes in the future.
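
To illustrate the quoting issue, here is a small sketch with a made-up attribute value. ENT_XML1 is an addition that the answer does not mention; it only selects XML entity names (&apos;) for the escaped apostrophe:

```php
<?php
// Illustrative value containing both kinds of quotes.
$value = "Tom's \"quoted\" <tag> & more";

// ENT_COMPAT escapes double quotes only; the apostrophe stays raw.
$compat = htmlspecialchars($value, ENT_COMPAT, 'UTF-8');

// ENT_QUOTES escapes both quote characters.
$quotes = htmlspecialchars($value, ENT_QUOTES | ENT_XML1, 'UTF-8');

echo "<item title='$compat'/>\n"; // broken: the apostrophe in Tom's ends the attribute early
echo "<item title='$quotes'/>\n"; // safe inside single-quoted and double-quoted attributes
```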

Edit: To clarify:

htmlentities() will convert a number of non-ASCII characters (I assume this is what you mean by UTF-8 data) to entities (which are represented with just ASCII characters). However, it cannot do so for characters that have no corresponding entity, so it cannot guarantee that its return value consists only of ASCII characters. That's why I'm suggesting not to use it.

If encoding is a possible issue, handle it explicitly (e.g. with iconv()).

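
A sketch of handling the encoding explicitly, under the assumption (for illustration only) that the feed delivers ISO-8859-1 bytes while the XML file is UTF-8:

```php
<?php
// Assumed input: "Café & crème" encoded as ISO-8859-1.
$latin1 = "Caf\xE9 & cr\xE8me";

// Convert to the document encoding first...
$utf8 = iconv('ISO-8859-1', 'UTF-8', $latin1);
if ($utf8 === false) {
    throw new RuntimeException('Input could not be converted to UTF-8');
}

// ...then escape only the markup characters.
$safe = htmlspecialchars($utf8, ENT_QUOTES, 'UTF-8');
echo "<name>$safe</name>\n";
```

Converting first and escaping second keeps the two concerns separate, so a failed conversion is noticed instead of being hidden behind entities.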

Edit 2: Improved the answer, taking into account Josh Davis's comment below.

Answered by Gordon

DOMDocument::createTextNode() will automatically escape your content.

Example:


$dom = new DOMDocument;
$element = $dom->createElement('Element');
$element->appendChild(
    $dom->createTextNode('I am text with Ünicödé & HTML €ntities ©'));

$dom->appendChild($element);
echo $dom->saveXml();

Output:


<?xml version="1.0"?>
<Element>I am text with &#xDC;nic&#xF6;d&#xE9; &amp; HTML &#x20AC;ntities &#xA9;</Element>

When you set the encoding to UTF-8, e.g.

$dom->encoding = 'utf-8';

you'll still get escaped markup characters, while the non-ASCII characters are written literally:

<?xml version="1.0" encoding="utf-8"?>
<Element>I am text with Ünicödé &amp; HTML €ntities ©</Element>

Note that the above is not the same as passing the second argument $value to DOMDocument::createElement(). That method only makes sure your element names are valid. See the Notes on the manual page. For example,

$dom = new DOMDocument;
$element = $dom->createElement('Element', 'I am text with Ünicödé & HTML €ntities ©');
$dom->appendChild($element);
$dom->encoding = 'utf-8';
echo $dom->saveXml();

will result in a warning

Warning: DOMDocument::createElement(): unterminated entity reference  HTML €ntities ©

and the following output:


<?xml version="1.0" encoding="utf-8"?>
<Element>I am text with Ünicödé </Element>
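
One way to avoid the createElement() pitfall is to always route text through createTextNode(). The helper below is a sketch; append_text_element() is a hypothetical name, not part of the DOM extension:

```php
<?php
// Hypothetical helper: create <$name>, attach $text as an escaped text node,
// append the element to $parent and return it.
function append_text_element(DOMDocument $dom, DOMNode $parent, string $name, string $text): DOMElement {
    $element = $dom->createElement($name);               // name only, no $value argument
    $element->appendChild($dom->createTextNode($text));  // text is escaped on output
    $parent->appendChild($element);
    return $element;
}

$dom = new DOMDocument('1.0', 'UTF-8');
$root = $dom->appendChild($dom->createElement('feed'));
$title = append_text_element($dom, $root, 'title', 'Tom & Jerry <3');

echo $dom->saveXML($title), "\n";
```

Passing a node to saveXML() serializes just that element, which makes the escaping easy to inspect.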

Answered by Peter Krauss

Gordon's answer is good and explains the XML encoding problems, but it does not show a simple function (or what the black box does). Jon's answer starts well with the htmlspecialchars() recommendation, but he and others make some mistakes, so I will be emphatic.

A good programmer MUST control whether UTF-8 is used in strings and XML data: UTF-8 (or another non-ASCII encoding) IS SAFE in a consistent algorithm.

SAFE UTF-8 XML DOES NOT NEED FULL ENTITY ENCODING. Indiscriminate encoding produces "second-class, non-human-readable, encode/decode-demanding XML". Safe ASCII XML also needs no entity encoding when all your content is ASCII.

Only 3 or 4 characters need to be escaped in a string of XML content: >, <, &, and optionally ". Please read http://www.w3.org/TR/REC-xml/ sections "2.4 Character Data and Markup" and "4.6 Predefined Entities". THEN YOU can use htmlspecialchars().

For illustration, the following PHP functions make XML content safe:

// Didactic illustration only; in real code, use htmlspecialchars($s, $flags).
function xmlsafe($s, $intoQuotes = 0) {
    if ($intoQuotes) {
        // For double-quoted attribute values.
        // Equivalent to htmlspecialchars($s, ENT_COMPAT) for validly encoded input.
        return str_replace(array('&', '>', '<', '"'), array('&amp;', '&gt;', '&lt;', '&quot;'), $s);
    }
    // For element content.
    // Equivalent to htmlspecialchars($s, ENT_NOQUOTES) for validly encoded input.
    return str_replace(array('&', '>', '<'), array('&amp;', '&gt;', '&lt;'), $s);
}

// Example of safe XML construction
function xmlTag($element, $attribs, $contents = NULL) {
    $out = '<' . $element;
    foreach ($attribs as $name => $val) {
        // Attribute values are double-quoted, so '"' is escaped as well
        $out .= ' ' . $name . '="' . xmlsafe($val, 1) . '"';
    }
    if ($contents === '' || is_null($contents)) {
        $out .= '/>'; // empty element
    } else {
        $out .= '>' . xmlsafe($contents) . "</$element>";
    }
    return $out;
}
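
The same construction can be sketched on top of htmlspecialchars() itself. The names xml_escape() and xml_tag() are hypothetical, UTF-8 input is assumed, and ENT_XML1 is an addition not used in the answer. The round trip at the end checks that a parser hands back the original strings:

```php
<?php
// Hypothetical helper: escape for element content, or for a double-quoted attribute.
function xml_escape(string $s, bool $inAttribute = false): string {
    $flags = ($inAttribute ? ENT_COMPAT : ENT_NOQUOTES) | ENT_XML1;
    return htmlspecialchars($s, $flags, 'UTF-8');
}

// Hypothetical helper: build one element with escaped attributes and content.
function xml_tag(string $element, array $attribs = [], ?string $contents = null): string {
    $out = '<' . $element;
    foreach ($attribs as $name => $val) {
        $out .= ' ' . $name . '="' . xml_escape((string) $val, true) . '"';
    }
    if ($contents === null || $contents === '') {
        return $out . '/>';
    }
    return $out . '>' . xml_escape($contents) . "</$element>";
}

$xml = xml_tag('a', ['title' => 'Q "x" & <y>'], 'Tom & Jerry <3');
echo $xml, "\n";

// Round trip through a real parser.
$parsed = simplexml_load_string($xml);
var_dump((string) $parsed === 'Tom & Jerry <3');
var_dump((string) $parsed['title'] === 'Q "x" & <y>');
```

Element and attribute names are not validated here; that is assumed to be the caller's job.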

In a CDATA block you do not need to use this function... But please avoid the indiscriminate use of CDATA.

Answered by Josh Davis

So your question is "is the result of htmlentities() guaranteed to be XML-compliant and UTF-8-compliant?" The answer is no, it's not.

htmlspecialchars() should be enough to escape XML's special characters, but you'll have to sanitize your UTF-8 strings either way. Even if you build your XML with, say, SimpleXML, you'll have to sanitize the strings. I don't know about other libraries such as XMLWriter or DOM, but I think it's the same.
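
What "sanitize" means is not spelled out in the answer. One possible reading, sketched below with a hypothetical xml_clean_and_escape() helper, is to replace invalid UTF-8 byte sequences and to drop code points that the XML 1.0 Char production does not allow (most control characters, for example):

```php
<?php
// Hypothetical helper: escape markup characters, repair invalid UTF-8,
// and strip characters that XML 1.0 forbids.
function xml_clean_and_escape(string $s): string {
    // ENT_SUBSTITUTE turns invalid UTF-8 sequences into U+FFFD
    // instead of making htmlspecialchars() return an empty string.
    $escaped = htmlspecialchars($s, ENT_QUOTES | ENT_XML1 | ENT_SUBSTITUTE, 'UTF-8');

    // Keep only: tab, LF, CR, U+0020..U+D7FF, U+E000..U+FFFD, U+10000..U+10FFFF.
    $allowed = '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u';
    return preg_replace($allowed, '', $escaped) ?? '';
}

echo xml_clean_and_escape("a\x00b\x0Bc"), "\n";           // NUL and vertical tab are removed
echo xml_clean_and_escape("bad\xFFbyte & more"), "\n";    // the stray 0xFF byte becomes U+FFFD
```

Whether to drop, replace, or reject such characters is a design choice; silently dropping them is only one option.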

Answered by Cylosh

Thought I'd add this for those who need to sanitize without losing the XML attributes.

// Returns SimpleXML-safe XML, keeping the elements' attributes as well
function sanitizeXML($xml_content, $xml_followdepth=true){

    // \2 is a backreference to the tag name, so each element is matched up to its own closing tag
    if (preg_match_all('%<((\w+)\s?.*?)>(.+?)</\2>%si', $xml_content, $xmlElements, PREG_SET_ORDER)) {

        $xmlSafeContent = '';

        foreach($xmlElements as $xmlElem){
            $xmlSafeContent .= '<'.$xmlElem[1].'>';
            if (preg_match('%<((\w+)\s?.*?)>(.+?)</\2>%si', $xmlElem[3])) {
                // Nested elements: recurse without re-adding the XML declaration
                $xmlSafeContent .= sanitizeXML($xmlElem[3], false);
            }else{
                // Leaf text: escape &, < and >
                $xmlSafeContent .= htmlspecialchars($xmlElem[3],ENT_NOQUOTES);
            }
            $xmlSafeContent .= '</'.$xmlElem[2].'>';
        }

        if(!$xml_followdepth)
            return $xmlSafeContent;
        else
            return "<?xml version='1.0' encoding='UTF-8'?>".$xmlSafeContent;

    } else {
        return htmlspecialchars($xml_content,ENT_NOQUOTES);
    }

}

Usage:


$body = <<<EG
<?xml version='1.0' encoding='UTF-8'?>
<searchResult count="1">
   <item>
      <title>2016 & Au Rendez-Vous Des Enftheitroad&</title>
   </item>
</searchResult>
EG;
$newXml = sanitizeXML($body);
var_dump($newXml);

Returns:


<?xml version='1.0' encoding='UTF-8'?>
<searchResult count="1">
    <item>
        <title>2016 &amp; Au Rendez-Vous Des Enftheitroad&amp;</title>
    </item>
</searchResult>
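
If stray ampersands are the only problem in otherwise well-formed input, a narrower alternative is to escape just the "&" characters that do not already start an entity. This is a different technique from sanitizeXML() above, sketched here with a hypothetical escape_bare_ampersands() helper:

```php
<?php
// Hypothetical helper: escape only "&" characters that do not already begin a
// predefined XML entity or a numeric character reference.
function escape_bare_ampersands(string $xml): string {
    $pattern = '/&(?!(?:amp|lt|gt|quot|apos|#[0-9]+|#x[0-9A-Fa-f]+);)/';
    return preg_replace($pattern, '&amp;', $xml) ?? $xml;
}

$broken = '<title>2016 & Au Rendez-Vous &amp; more&</title>';
$fixed  = escape_bare_ampersands($broken);
echo $fixed, "\n";

// The repaired string should now parse.
$parsed = simplexml_load_string($fixed);
var_dump((string) $parsed);
```

This leaves all markup untouched, so attributes and whitespace survive, but it does not repair stray "<" characters and it would also rewrite ampersands inside CDATA sections.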