PHP: How do I tell DOMDocument->load() what encoding I want it to use?

Note: this page reproduces a popular Stack Overflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL, and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/1269485/

Tags: php, xml, dom, domdocument

Question

I search for and process XML files from elsewhere, and need to transform them with some XSLTs. No problem. Using PHP5 and the DOM library, everything's a snap. Worked fine, up till now. Today, funky characters were in the XML file -- "smart" quotes from Word, it looks like. Anyways, DOMDocument->load complained about them, saying that they weren't UTF-8, and to specify the encoding.

Lo and behold, the encoding is not specified in these XML files. If I add in 'encoding="iso-8859-1"' to the header, it works fine. The rub is I have no control over these XML files.

Reading the file into a string, modifying its header and writing it back out to another location seems to be my only option, but I'd prefer to do it without having to use temporary copies of the XML files at all. Is there any way to simply tell the parser to parse them as if they were iso-8859-1?

Answer by nickf

Does this work for you?

$doc = new DOMDocument('1.0', 'iso-8859-1');
$doc->load($xmlPath);

Edit: Since it appears that this doesn't work, what you could do instead is similar to your existing method but without the temp file. Read the XML file from your source using standard IO operations (file_get_contents() or something), convert the string from ISO-8859-1 to UTF-8 (iconv() or mb_convert_encoding()), and then use loadXML().

// Read the raw bytes; no temporary copy of the file is needed.
$myXMLString = file_get_contents($xmlPath);
// Convert to UTF-8, which the parser assumes when no encoding is declared.
$myXMLString = iconv('ISO-8859-1', 'UTF-8', $myXMLString);
$doc = new DOMDocument();
$doc->loadXML($myXMLString);
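
The read, convert, then loadXML() idea can be wrapped in a small helper. This is only a sketch under stated assumptions: the function name parseLegacyXml() and the default of Windows-1252 (where Word's "smart" quotes usually come from) are illustrative choices, not something the question confirms, and the input is assumed to carry no encoding declaration of its own. It converts only when the bytes are not already valid UTF-8.

```php
<?php
// Hypothetical helper, not part of the original answer.
// Assumes the input has no encoding declaration of its own.
function parseLegacyXml(string $xml, string $assumedEncoding = 'Windows-1252'): DOMDocument
{
    // preg_match() with the /u flag fails on invalid UTF-8, so it doubles
    // as a validity check that needs no extra extension.
    if (preg_match('//u', $xml) !== 1) {
        $converted = iconv($assumedEncoding, 'UTF-8', $xml);
        if ($converted === false) {
            throw new RuntimeException("Input is not valid $assumedEncoding");
        }
        $xml = $converted;
    }

    $doc = new DOMDocument();
    if (!$doc->loadXML($xml)) {
        throw new RuntimeException('XML could not be parsed');
    }
    return $doc;
}

// Word-style "smart" quotes as single Windows-1252 bytes (0x93 and 0x94).
$doc = parseLegacyXml('<foo>' . chr(0x93) . 'quoted' . chr(0x94) . '</foo>');
echo $doc->saveXML($doc->documentElement), "\n";
```

With a real file, the string would come from file_get_contents($xmlPath) as in the answer. If the files are known to be strictly ISO-8859-1, pass that as the second argument instead.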

Answer by VolkerK

I haven't found a way to set the default encoding (yet), but maybe the recover mode is feasible in this case.
When libxml encounters an encoding error and no encoding has been explicitly set, it switches from unicode/utf8 to latin1 and continues parsing the document. But in the parser context the property wellFormed is set to 0/false. PHP's DOM extension considers the document valid if wellFormed is true or the DOMDocument object's attribute recover is true.

<?php
// German umlaut ä in latin1 = 0xE4
$xml = '<foo>'.chr(0xE4).'</foo>';

$doc = new DOMDocument;
$b = $doc->loadxml($xml);
echo 'with doc->recover=false(default) : ', ($b) ? 'success':'failed', "\n";

$doc = new DOMDocument;
$doc->recover = true;
$b = $doc->loadxml($xml);
echo 'with doc->recover=true : ', ($b) ? 'success':'failed', "\n";

prints

Warning: DOMDocument::loadXML(): Input is not proper UTF-8, indicate encoding !
Bytes: 0xE4 0x3C 0x2F 0x66 in Entity, line: 1 in test.php on line 6
with doc->recover=false(default) : failed

Warning: DOMDocument::loadXML(): Input is not proper UTF-8, indicate encoding !
Bytes: 0xE4 0x3C 0x2F 0x66 in Entity, line: 1 in  test.php on line 11
with doc->recover=true : success

You still get the warning message (which can be suppressed with @$doc->load()) and it will also show up in the internal libxml errors (only once, when the parser switches from utf8 to latin1). The error code for this particular error will be 9 (XML_ERR_INVALID_CHAR).

<?php
$xml = sprintf('<foo>
    <ae>%s</ae>
    <oe>%s</oe>
    &
</foo>', chr(0xE4),chr(0xF6));

libxml_use_internal_errors(true);
$doc = new DOMDocument;
$doc->recover = true;
libxml_clear_errors();
$b = $doc->loadxml($xml);
$invalidCharFound = false;
foreach(libxml_get_errors() as $error) {
    if ( 9==$error->code && !$invalidCharFound ) {
        $invalidCharFound = true;
        echo "found invalid char, possibly harmless\n";
    }
    else {
        echo "hm, that's probably more severe: ", $error->message, "\n";
    }
}

Answer by Alf Eaton

The only way to specify the encoding is in the XML declaration at the start of the file:

<?xml version="1.0" encoding="ISO-8859-1"?>
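
Since the question rules out editing the files, the declaration can be added to the string in memory instead, just before loadXML(). The following is a rough sketch, not a definitive implementation: withAssumedEncoding() is a made-up helper name, and ISO-8859-1 is the encoding the asker reported as working. It leaves an existing encoding declaration untouched, inserts the encoding right after the version when a declaration exists without one, and prepends a full declaration otherwise.

```php
<?php
// Hypothetical helper: makes sure the XML string declares an encoding.
function withAssumedEncoding(string $xml, string $encoding = 'ISO-8859-1'): string
{
    $pseudoAttr = ' encoding="' . $encoding . '"';

    // Is there already an XML declaration at the very start?
    if (preg_match('/^<\?xml\s[^>]*\?>/', $xml, $m)) {
        if (preg_match('/\sencoding\s*=/', $m[0])) {
            return $xml; // the document already declares an encoding
        }
        // The encoding has to come directly after the version.
        return preg_replace(
            '/^(<\?xml\s+version\s*=\s*(["\'])[^"\']*\2)/',
            '$1' . $pseudoAttr,
            $xml,
            1
        );
    }

    // No declaration at all: prepend a complete one.
    return '<?xml version="1.0"' . $pseudoAttr . '?>' . "\n" . $xml;
}

// A latin1 "ä" (0xE4) in a document whose declaration lacks an encoding.
$xml = '<?xml version="1.0"?><foo>' . chr(0xE4) . '</foo>';
$doc = new DOMDocument();
$doc->loadXML(withAssumedEncoding($xml));
echo bin2hex($doc->documentElement->textContent), "\n";
```

The DOM extension hands strings back as UTF-8, so the hex dump is a quick way to check that the byte was interpreted in the assumed encoding rather than rejected.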