Note: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original Stack Overflow question: http://stackoverflow.com/questions/2507608/

Date: 2020-08-25 06:44:24  Source: igfitidea

Error: "Input is not proper UTF-8, indicate encoding !" using PHP's simplexml_load_string

Tags: php, xml, encoding, character-encoding, simplexml

Asked by Camsoft

I'm getting the error:

parser error : Input is not proper UTF-8, indicate encoding ! Bytes: 0xED 0x6E 0x2C 0x20

This happens when trying to process an XML response from a 3rd-party source using simplexml_load_string. The raw XML response does declare its encoding:

<?xml version="1.0" encoding="UTF-8"?>

Yet it seems that the XML is not really UTF-8. The language of the XML content is Spanish, and it contains words like Dublín.

I'm unable to get the 3rd party to sort out their XML.

How can I pre-process the XML and fix the encoding incompatibilities?

Is there a way to detect the correct encoding for an XML file?

Answered by Josh Davis

Your 0xED 0x6E 0x2C 0x20 bytes correspond to "ín, " in ISO-8859-1, so it looks like your content is in ISO-8859-1, not UTF-8. Tell your data provider about it and ask them to fix it, because if it doesn't work for you it probably doesn't work for other people either.
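
To check this kind of diagnosis yourself, you can decode the bytes from the error message as ISO-8859-1 and see whether they spell something sensible. A minimal sketch, assuming the iconv extension is available (the variable names are illustrative, not from the original answer):

```php
<?php
// Bytes reported by the parser error: 0xED 0x6E 0x2C 0x20
$bytes = "\xED\x6E\x2C\x20";

// Interpret them as ISO-8859-1 and re-encode as UTF-8.
$decoded = iconv('ISO-8859-1', 'UTF-8', $bytes);

// The same bytes are not valid UTF-8: 0xED announces a three-byte
// sequence, but 0x6E ("n") is not a continuation byte.
$isValidUtf8 = preg_match('//u', $bytes) === 1;

var_dump($decoded === "\xC3\xADn, "); // "\xC3\xAD" is "í" in UTF-8
var_dump($isValidUtf8);
```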

Now there are a few ways to work around it, which you should only use if you cannot load the XML normally. One of them would be to use utf8_encode(). The downside is that if the XML contains both valid UTF-8 and some ISO-8859-1, then the result will contain mojibake. Or you can try to convert the string from UTF-8 to UTF-8 using iconv() or mbstring, and hope they'll fix it for you. (They won't, but you can at least ignore the invalid characters so you can load your XML.)
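
The mojibake problem can be illustrated with a short sketch. It assumes the iconv extension and uses a hypothetical helper, to_utf8_if_needed(), that converts only when the input is not already valid UTF-8. It does not help with input that mixes both encodings in one string.

```php
<?php
// Hypothetical helper: convert to UTF-8 only when the input is not
// already valid UTF-8. Assumption: anything invalid is ISO-8859-1.
function to_utf8_if_needed(string $raw): string
{
    if (preg_match('//u', $raw) === 1) {
        return $raw; // already valid UTF-8, leave untouched
    }
    return iconv('ISO-8859-1', 'UTF-8', $raw);
}

$latin1 = "Dubl\xEDn";     // "Dublín" in ISO-8859-1
$utf8   = "Dubl\xC3\xADn"; // "Dublín" in UTF-8

// Blindly treating valid UTF-8 as ISO-8859-1 double-encodes it (mojibake).
$mojibake = iconv('ISO-8859-1', 'UTF-8', $utf8);

var_dump($mojibake === $utf8);                  // false: the text was mangled
var_dump(to_utf8_if_needed($latin1) === $utf8); // true
var_dump(to_utf8_if_needed($utf8) === $utf8);   // true
```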

Or you can take the long, long road and validate/fix the sequences by yourself. That will take you a while depending on how familiar you are with UTF-8. Perhaps there are libraries out there that would do that, although I don't know any.
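
As a rough idea of what validating the sequences yourself could look like, here is a sketch that keeps well-formed multi-byte UTF-8 sequences and treats every other high byte as ISO-8859-1. The function name repair_mixed_utf8() is made up for this example, the ISO-8859-1 assumption is a guess about the data, and ISO-8859-1 text that happens to look like valid UTF-8 will be left as it is.

```php
<?php
// Hypothetical helper: keep well-formed UTF-8, re-encode stray high bytes.
function repair_mixed_utf8(string $bytes): string
{
    $pattern = '/
        (                                   # group 1: a well-formed multi-byte UTF-8 sequence
            [\xC2-\xDF][\x80-\xBF]
          | \xE0[\xA0-\xBF][\x80-\xBF]
          | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}
          | \xED[\x80-\x9F][\x80-\xBF]
          | \xF0[\x90-\xBF][\x80-\xBF]{2}
          | [\xF1-\xF3][\x80-\xBF]{3}
          | \xF4[\x80-\x8F][\x80-\xBF]{2}
        )
      | ([\x80-\xFF])                       # group 2: any other byte from 0x80 to 0xFF
    /x';

    $fixed = preg_replace_callback($pattern, function (array $m): string {
        if (isset($m[2]) && $m[2] !== '') {
            // Assumption: a stray high byte is an ISO-8859-1 character.
            return iconv('ISO-8859-1', 'UTF-8', $m[2]);
        }
        return $m[0]; // valid sequence, keep as is
    }, $bytes);

    return $fixed ?? $bytes; // fall back to the input if PCRE reports an error
}

// One word in ISO-8859-1 and one in UTF-8 within the same string.
$repaired = repair_mixed_utf8("Dubl\xEDn, Dubl\xC3\xADn");
var_dump($repaired === "Dubl\xC3\xADn, Dubl\xC3\xADn");
```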

Either way, notify your data provider that they're sending invalid data so that they can fix it.

Here's a partial fix. It will definitely not fix everything, but it will fix some of it. Hopefully enough for you to get by until your provider fixes their stuff.

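// Re-encode bytes that look like stray ISO-8859-1 characters: a byte in
// 0xA1-0xFF that is not followed by two or more UTF-8 continuation bytes.
// Note: utf8_encode() is deprecated as of PHP 8.2.
function fix_latin1_mangled_with_utf8_maybe_hopefully_most_of_the_time($str)
{
    return preg_replace_callback('#[\xA1-\xFF](?![\x80-\xBF]{2,})#', 'utf8_encode_callback', $str);
}

// Callback: convert the single matched byte from ISO-8859-1 to UTF-8.
function utf8_encode_callback($m)
{
    return utf8_encode($m[0]);
}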

Answered by Erik

I solved this using:

$content = utf8_encode(file_get_contents('http://example.com/rss.xml'));
$xml = simplexml_load_string($content);
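
Since utf8_encode() is deprecated as of PHP 8.2, the same approach can be sketched with iconv(). This assumes the iconv and SimpleXML extensions and that the whole feed really is ISO-8859-1. The helper name load_latin1_xml() and the inline sample document are made up for illustration.

```php
<?php
// Hypothetical helper: treat the whole payload as ISO-8859-1, convert it
// to UTF-8, then parse it.
function load_latin1_xml(string $raw)
{
    $content = iconv('ISO-8859-1', 'UTF-8', $raw);

    // Collect libxml errors instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $xml = simplexml_load_string($content);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $xml; // SimpleXMLElement on success, false on failure
}

// Declared as UTF-8, but "í" is the single ISO-8859-1 byte 0xED.
$raw = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><city>Dubl\xEDn</city>";
$xml = load_latin1_xml($raw);
var_dump($xml !== false && (string) $xml === "Dubl\xC3\xADn");
```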

Answered by befox

If you are sure that your XML is encoded in UTF-8 but contains bad characters, you can use this function to correct them:

$content = iconv('UTF-8', 'UTF-8//IGNORE', $content);
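
Note that //IGNORE discards the offending bytes rather than repairing them, so an ISO-8859-1 "í" would simply disappear. A small sketch of a wrapper follows; the name drop_invalid_utf8() is hypothetical. The handling of //IGNORE depends on the underlying iconv implementation, and on some systems the call may still return false, so the result is worth checking.

```php
<?php
// Hypothetical wrapper around the call above.
function drop_invalid_utf8(string $content)
{
    // @ hides the notice iconv() raises when it meets an illegal byte.
    return @iconv('UTF-8', 'UTF-8//IGNORE', $content);
}

$clean = drop_invalid_utf8("Dubl\xEDn");
if ($clean === false) {
    // The iconv implementation on this system rejected the input;
    // another workaround is needed.
    $clean = null;
}
```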

Answered by Paul Blundell

We recently ran into a similar issue and were unable to find anything obvious as the cause. There turned out to be a control character in our string, but when we output that string to the browser the character was not visible unless we copied the text into an IDE.

We managed to solve our problem thanks to this post and this:

preg_replace('/[\x00-\x1F\x7F]/', '', $input);
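
The pattern above also removes tabs, line feeds and carriage returns, which XML 1.0 allows. If you want to keep those, a narrower variant could look like the sketch below. The function name strip_xml_illegal_controls() is made up here, and the character range is my reading of the XML 1.0 character rules rather than something from the original answer.

```php
<?php
// Hypothetical helper: remove only the C0 control characters that XML 1.0
// forbids, keeping tab (0x09), line feed (0x0A) and carriage return (0x0D).
function strip_xml_illegal_controls(string $input): string
{
    $clean = preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F]/', '', $input);
    return $clean ?? $input; // fall back to the input if PCRE reports an error
}

$clean = strip_xml_illegal_controls("a\x00b\tc\n\x1Fd");
var_dump($clean === "ab\tc\nd");
```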

Answered by Chango

Instead of using JavaScript, you can simply put this line of code after your mysql_connect statement:

mysql_set_charset('utf8',$connection);

Cheers.

Answered by skr

If you download the XML file and open it in, for example, Notepad++, you'll see that the encoding is set to something other than UTF-8. I've had the same problem with XML I made myself, and it was just the encoding in the editor :)

The string <?xml version="1.0" encoding="UTF-8"?> doesn't set the encoding of the document; it's only information for a validator or another resource.
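
To see whether the declaration and the actual bytes agree, you could compare the two. The sketch below uses two hypothetical helpers, declared_encoding() and declaration_matches_bytes(), and only checks documents that claim to be UTF-8. It ignores byte order marks and UTF-16 detection.

```php
<?php
// Hypothetical helper: read the encoding named in the XML declaration, if any.
function declared_encoding(string $xml): ?string
{
    if (preg_match('/^<\?xml[^>]*\bencoding\s*=\s*["\']([A-Za-z0-9._-]+)["\']/', $xml, $m)) {
        return strtoupper($m[1]);
    }
    return null; // no encoding declared
}

// Hypothetical helper: does a UTF-8 claim hold for the actual bytes?
function declaration_matches_bytes(string $xml): bool
{
    // Assumption: a document without a declared encoding is meant to be UTF-8.
    $declared = declared_encoding($xml) ?? 'UTF-8';
    if ($declared !== 'UTF-8') {
        return true; // this sketch only verifies UTF-8 claims
    }
    return preg_match('//u', $xml) === 1;
}

$header = '<?xml version="1.0" encoding="UTF-8"?>';
var_dump(declaration_matches_bytes($header . "<c>Dubl\xEDn</c>"));     // ISO-8859-1 bytes
var_dump(declaration_matches_bytes($header . "<c>Dubl\xC3\xADn</c>")); // UTF-8 bytes
```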

Answered by Pekka

Can you open the 3rd party XML source in Firefox and see what it auto-detects as encoding? Maybe they are using plain old ISO-8859-1, UTF-16 or something else.

If they declare it to be UTF-8, though, and serve something else, their feed is clearly broken. Working around such a broken feed feels horrible to me (even though sometimes unavoidable, I know).

If it's a simple case like "UTF-8 versus ISO-8859-1", you can also try your luck with mb_detect_encoding().
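
A minimal sketch of that idea, assuming the mbstring extension is installed; the helper name guess_utf8_or_latin1() is hypothetical. Because every byte sequence is valid ISO-8859-1, this effectively answers only "is it valid UTF-8 or not", so treat the result as a guess.

```php
<?php
// Hypothetical helper; requires the mbstring extension.
function guess_utf8_or_latin1(string $raw): string
{
    // Strict mode makes invalid UTF-8 disqualify UTF-8 as a candidate.
    $guess = mb_detect_encoding($raw, ['UTF-8', 'ISO-8859-1'], true);
    return $guess === false ? 'unknown' : $guess;
}

$raw = "Dubl\xEDn, "; // the bytes from the question, with some context
$encoding = guess_utf8_or_latin1($raw);

// Convert only when the guess says the input is not UTF-8 already.
$utf8 = $encoding === 'ISO-8859-1'
    ? mb_convert_encoding($raw, 'UTF-8', 'ISO-8859-1')
    : $raw;
```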

Answered by paragbaxi

I just had this problem. Turns out the XML file (not the contents) was not encoded in utf-8, but in ISO-8859-1. You can check this on a Mac with file -I xml_filename.

I used Sublime to change the file encoding to utf-8, and lxml imported it with no issues.

Answered by George John

After several tries, I found that the htmlentities function works.

$value = htmlentities($value);

Answered by Tim Lieberman

When generating mapping files using Doctrine, I ran into the same issue. I fixed it by removing all comments that some fields had in the database.