PHP: Why does DOM change encoding?
Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/2236889/
Why Does DOM Change Encoding?
Asked by Richard Knop
$string = file_get_contents('http://example.com');
if ('UTF-8' === mb_detect_encoding($string)) {
    $dom = new DOMDocument();
    // hack to preserve UTF-8 characters
    $dom->loadHTML('<?xml encoding="UTF-8">' . $string);
    $dom->preserveWhiteSpace = false;
    $dom->encoding = 'UTF-8';
    $body = $dom->getElementsByTagName('body');
    echo htmlspecialchars($body->item(0)->nodeValue);
}
This changes all UTF-8 characters to ?, ?, ¤ and other rubbish. Is there any other way to preserve UTF-8 characters?
Don't post answers telling me to make sure I am outputting it as UTF-8; I have made sure I am.
Thanks in advance :)
Answered by andrewmabbott
I had similar problems recently, and eventually found this workaround: convert all the non-ASCII characters to HTML entities before loading the HTML.
$string = mb_convert_encoding($string, 'HTML-ENTITIES', "UTF-8");
$dom->loadHTML($string);
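For illustration, the workaround above can be put into a small end-to-end sketch. The function name bodyTextFromUtf8Html is hypothetical, and the sketch assumes the input really is UTF-8 and that the dom and mbstring extensions are loaded. It also swaps in mb_encode_numericentity() for the 'HTML-ENTITIES' conversion, because PHP 8.2 and later report that pseudo-encoding as deprecated; the idea (hand libxml pure ASCII with entities) is the same.

```php
<?php
// Hypothetical helper, not part of the original answer.
// Returns the text content of <body> for a UTF-8 HTML string.
function bodyTextFromUtf8Html(string $html): string
{
    // Turn every code point from U+0080 upward into a numeric entity
    // (&#NNN;), so the parser receives plain ASCII and cannot misread it.
    $ascii = mb_encode_numericentity($html, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8');

    $dom = new DOMDocument();

    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($ascii);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $body = $dom->getElementsByTagName('body')->item(0);

    // DOM string properties are returned as UTF-8.
    return $body === null ? '' : $body->textContent;
}
```

A call such as echo htmlspecialchars(bodyTextFromUtf8Html($string)); would then stand in for the last lines of the code in the question.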
Answered by Pekka
In case it is definitely the DOM screwing up the encoding, this trick did it for me a while back the other way round (accepting ISO-8859-1 data). DOMDocument should be UTF-8 by default in any case, but you can still try:
$dom = new DOMDocument('1.0', 'utf-8');
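As a rough sketch of what that second constructor argument does, the snippet below builds a tiny document and serializes it. The function name serializeWithDeclaredEncoding is hypothetical. The assumption here is that the argument sets the encoding the document declares and uses when it is written out; whether it changes how loadHTML() reads its input is not something this sketch shows.

```php
<?php
// Hypothetical helper, for illustration only.
// Builds <p>$text</p> in a document declared as UTF-8 and serializes it.
function serializeWithDeclaredEncoding(string $text): string
{
    $dom = new DOMDocument('1.0', 'utf-8');

    $p = $dom->createElement('p');
    // createTextNode() escapes markup characters such as & and <.
    $p->appendChild($dom->createTextNode($text));
    $dom->appendChild($p);

    // saveXML() returns false on failure; cast so the return type holds.
    return (string) $dom->saveXML();
}
```

Calling serializeWithDeclaredEncoding("caf\u{00E9}") should produce an XML declaration carrying encoding="utf-8", with the accented character written as raw UTF-8 bytes rather than as a numeric character reference.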
Answered by goat
At the top of the script containing your PHP code (the code you posted here), make sure you send a UTF-8 header. I bet your encoding is some variant of latin1 right now. Yes, I know the remote web page is UTF-8, but this PHP script isn't.
Answered by fty4
I had to add a UTF-8 header to get the correct view:
header('Content-Type: text/html; charset=utf-8');
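A minimal sketch of how the header fits into a script that prints extracted text is shown below. The function name renderUtf8Page and the page skeleton are assumptions made for illustration. The key point is that header() has to run before any output is sent, which is why the call is guarded with headers_sent().

```php
<?php
// Hypothetical helper, for illustration only.
// Declares the response as UTF-8 and returns an HTML page wrapping $text.
function renderUtf8Page(string $text): string
{
    // header() only works before output starts, so check first.
    if (!headers_sent()) {
        header('Content-Type: text/html; charset=utf-8');
    }

    // Escape the text as UTF-8 so multibyte characters stay intact.
    $safe = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');

    return '<!DOCTYPE html><html><head><meta charset="utf-8"></head><body>'
        . $safe
        . '</body></html>';
}
```

In the question's script, echo renderUtf8Page($body->item(0)->nodeValue); would replace the final echo line. The meta tag repeats the charset inside the page in case the header is dropped along the way.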

