Original question: http://stackoverflow.com/questions/11309194/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverFlow
PHP DomDocument failing to handle utf-8 characters (☆)
Asked by Greg
The webserver is serving responses with utf-8 encoding, all files are saved with utf-8 encoding, and every setting I know of has been set to utf-8 encoding.
Here's a quick program to test whether the output works:
<?php
$html = <<<HTML
<!doctype html>
<html>
<head>
<meta charset="utf-8">
<title>Test!</title>
</head>
<body>
<h1>☆ Hello ☆ World ☆</h1>
</body>
</html>
HTML;
$dom = new DomDocument("1.0", "utf-8");
$dom->loadHTML($html);
header("Content-Type: text/html; charset=utf-8");
echo($dom->saveHTML());
The output of the program is:
<!DOCTYPE html>
<html><head><meta charset="utf-8"><title>Test!</title></head><body>
<h1>&acirc;&#152;&#134; Hello &acirc;&#152;&#134; World &acirc;&#152;&#134;</h1>
</body></html>
Which renders as:
â?? Hello â?? World â??
What could I be doing wrong? How much more specific do I have to be to tell the DomDocument to handle utf-8 properly?
Answered by hakre
DOMDocument::loadHTML() expects an HTML string.
HTML uses the ISO-8859-1 encoding (ISO Latin Alphabet No. 1) as its default, per its specs. This has been the case for a long time, see 6.1. The HTML Document Character Set. In practice, common web browsers treat that default as Windows-1252.
I go back that far because PHP's DOMDocument is based on libxml, which brings the HTMLparser, and that parser is designed for HTML 4.0.
I'd say it's safe to assume, then, that you can load an ISO-8859-1 encoded string.
Your string is UTF-8 encoded. Turn all characters higher than 127 (0x7F) into HTML entities and you're fine. If you don't want to do that on your own, that is what mb_convert_encoding with the HTML-ENTITIES target encoding does:
- Characters that have named entities get the named entity, e.g. € -> &euro;
- The others get their numeric (decimal) entity, e.g. ☆ -> &#9734;
The following code example makes the process a bit more visible by using a callback function:
$html = preg_replace_callback('/[\x{80}-\x{10FFFF}]/u', function($match) {
list($utf8) = $match;
$entity = mb_convert_encoding($utf8, 'HTML-ENTITIES', 'UTF-8');
printf("%s -> %s\n", $utf8, $entity);
return $entity;
}, $html);
For your string, this outputs:
☆ -> &#9734;
☆ -> &#9734;
☆ -> &#9734;
Anyway, that's just for looking deeper into your string. You want to have it converted into an encoding loadHTML can deal with. That can be done by converting everything outside of US-ASCII into HTML entities:
$us_ascii = mb_convert_encoding($utf_8, 'HTML-ENTITIES', 'UTF-8');
Take care that your input is actually UTF-8 encoded. mb_convert_encoding can only handle one encoding per string, so mixed encodings (which can happen with some inputs) will not convert cleanly. I already outlined above how to do string replacements more specifically with the help of regular expressions, so I'll leave out further details for now.
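The advice above (check that the input really is UTF-8, then convert everything outside US-ASCII into entities) can be sketched as follows. This is a minimal sketch, not part of the original answer: the helper name toAsciiWithEntities is hypothetical, and it swaps in mb_encode_numericentity because the HTML-ENTITIES target of mb_convert_encoding is deprecated as of PHP 8.2. Unlike HTML-ENTITIES, this variant always emits numeric entities, never named ones. It assumes the mbstring extension is loaded.

```php
<?php
// Hypothetical helper (not from the answer): validate the input,
// then encode every non-ASCII character as a decimal numeric entity.
function toAsciiWithEntities(string $input): string
{
    // Refuse input that is not well-formed UTF-8 instead of converting garbage.
    if (!mb_check_encoding($input, 'UTF-8')) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }

    // Conversion map: [first code point, last code point, offset, mask].
    $map = [0x80, 0x10FFFF, 0, 0x1FFFFF];

    return mb_encode_numericentity($input, $map, 'UTF-8');
}

echo toAsciiWithEntities('<h1>☆ Hello</h1>'), "\n"; // prints <h1>&#9734; Hello</h1>
```

The resulting string is pure US-ASCII, so it can be passed to loadHTML regardless of which default encoding the parser assumes.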
The other alternative is to hint the encoding. In your case, this can be done by modifying the document and adding a
<meta http-equiv="content-type" content="text/html; charset=utf-8">
which is a Content-Type specifying a charset. That is also best practice for HTML strings that are not served by a webserver (e.g. saved on disk or held in a string, as in your example). The webserver normally sets that as the response header.
If you don't care about the misplaced warnings, you can just add it in front of the string:
$dom = new DomDocument();
$dom->loadHTML('<meta http-equiv="content-type" content="text/html; charset=utf-8">'.$html);
Per the HTML 2.0 specs, elements that can only appear in the <head> section of a document will be automatically placed there. This is what happens here, too. The output (pretty-printed):
<!DOCTYPE html>
<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=utf-8">
<meta charset="utf-8">
<title>Test!</title>
</head>
<body>
<h1>☆ Hello ☆ World ☆</h1>
</body>
</html>
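To check that the hint took effect, one option is to read the text back out of the parsed tree rather than inspect the serialized HTML. The following is a minimal sketch under stated assumptions: the variable names and the shortened input fragment are illustrative, not from the original answer, and libxml_use_internal_errors is used here only to collect the warnings mentioned above instead of printing them.

```php
<?php
// Illustrative input: a UTF-8 fragment without any charset information of its own.
$html = '<h1>☆ Hello ☆ World ☆</h1>';

// The encoding hint from the answer, prepended to the string.
$hint = '<meta http-equiv="content-type" content="text/html; charset=utf-8">';

$dom = new DOMDocument();

// Collect parser warnings instead of emitting them as PHP warnings.
libxml_use_internal_errors(true);
$dom->loadHTML($hint . $html);
libxml_clear_errors();

// DOM text is exposed as UTF-8, so a correctly decoded star survives the round trip.
$h1 = $dom->getElementsByTagName('h1')->item(0);
echo $h1->textContent, "\n";
```

If the stars come back intact from textContent, the parser decoded the input as UTF-8; any remaining mojibake would then point at the output side rather than at loadHTML.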
Answered by DeZeA
There's a faster fix for that: after loading your HTML document into DOMDocument, you just set (or, better said, reset) the original encoding. Here's some sample code:
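$dom = new DOMDocument();
// The XML declaration hints the parser to read the input as UTF-8.
$dom->loadHTML('<?xml encoding="UTF-8">' . $html);
// Remove the processing-instruction node that the hint left in the document.
foreach ($dom->childNodes as $item) {
    if ($item->nodeType == XML_PI_NODE) {
        $dom->removeChild($item);
        break; // stop here: the node list is live and was just modified
    }
}
$dom->encoding = 'UTF-8'; // reset original encoding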
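The same idea can be wrapped in a small function so the hint and its cleanup stay together. This is a hedged sketch, not the answerer's code: the function name loadUtf8Html is hypothetical, warnings are collected with libxml_use_internal_errors as an added assumption, and the child nodes are copied into an array before removal so the live node list is not modified while it is being iterated. How the parser treats the <?xml ...> prefix may depend on the libxml2 version in use.

```php
<?php
// Hypothetical wrapper (not from the answer) around the "<?xml encoding>" hint.
function loadUtf8Html(string $html): DOMDocument
{
    $dom = new DOMDocument();

    // Collect parser warnings instead of emitting them as PHP warnings.
    libxml_use_internal_errors(true);
    $dom->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();

    // Iterate over a snapshot so removing a node cannot skip its siblings.
    foreach (iterator_to_array($dom->childNodes) as $node) {
        if ($node->nodeType === XML_PI_NODE) {
            $dom->removeChild($node);
        }
    }

    $dom->encoding = 'UTF-8'; // reset original encoding
    return $dom;
}

$dom = loadUtf8Html('<h1>☆ Hello ☆ World ☆</h1>');
echo $dom->getElementsByTagName('h1')->item(0)->textContent, "\n";
```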
Answered by Vladimir Kadalashvili
<?php
header("Content-type: text/html; charset=utf-8");
$html = <<<HTML
<!doctype html>
<html>
<head>
<meta charset="utf-8">
<title>Test!</title>
</head>
<body>
<h1>☆ Hello ☆ World ☆</h1>
</body>
</html>
HTML;
// Convert every non-ASCII character to an HTML entity before parsing.
$html = mb_convert_encoding($html, 'HTML-ENTITIES', "UTF-8");
$dom = new DomDocument("1.0", "utf-8");
$dom->loadHTML($html);
// The Content-Type header was already sent at the top of the script.
echo($dom->saveHTML());
Output:
<!DOCTYPE html>
<html><head><meta charset="utf-8"><title>Test!</title></head><body>
<h1>☆ Hello ☆ World ☆</h1>
</body></html>

