PHP DOMDocument loadHTML not encoding UTF-8 correctly
Notice: this page reproduces a popular Stack Overflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Original question: http://stackoverflow.com/questions/8218230/
Asked by Slightly A.
I'm trying to parse some HTML using DOMDocument, but when I do, I suddenly lose my encoding (at least that is how it appears to me).
$profile = "<div><p>various japanese characters</p></div>";
$dom = new DOMDocument();
$dom->loadHTML($profile);
$divs = $dom->getElementsByTagName('div');
foreach ($divs as $div) {
echo $dom->saveHTML($div);
}
The result of this code is that I get a bunch of characters that are not Japanese. However, if I do:
echo $profile;
it displays correctly. I've tried saveHTML and saveXML, and neither display correctly. I am using PHP 5.3.
What I see:
?¤?a??¤?·?·???′???|??¢?¤?????3??3???????o-???9?oo?????5?a???¨??|??????????????|4?oo???3?a???a?£????è|a?ˉ?¨??????????1??3?§??ˉè|a?ˉéμ????±????¢??¤??? ?£??é?? ????£?ˉ?-?£??£???¢????¤????¤?????è2è3é????a??????a??ˉ?3???é?? ???é2?-|?
What should be shown:
イリノイ州シカゴにて、アイルランド系の家庭に、9人兄弟の5番目として生まれる。彼を含めて4人が俳優になった。父親は木材のセールスマンで、母親は郵便局の客室係だった。高校時代はキャディのアルバイトに勤しみ、教育資金を受けながらカトリック系の高校へ進学
EDIT: I've simplified the code down to five lines so you can test it yourself.
$profile = "<div lang=ja><p>イリノイ州シカゴにて、アイルランド系の家庭に、</p></div>";
$dom = new DOMDocument();
$dom->loadHTML($profile);
echo $dom->saveHTML();
echo $profile;
Here is the html that is returned:
<div lang="ja"><p>??¤??a?????¤?·???·?????′???|???¢??¤????????3??‰?3???????o-???</p></div>
<div lang="ja"><p>イリノイ州シカゴにて、アイルランド系の家庭に、</p></div>
Answered by cmbuckley
DOMDocument::loadHTML will treat your string as being in ISO-8859-1 unless you tell it otherwise. This results in UTF-8 strings being interpreted incorrectly.
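To make the misinterpretation concrete, here is a minimal sketch that mimics it outside the DOM: it re-reads the bytes of a UTF-8 string as if each byte were one ISO-8859-1 character. The helper name misreadAsLatin1 is hypothetical and only illustrates the effect; it is not what libxml does internally, and it assumes the mbstring extension is available.

```php
<?php
// Hypothetical helper: treat the input bytes as ISO-8859-1 and re-encode
// them as UTF-8, which is roughly what a wrong encoding guess amounts to.
function misreadAsLatin1(string $utf8): string
{
    return mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');
}

$original = "\xE3\x82\xA4";               // the three UTF-8 bytes of the katakana "I" (U+30A4)
$garbled  = misreadAsLatin1($original);

echo mb_strlen($original, 'UTF-8'), "\n"; // one character before the misreading
echo mb_strlen($garbled, 'UTF-8'), "\n";  // one character per original byte afterwards
```

Each multi-byte character turns into several unrelated characters, which matches the kind of garbage shown in the question.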
If your string doesn't contain an XML encoding declaration, you can prepend one to cause the string to be treated as UTF-8:
$profile = '<p>イリノイ州シカゴにて、アイルランド系の家庭に、9</p>';
$dom = new DOMDocument();
$dom->loadHTML('<?xml encoding="utf-8" ?>' . $profile);
echo $dom->saveHTML();
If you cannot know whether the string already contains such a declaration, there's a workaround in SmartDOMDocument which should help you:
$profile = '<p>イリノイ州シカゴにて、アイルランド系の家庭に、9</p>';
$dom = new DOMDocument();
$dom->loadHTML(mb_convert_encoding($profile, 'HTML-ENTITIES', 'UTF-8'));
echo $dom->saveHTML();
This is not a great workaround, but since not all characters can be represented in ISO-8859-1 (like these katakana), it's the safest alternative.
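For the case where the input may or may not already carry a declaration, one possible approach is to check before prepending. The sketch below is an assumption-laden illustration: the function name loadUtf8Html and the regular expression are made up for this example, and the check only looks at the very start of the string.

```php
<?php
// Hypothetical helper: prepend the encoding declaration only when the
// input does not already start with an XML declaration of its own.
function loadUtf8Html(string $html): DOMDocument
{
    if (!preg_match('/^\s*<\?xml\b/i', $html)) {
        $html = '<?xml encoding="utf-8" ?>' . $html;
    }

    $dom = new DOMDocument();
    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $dom;
}

$dom = loadUtf8Html("<p>\u{30A4}\u{30EA}\u{30CE}\u{30A4}</p>");
echo $dom->getElementsByTagName('p')->item(0)->textContent, "\n";
```

Whether the declaration is honoured can depend on the libxml version PHP is built against, so it is worth checking the output on the target system.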
Answered by Greeso
The problem is with saveHTML() and saveXML(): neither of them works correctly on Unix. They do not save UTF-8 characters correctly when used on Unix, but they work on Windows.
The workaround is very simple:
If you try the default, you will get the error you described:
$str = $dom->saveHTML(); // saves incorrectly
All you have to do is save as follows:
$str = $dom->saveHTML($dom->documentElement); // saves correctly
This line of code will get your UTF-8 characters to be saved correctly. Use the same workaround if you are using saveXML().
Update
As suggested by "Hyman M" in the comments section below, and verified by "Pamela" and "Marco Aurélio Deleu", the following variation might work in your case:
$str = utf8_decode($dom->saveHTML($dom->documentElement));
Note
English characters do not cause any problem when you use saveHTML() without parameters (because English characters are saved as single-byte characters in UTF-8). The problem happens when you have multi-byte characters (such as Chinese, Russian, Arabic, Hebrew, etc.).
I recommend reading this article: http://coding.smashingmagazine.com/2012/06/06/all-about-unicode-utf8-character-sets/. You will understand how UTF-8 works and why you have this problem. It will take you about 30 minutes, but it is time well spent.
Answered by Hossein
Make sure the real source file is saved as UTF-8 (You may even want to try the non-recommended BOM Chars with UTF-8 to make sure).
Also, in the case of HTML, make sure you have declared the correct encoding using meta tags:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
If it's a CMS (as you've tagged your question with Joomla) you may need to configure appropriate settings for the encoding.
Answered by Ivan
You could prepend a line enforcing utf-8 encoding, like this:
@$doc->loadHTML('<?xml version="1.0" encoding="UTF-8"?>' . "\n" . $profile);
And you can then continue with the code you already have, like:
$doc->saveXML()
Answered by Sam
This took me a while to figure out but here's my answer.
Before using DomDocument I would use file_get_contents to retrieve urls and then process them with string functions. Perhaps not the best way but quick. After being convinced Dom was just as quick I first tried the following:
$dom = new DomDocument('1.0', 'UTF-8');
if ($dom->loadHTMLFile($url) == false) { // read the url
// error message
}
else {
// process
}
This failed spectacularly in preserving UTF-8 encoding despite the proper meta tags, php settings and all the rest of the remedies offered here and elsewhere. Here's what works:
$dom = new DomDocument('1.0', 'UTF-8');
$str = file_get_contents($url);
if ($dom->loadHTML(mb_convert_encoding($str, 'HTML-ENTITIES', 'UTF-8')) == false) {
}
etc. Now everything's right with the world. Hope this helps.
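Note that the 'HTML-ENTITIES' pseudo-encoding used with mb_convert_encoding is deprecated as of PHP 8.2. A possible substitute, sketched below, is mb_encode_numericentity, which rewrites only non-ASCII characters as numeric entities. This is a swapped-in technique rather than what the answer above uses, and the helper name toNumericEntities is hypothetical.

```php
<?php
// Hypothetical helper: turn every non-ASCII character into a numeric entity,
// leaving ASCII markup (including entities that are already there) untouched.
function toNumericEntities(string $utf8Html): string
{
    // Conversion map: code points 0x80..0x10FFFF, no offset, full mask.
    return mb_encode_numericentity($utf8Html, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8');
}

$html = "<p>\u{30A4}\u{30EA}\u{30CE}\u{30A4}</p>"; // katakana written as escapes
$dom  = new DOMDocument();
$dom->loadHTML(toNumericEntities($html));

echo $dom->getElementsByTagName('p')->item(0)->textContent, "\n";
```

Because the string handed to loadHTML is pure ASCII, the parser's encoding guess no longer matters for the text content.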
Answered by Lazaros Kosmidis
You must feed the DOMDocument a version of your HTML with a header that makes sense, just like HTML5.
$profile ='<?xml version="1.0" encoding="'.$_encoding.'"?>'. $html;
It may be a good idea to keep your HTML as valid as you can, so you don't run into issues when you start querying around :-) and stay away from htmlentities!!!! That's an unnecessary back and forth that wastes resources. Keep your code insane!!!!
Answered by mMo
Works fine for me:
$dom = new \DOMDocument;
$dom->loadHTML(utf8_decode($html));
...
return utf8_encode( $dom->saveHTML());
Answered by sajed zarrinpour
I am using PHP 7.3.8 on Manjaro and I was working with Persian content. This solved my problem:
$html = 'hi</b><p>????<div>の家庭に、9 ☆';
$doc = new DOMDocument('1.0', 'UTF-8');
$doc->loadHTML(mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8'));
print $doc->saveHTML($doc->documentElement) . PHP_EOL . PHP_EOL;
Answered by Alexander Goncharov
Use this to get the correct result:
$dom = new DOMDocument();
$dom->loadHTML('<meta http-equiv="Content-Type" content="text/html; charset=utf-8">' . $profile);
echo $dom->saveHTML();
echo $profile;
This operation
mb_convert_encoding($profile, 'HTML-ENTITIES', 'UTF-8');
is a bad approach, because entities such as &lt; and &gt; can already be in $profile, and they will not be converted a second time by mb_convert_encoding. This opens a hole for XSS and incorrect HTML.
Answered by Luke Madhanga
The only thing that worked for me was the accepted answer of
$profile = '<p>イリノイ州シカゴにて、アイルランド系の家庭に、9</p>';
$dom = new DOMDocument();
$dom->loadHTML('<?xml encoding="utf-8" ?>' . $profile);
echo $dom->saveHTML();
HOWEVER
This brought about a new issue: <?xml encoding="utf-8" ?> appeared in the output of the document.
The solution for me was then to do
// Remove the processing instruction left behind by the encoding prefix.
foreach ($dom->childNodes as $xx) {
    if ($xx instanceof \DOMProcessingInstruction) {
        $xx->parentNode->removeChild($xx);
    }
}
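Removing nodes while iterating over a live node list can skip entries, so a slightly more defensive variant iterates over a snapshot of the top-level nodes. This is a sketch only; the function name removeProcessingInstructions is hypothetical.

```php
<?php
// Hypothetical helper: drop top-level processing instructions such as the
// <?xml encoding="utf-8" ?> prefix, iterating over a copy of the node list.
function removeProcessingInstructions(DOMDocument $dom): void
{
    foreach (iterator_to_array($dom->childNodes) as $node) {
        if ($node instanceof DOMProcessingInstruction) {
            $dom->removeChild($node);
        }
    }
}

$dom = new DOMDocument();
$dom->loadHTML('<?xml encoding="utf-8" ?><p>' . "\u{30A4}\u{30EA}" . '</p>');
removeProcessingInstructions($dom);
echo $dom->saveHTML();
```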
Some solutions told me that to remove the xml header, I had to perform
$dom->saveXML($dom->documentElement);
This didn't work for me because, for a partial document (e.g. a doc with two <p> tags), only one of the <p> tags was returned.
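For fragments with several top-level elements, one option is to serialize every child of the <body> that loadHTML wraps around the input and concatenate the results. The sketch below assumes the default wrapping behaviour (no LIBXML_HTML_NOIMPLIED flag); the function name innerBodyHtml is hypothetical.

```php
<?php
// Hypothetical helper: return the markup of all children of <body>,
// so a fragment with several sibling elements is returned in full.
function innerBodyHtml(DOMDocument $dom): string
{
    $body = $dom->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }

    $html = '';
    foreach ($body->childNodes as $child) {
        $html .= $dom->saveHTML($child);
    }
    return $html;
}

$dom = new DOMDocument();
$dom->loadHTML('<p>first</p><p>second</p>');
echo innerBodyHtml($dom), "\n";
```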