Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/8218230/

Time: 2020-08-26 04:10:57  Source: igfitidea

PHP DOMDocument loadHTML not encoding UTF-8 correctly

php utf-8 character-encoding

Asked by Slightly A.

I'm trying to parse some HTML using DOMDocument, but when I do, I suddenly lose my encoding (at least that is how it appears to me).

$profile = "<div><p>various japanese characters</p></div>";
$dom = new DOMDocument();
$dom->loadHTML($profile); 

$divs = $dom->getElementsByTagName('div');

foreach ($divs as $div) {
    echo $dom->saveHTML($div);
}

The result of this code is that I get a bunch of characters that are not Japanese. However, if I do:

echo $profile;

it displays correctly. I've tried saveHTML and saveXML, and neither displays correctly. I am using PHP 5.3.

What I see:

?¤?a??¤?·?·???′???|??¢?¤?????3??3???????o-???9?oo?????5?a???¨??|??????????????|4?oo???3?a???a?£????è|a?ˉ?¨??????????1??3?§??ˉè|a?ˉéμ????±????¢??¤??? ?£??é?? ????£?ˉ?-?£??£???¢????¤????¤?????è2è3é????a??????a??ˉ?3???é?? ???é2?-|?

What should be shown:

イリノイ州シカゴにて、アイルランド系の家庭に、9人兄弟の5番目として生まれる。彼を含めて4人が俳優になった。父親は木材のセールスマンで、母親は郵便局の客室係だった。高校時代はキャディのアルバイトに勤しみ、教育資金を受けながらカトリック系の高校へ進学

EDIT: I've simplified the code down to five lines so you can test it yourself.

$profile = "<div lang=ja><p>イリノイ州シカゴにて、アイルランド系の家庭に、</p></div>";
$dom = new DOMDocument();
$dom->loadHTML($profile);
echo $dom->saveHTML();
echo $profile;

Here is the html that is returned:

<div lang="ja"><p>??¤??a?????¤?·???·?????′???|???¢??¤????????3??‰?3???????o-???</p></div>
<div lang="ja"><p>イリノイ州シカゴにて、アイルランド系の家庭に、</p></div>

Answered by cmbuckley

DOMDocument::loadHTML will treat your string as being in ISO-8859-1 unless you tell it otherwise. This results in UTF-8 strings being interpreted incorrectly.
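
To see why the output turns into mojibake, here is a minimal sketch that does not involve DOMDocument at all. It assumes the mbstring extension is available and simply re-reads UTF-8 bytes as ISO-8859-1, which is the misinterpretation described above:

```php
<?php
// "é" (U+00E9) is stored as two bytes in UTF-8: 0xC3 0xA9.
$utf8 = "\u{00E9}";

// Pretend those bytes are ISO-8859-1 and convert them "back" to UTF-8.
// Each byte becomes its own character, which is the classic mojibake.
$misread = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

echo bin2hex($utf8), PHP_EOL;   // c3a9
echo $misread, PHP_EOL;         // Ã©
```

Multi-byte characters such as Japanese go through the same process, only with three bytes per character instead of two.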

If your string doesn't contain an XML encoding declaration, you can prepend one to cause the string to be treated as UTF-8:

$profile = '<p>イリノイ州シカゴにて、アイルランド系の家庭に、9</p>';
$dom = new DOMDocument();
$dom->loadHTML('<?xml encoding="utf-8" ?>' . $profile);
echo $dom->saveHTML();

If you cannot know whether the string will already contain such a declaration, there's a workaround in SmartDOMDocument which should help you:

$profile = '<p>イリノイ州シカゴにて、アイルランド系の家庭に、9</p>';
$dom = new DOMDocument();
$dom->loadHTML(mb_convert_encoding($profile, 'HTML-ENTITIES', 'UTF-8'));
echo $dom->saveHTML();

This is not a great workaround, but since not all characters can be represented in ISO-8859-1 (like these katakana), it's the safest alternative.
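
Note that passing 'HTML-ENTITIES' to mb_convert_encoding is deprecated as of PHP 8.2. A sketch of the same idea using mb_encode_numericentity follows. The helper name toNumericEntities is made up for this example, and the conversion map is an assumption meant to cover every code point above ASCII:

```php
<?php
// Hypothetical helper: turn every non-ASCII character into a numeric entity,
// so the markup is pure ASCII and the parser's encoding guess stops mattering.
function toNumericEntities(string $html): string
{
    // Map layout: [first code point, last code point, offset, mask]
    return mb_encode_numericentity($html, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8');
}

$profile = '<p>イリノイ州シカゴにて</p>';

$dom = new DOMDocument();
$dom->loadHTML(toNumericEntities($profile));

// The DOM stores text as UTF-8 internally, so textContent is the original string.
echo $dom->getElementsByTagName('p')->item(0)->textContent, PHP_EOL;
```

Like the original workaround, this leaves entities that are already in the input untouched.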

Answered by Greeso

The problem is with saveHTML() and saveXML(): neither of them works correctly on Unix. They do not save UTF-8 characters correctly when used on Unix, but they work on Windows.

The workaround is very simple:

If you try the default, you will get the error you described:

$str = $dom->saveHTML(); // saves incorrectly

All you have to do is save as follows:

$str = $dom->saveHTML($dom->documentElement); // saves correctly

This line of code will get your UTF-8 characters to be saved correctly. Use the same workaround if you are using saveXML().
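
A minimal sketch for comparing the two calls on your own machine. To keep the loading step out of the picture, the input here is plain ASCII with numeric entities (an assumption for this example, not part of the original question). Depending on the PHP and libxml versions, either call may emit raw UTF-8 or entities, so the output is decoded before it is compared:

```php
<?php
// ASCII-only markup: the entities spell "イリノイ", so loading cannot go wrong.
$dom = new DOMDocument();
$dom->loadHTML('<p>&#12452;&#12522;&#12494;&#12452;</p>');

$whole = $dom->saveHTML();                       // whole document
$node  = $dom->saveHTML($dom->documentElement);  // root element only

// Inspect the raw strings to see which form your setup produces.
var_dump($whole, $node);

// Decode any entities so both results can be compared with the expected text.
$expected  = "\u{30A4}\u{30EA}\u{30CE}\u{30A4}";
$wholeText = html_entity_decode($whole, ENT_QUOTES | ENT_HTML5, 'UTF-8');
$nodeText  = html_entity_decode($node, ENT_QUOTES | ENT_HTML5, 'UTF-8');

var_dump(strpos($wholeText, $expected) !== false);
var_dump(strpos($nodeText, $expected) !== false);
```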

Update

As suggested by "Hyman M" in the comments section below, and verified by "Pamela" and "Marco Aurélio Deleu", the following variation might work in your case:

$str = utf8_decode($dom->saveHTML($dom->documentElement));


Note

  1. English characters do not cause any problem when you use saveHTML() without parameters (because English characters are saved as single-byte characters in UTF-8).

  2. The problem happens when you have multi-byte characters (such as Chinese, Russian, Arabic, Hebrew, etc.).

I recommend reading this article: http://coding.smashingmagazine.com/2012/06/06/all-about-unicode-utf8-character-sets/. You will understand how UTF-8 works and why you have this problem. It will take you about 30 minutes, but it is time well spent.

Answered by Hossein

Make sure the real source file is saved as UTF-8 (You may even want to try the non-recommended BOM Chars with UTF-8 to make sure).

Also, in the case of HTML, make sure you have declared the correct encoding using meta tags:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

If it's a CMS (as you've tagged your question with Joomla) you may need to configure appropriate settings for the encoding.

Answered by Ivan

You could prefix a line enforcing utf-8 encoding, like this:

@$doc->loadHTML('<?xml version="1.0" encoding="UTF-8"?>' . "\n" . $profile);

And you can then continue with the code you already have, like:

$doc->saveXML()

Answered by Sam

This took me a while to figure out but here's my answer.

Before using DomDocument I would use file_get_contents to retrieve URLs and then process them with string functions. Perhaps not the best way, but quick. After being convinced DOM was just as quick, I first tried the following:

$dom = new DomDocument('1.0', 'UTF-8');
if ($dom->loadHTMLFile($url) == false) { // read the url
    // error message
}
else {
    // process
}

This failed spectacularly in preserving UTF-8 encoding despite the proper meta tags, php settings and all the rest of the remedies offered here and elsewhere. Here's what works:

$dom = new DomDocument('1.0', 'UTF-8');
$str = file_get_contents($url);
if ($dom->loadHTML(mb_convert_encoding($str, 'HTML-ENTITIES', 'UTF-8')) === false) {
    // error message
}

etc. Now everything's right with the world. Hope this helps.

Answered by Lazaros Kosmidis

You must feed DOMDocument a version of your HTML with a header that makes sense, just like HTML5.

$profile ='<?xml version="1.0" encoding="'.$_encoding.'"?>'. $html;

It may be a good idea to keep your HTML as valid as you can, so you don't run into issues when you start querying it :-) and stay away from htmlentities! That's an unnecessary back-and-forth that wastes resources. Keep your code sane!

Answered by mMo

Works fine for me:

$dom = new \DOMDocument;
$dom->loadHTML(utf8_decode($html));
...
return  utf8_encode( $dom->saveHTML());
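
Bear in mind that utf8_decode() and utf8_encode() are deprecated as of PHP 8.2, and that they only convert between UTF-8 and ISO-8859-1. The sketch below uses mb_convert_encoding as a stand-in (assuming mbstring is available) to show what that means in practice: Latin-1 characters survive the conversion, while characters outside ISO-8859-1, such as Japanese, are replaced:

```php
<?php
// Same conversion that utf8_decode() performs: UTF-8 -> ISO-8859-1.
$latin    = mb_convert_encoding("caf\u{00E9}", 'ISO-8859-1', 'UTF-8');
$japanese = mb_convert_encoding("\u{30A4}", 'ISO-8859-1', 'UTF-8');

echo bin2hex($latin), PHP_EOL;  // 636166e9 (the "é" becomes the single byte 0xE9)
echo $japanese, PHP_EOL;        // "?" with mbstring's default substitute character
```

So this route is only an option when the content fits into ISO-8859-1.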

Answered by sajed zarrinpour

I am using PHP 7.3.8 on Manjaro and I was working with Persian content. This solved my problem:

$html = 'hi</b><p>????<div>の家庭に、9 ☆';
$doc = new DOMDocument('1.0', 'UTF-8');
$doc->loadHTML(mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8'));
print $doc->saveHTML($doc->documentElement) . PHP_EOL . PHP_EOL;

Answered by Alexander Goncharov

Use this to get the correct result:

$dom = new DOMDocument();
$dom->loadHTML('<meta http-equiv="Content-Type" content="text/html; charset=utf-8">' . $profile);
echo $dom->saveHTML();
echo $profile;
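
A self-contained sketch of the same meta-tag approach, checking the result through textContent rather than by eye. It relies on libxml picking up the charset from the prepended tag, which is the behaviour this answer depends on; the sample string is taken from the question:

```php
<?php
$profile = '<div lang="ja"><p>イリノイ州シカゴにて</p></div>';

$dom = new DOMDocument();
// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);
$dom->loadHTML(
    '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">' . $profile
);
libxml_clear_errors();

// If the charset was honoured, the text node holds the original characters.
$text = $dom->getElementsByTagName('p')->item(0)->textContent;
var_dump($text === 'イリノイ州シカゴにて');
```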

This operation

mb_convert_encoding($profile, 'HTML-ENTITIES', 'UTF-8');

is a bad approach, because entities such as &lt; and &gt; may already be present in $profile, and they will not be converted a second time by mb_convert_encoding. This opens a hole for XSS and incorrect HTML.

Answered by Luke Madhanga

The only thing that worked for me was the accepted answer of

$profile = '<p>イリノイ州シカゴにて、アイルランド系の家庭に、9</p>';
$dom = new DOMDocument();
$dom->loadHTML('<?xml encoding="utf-8" ?>' . $profile);
echo $dom->saveHTML();

HOWEVER

This brought about a new issue: <?xml encoding="utf-8" ?> ended up in the output of the document.

The solution for me was then to do

// Copy the node list first, so removing a node does not disturb the loop
foreach (iterator_to_array($dom->childNodes) as $xx) {
    if ($xx instanceof \DOMProcessingInstruction) {
        $xx->parentNode->removeChild($xx);
    }
}
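
Putting the two steps together, here is a sketch of the whole round trip: prepend the declaration, load, then strip the processing instruction before saving. Copying the node list with iterator_to_array() is a precaution added for this example, so that removing a node does not disturb the iteration:

```php
<?php
$profile = '<p>イリノイ州シカゴにて</p>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // keep parser warnings out of the output
$dom->loadHTML('<?xml encoding="utf-8" ?>' . $profile);
libxml_clear_errors();

// Snapshot the top-level nodes, then drop any processing instruction
// (the prepended XML declaration ends up as one).
foreach (iterator_to_array($dom->childNodes) as $child) {
    if ($child instanceof DOMProcessingInstruction) {
        $dom->removeChild($child);
    }
}

$html = $dom->saveHTML();
var_dump(strpos($html, '<?xml') === false);
```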

Some solutions told me that to remove the xml header, I had to perform

$dom->saveXML($dom->documentElement);

This didn't work for me: for a partial document (e.g. a doc with two <p> tags), only one of the <p> tags was being returned.
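
For a fragment with several top-level elements, one possible way around this is to serialize each child of <body> separately and concatenate the results. This is only a sketch and not something taken from the answers above; it uses ASCII-only input so that encoding does not get in the way:

```php
<?php
$dom = new DOMDocument();
$dom->loadHTML('<p>one</p><p>two</p>');

// loadHTML wraps the fragment in <html><body>, so walk the children of <body>.
$body = $dom->getElementsByTagName('body')->item(0);

$out = '';
foreach ($body->childNodes as $child) {
    $out .= $dom->saveHTML($child);  // serialize one top-level node at a time
}

echo $out, PHP_EOL;
```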