PHP DOMDocument loadHTML not encoding UTF-8 correctly

Question

Asked by Slightly A.

I'm trying to parse some HTML using DOMDocument, but when I do, I suddenly lose my encoding (at least that is how it appears to me).

$profile = "<div><p>various japanese characters</p></div>";
$dom = new DOMDocument();
$dom->loadHTML($profile); 

$divs = $dom->getElementsByTagName('div');

foreach ($divs as $div) {
    echo $dom->saveHTML($div);
}

The result of this code is that I get a bunch of characters that are not Japanese. However, if I do:

echo $profile;

it displays correctly. I've tried saveHTML and saveXML, and neither displays correctly. I am using PHP 5.3.

What I see:

?¤?a??¤?·?·???′???|??￠?¤?????3??3???????o-???9?oo?????5?a???¨??|??????????????|4?oo???3?a???a?￡????è|a?ˉ?¨??????????1??3?§??ˉè|a?ˉéμ????±????￠??¤??? ?￡??é?? ????￡?ˉ?-?￡??￡???￠????¤????¤?????è2è3é????a??????a??ˉ?3???é?? ???é2?-|?

What should be shown:

イリノイ州シカゴにて、アイルランド系の家庭に、9人兄弟の5番目として生まれる。彼を含めて4人が俳優になった。父親は木材のセールスマンで、母親は郵便局の客室係だった。高校時代はキャディのアルバイトに勤しみ、教育資金を受けながらカトリック系の高校へ進学

EDIT: I've simplified the code down to five lines so you can test it yourself.

$profile = "<div lang=ja><p>イリノイ州シカゴにて、アイルランド系の家庭に、</p></div>";
$dom = new DOMDocument();
$dom->loadHTML($profile);
echo $dom->saveHTML();
echo $profile;

Here is the HTML that is returned:

<div lang="ja"><p>??¤??a?????¤?·???·?????′???|???￠??¤????????3??‰?3???????o-???</p></div>
<div lang="ja"><p>イリノイ州シカゴにて、アイルランド系の家庭に、</p></div>

Answer 1

Answered by cmbuckley

DOMDocument::loadHTML will treat your string as being in ISO-8859-1 unless you tell it otherwise. This results in UTF-8 strings being interpreted incorrectly.
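
To see why this produces garbage, here is a minimal sketch (not part of the original answer) that imitates the misreading with plain string functions. It assumes the mbstring extension is available and that the source file itself is saved as UTF-8.

```php
<?php
// Four katakana characters; in UTF-8 each one occupies three bytes.
$text = 'イリノイ';

// Pretend the bytes are ISO-8859-1 and re-encode them as UTF-8,
// which is roughly what happens when the parser guesses the wrong charset.
$misread = mb_convert_encoding($text, 'UTF-8', 'ISO-8859-1');

echo strlen($text), "\n";                // 12 (bytes in the original)
echo mb_strlen($misread, 'UTF-8'), "\n"; // 12 (every byte became its own character)
echo strlen($misread), "\n";             // 24 (each of those now needs two bytes)
```

Every byte of the original turns into a separate Latin-1 character, which is the kind of output shown in the question.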

If your string doesn't contain an XML encoding declaration, you can prepend one to cause the string to be treated as UTF-8:

$profile = '<p>イリノイ州シカゴにて、アイルランド系の家庭に、9</p>';
$dom = new DOMDocument();
$dom->loadHTML('<?xml encoding="utf-8" ?>' . $profile);
echo $dom->saveHTML();

If you cannot know whether the string already contains such a declaration, there's a workaround in SmartDOMDocument which should help you:

$profile = '<p>イリノイ州シカゴにて、アイルランド系の家庭に、9</p>';
$dom = new DOMDocument();
$dom->loadHTML(mb_convert_encoding($profile, 'HTML-ENTITIES', 'UTF-8'));
echo $dom->saveHTML();

This is not a great workaround, but since not all characters can be represented in ISO-8859-1 (like these katakana), it's the safest alternative.
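
As a side note, the 'HTML-ENTITIES' pseudo-encoding is deprecated in mbstring as of PHP 8.2. A possible substitute, sketched here under the assumption that the input is valid UTF-8 and mbstring is loaded, is mb_encode_numericentity with a conversion map covering everything from U+0080 upward. The sample string and variable names are only for illustration.

```php
<?php
// Assumption: $profile is valid UTF-8.
$profile = '<p>イリノイ &amp; シカゴ</p>';

// Map every code point from U+0080 to U+10FFFF to a numeric entity.
// ASCII, including existing markup and entities, is left untouched.
$convmap = [0x80, 0x10FFFF, 0, 0x1FFFFF];
$encoded = mb_encode_numericentity($profile, $convmap, 'UTF-8');

$dom = new DOMDocument();
$dom->loadHTML($encoded);

// textContent is always UTF-8, so it shows what the parser understood.
$text = $dom->getElementsByTagName('p')->item(0)->textContent;
echo $text, "\n";
```

Because the string handed to loadHTML is pure ASCII after the conversion, the ISO-8859-1 default no longer matters.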

Answer 2

Answered by Greeso

The problem is with saveHTML() and saveXML(): neither of them works correctly on Unix. They do not save UTF-8 characters correctly there, although they work on Windows.

The workaround is very simple:

If you try the default, you will get the error you described:

$str = $dom->saveHTML(); // saves incorrectly

All you have to do is save as follows:

$str = $dom->saveHTML($dom->documentElement); // saves correctly

This line of code will get your UTF-8 characters to be saved correctly. Use the same workaround if you are using saveXML().

Update

As suggested by "Hyman M" in the comments section below, and verified by "Pamela" and "Marco Aurélio Deleu", the following variation might work in your case:

$str = utf8_decode($dom->saveHTML($dom->documentElement));
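
A rough sketch of why this variation can help: if the document was loaded as ISO-8859-1, the text inside it is effectively encoded twice, and converting from UTF-8 back to ISO-8859-1 removes one layer. The example imitates that with mb_convert_encoding, used here as a stand-in for utf8_decode (which is deprecated as of PHP 8.2); it assumes mbstring is available.

```php
<?php
$original = 'シカゴ';

// Simulate the wrong guess: UTF-8 bytes read as ISO-8859-1, stored as UTF-8.
$doubleEncoded = mb_convert_encoding($original, 'UTF-8', 'ISO-8859-1');

// Undo one layer, which is what utf8_decode does to the saved output.
$restored = mb_convert_encoding($doubleEncoded, 'ISO-8859-1', 'UTF-8');

var_dump($doubleEncoded === $original); // bool(false)
var_dump($restored === $original);      // bool(true)
```

This only repairs text that was misread in exactly this way; applied to correctly decoded text it would destroy any character outside ISO-8859-1.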

Note

English characters do not cause any problem when you use saveHTML() without parameters (because English characters are saved as single-byte characters in UTF-8).
The problem happens when you have multi-byte characters (such as Chinese, Russian, Arabic, Hebrew, etc.).
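
The difference can be checked with a small sketch. The helper name misreadAsLatin1 is made up for this illustration, and mbstring is assumed to be available.

```php
<?php
// Hypothetical helper: treat UTF-8 bytes as ISO-8859-1 and re-encode them.
function misreadAsLatin1(string $s): string
{
    return mb_convert_encoding($s, 'UTF-8', 'ISO-8859-1');
}

// ASCII bytes mean the same thing in both encodings, so nothing changes.
var_dump(misreadAsLatin1('Hello') === 'Hello');   // bool(true)

// Multi-byte characters are split into separate Latin-1 characters.
var_dump(misreadAsLatin1('Привет') === 'Привет'); // bool(false)
```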

I recommend reading this article: http://coding.smashingmagazine.com/2012/06/06/all-about-unicode-utf8-character-sets/. You will understand how UTF-8 works and why you have this problem. It will take you about 30 minutes, but it is time well spent.

Answer 3

Answered by Hossein

Make sure the actual source file is saved as UTF-8 (you may even want to try saving it with a BOM, although that is not recommended, just to make sure).

Also, in the case of HTML, make sure you have declared the correct encoding using meta tags:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

If it's a CMS (as you've tagged your question with Joomla) you may need to configure appropriate settings for the encoding.

Answer 4

Answered by Ivan

You could prefix a line enforcing utf-8 encoding, like this:

@$doc->loadHTML('<?xml version="1.0" encoding="UTF-8"?>' . "\n" . $profile);

And you can then continue with the code you already have, like:

$doc->saveXML();

Answer 5

Answered by Sam

This took me a while to figure out but here's my answer.

Before using DomDocument I would use file_get_contents to retrieve URLs and then process them with string functions. Perhaps not the best way, but quick. After being convinced that DOM was just as quick, I first tried the following:

$dom = new DomDocument('1.0', 'UTF-8');
if ($dom->loadHTMLFile($url) == false) { // read the url
    // error message
}
else {
    // process
}

This failed spectacularly at preserving UTF-8 encoding, despite the proper meta tags, PHP settings and all the rest of the remedies offered here and elsewhere. Here's what works:

$dom = new DomDocument('1.0', 'UTF-8');
$str = file_get_contents($url);
if ($dom->loadHTML(mb_convert_encoding($str, 'HTML-ENTITIES', 'UTF-8')) == false) {
    // error message
}

etc. Now everything's right with the world. Hope this helps.

Answer 6

Answered by Lazaros Kosmidis

You must feed the DOMDocument a version of your HTML with a header that makes sense, just like HTML5.

$profile ='<?xml version="1.0" encoding="'.$_encoding.'"?>'. $html;

It may be a good idea to keep your HTML as valid as you can, so you don't run into issues when you start querying around :-) and stay away from htmlentities! That's an unnecessary back-and-forth that wastes resources. Keep your code sane!

Answer 7

Answered by mMo

Works fine for me:

$dom = new \DOMDocument;
$dom->loadHTML(utf8_decode($html));
...
return  utf8_encode( $dom->saveHTML());

Answer 8

Answered by sajed zarrinpour

I am using PHP 7.3.8 on Manjaro and I was working with Persian content. This solved my problem:

$html = 'hi</b><p>????<div>の家庭に、9 ☆';
$doc = new DOMDocument('1.0', 'UTF-8');
$doc->loadHTML(mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8'));
print $doc->saveHTML($doc->documentElement) . PHP_EOL . PHP_EOL;

Answer 9

Answered by Alexander Goncharov

Use this to get the correct result:

$dom = new DOMDocument();
$dom->loadHTML('<meta http-equiv="Content-Type" content="text/html; charset=utf-8">' . $profile);
echo $dom->saveHTML();
echo $profile;

This operation

mb_convert_encoding($profile, 'HTML-ENTITIES', 'UTF-8');

is a bad approach, because special symbols like &lt; and &gt; may already be in $profile, and mb_convert_encoding will not encode them a second time. This opens the door to XSS and incorrect HTML.

Answer 10

Answered by Luke Madhanga

The only thing that worked for me was the code from the accepted answer:

$profile = '<p>イリノイ州シカゴにて、アイルランド系の家庭に、9</p>';
$dom = new DOMDocument();
$dom->loadHTML('<?xml encoding="utf-8" ?>' . $profile);
echo $dom->saveHTML();

HOWEVER

This brought about a new issue: <?xml encoding="utf-8" ?> appears in the output of the document.

My solution was then to remove that processing instruction after loading:

foreach ($dom->childNodes as $xx) {
    if ($xx instanceof \DOMProcessingInstruction) {
        $xx->parentNode->removeChild($xx);
    }
}

Some solutions told me that to remove the xml header, I had to perform

$dom->saveXML($dom->documentElement);

This didn't work for me because, for a partial document (e.g. a doc with two <p> tags), only one of the <p> tags was being returned.
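
One way around both problems, offered only as a sketch and not as the answerer's code, is to serialize the children of the <body> element one by one. That skips the processing instruction and keeps every top-level tag of the fragment. The function name renderFragment is hypothetical, and the sketch assumes the input is a UTF-8 fragment without its own <html> or <body> wrapper.

```php
<?php
// Hypothetical helper: parse a UTF-8 HTML fragment and return it re-serialized.
function renderFragment(string $html): string
{
    $dom = new DOMDocument();

    // Collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML('<?xml encoding="utf-8" ?>' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // The parser wraps the fragment in <html><body>; take only its children.
    $body = $dom->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }

    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

echo renderFragment('<p>イリノイ</p><p>シカゴ</p>'), "\n";
```

Since only nodes inside <body> are written out, the <?xml ... ?> prefix never reaches the output and does not have to be removed separately.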
