PHP: fixing broken UTF-8 encoding

Disclaimer: this page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/1344692/

Time: 2020-08-25 02:07:23  Source: igfitidea

Fixing broken UTF-8 encoding

Tags: php, mysql, unicode, utf-8

Asked by Jayrox

I am in the process of fixing some bad UTF-8 encoding. I am currently using PHP 5 and MySQL.

In my database I have a few instances of bad encodings that print like: ????

  • The database collation is utf8_general_ci
  • PHP is using a proper UTF-8 header
  • Notepad++ is set to use UTF-8 without BOM
  • database management is handled in phpMyAdmin
  • not all cases of accented characters are broken

I need some sort of function that will help me map the instances of ????, ???-, ???? and others like them to their proper accented UTF-8 characters.

Accepted answer by Eli

I've had to try to 'fix' a number of UTF8 broken situations in the past, and unfortunately it's never easy, and often rather impossible.

Unless you can determine exactly how it was broken, and it was always broken in that exact same way, then it's going to be hard to 'undo' the damage.

If you want to try to undo the damage, your best bet would be to start writing some sample code, where you attempt numerous variations on calls to mb_convert_encoding() to see if you can find a combination of 'from' and 'to' that fixes your data. In the end, it's often best to not even bother worrying about fixing the old data because of the pain levels involved, but instead to just fix things going forward.
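The trial-and-error approach described above can be sketched as a small loop over candidate encodings. This is only an illustration: the sample string and the list of encodings are assumptions, and a real run would use a value copied from the affected table. Note that mb_convert_encoding() takes the target encoding before the source encoding.

```php
<?php
// Assumed sample: "Fédération" whose UTF-8 bytes were misread as Latin-1
// and then encoded to UTF-8 a second time.
$broken = "F\xC3\x83\xC2\xA9d\xC3\x83\xC2\xA9ration";

// Encodings to try; extend this list to match the systems the data passed through.
$encodings = ['UTF-8', 'ISO-8859-1', 'Windows-1252'];

$candidates = [];
foreach ($encodings as $from) {
    foreach ($encodings as $to) {
        if ($from === $to) {
            continue;
        }
        // Argument order: string, target encoding, source encoding.
        $result = mb_convert_encoding($broken, $to, $from);
        // Keep only results that changed and are still well-formed UTF-8.
        if ($result !== $broken && mb_check_encoding($result, 'UTF-8')) {
            $candidates["$from -> $to"] = $result;
        }
    }
}

// Print every surviving combination so a human can judge which one looks right.
foreach ($candidates as $label => $text) {
    echo str_pad($label, 28), $text, PHP_EOL;
}
```

Several combinations may survive the validity check, so the final choice still has to be made by eye against known-good values.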

However, before doing this, you need to make sure that you fix everything that is causing this issue in the first place. You've already mentioned that your DB table collation and editors are set properly. But there are more places where you need to check to make sure that everything is properly UTF-8:

  • Make sure that you are serving your HTML as UTF-8:
    • header("Content-Type: text/html; charset=utf-8");
  • Change your PHP default charset to utf-8:
    • ini_set("default_charset", 'utf-8');
  • If your database doesn't ALWAYS talk in utf-8, then you may need to tell it on a per-connection basis to ensure it's in utf-8 mode. In MySQL you do that by issuing:
    • charset utf8
  • You may need to tell your webserver to always try to talk in UTF8. In Apache the directive is:
    • AddDefaultCharset UTF-8
  • Finally, you need to ALWAYS make sure that you are using PHP functions that are properly UTF-8 compliant. This means always using the mb_* style 'multibyte aware' string functions. It also means that when calling functions such as htmlspecialchars(), you include the appropriate 'utf-8' charset parameter at the end to make sure that it doesn't encode them incorrectly.

If you slip up on any one step in the whole process, the encoding can get mangled and problems arise. Once you get in the 'groove' of doing UTF-8, though, this all becomes second nature. And of course, PHP6 is supposed to be fully Unicode compliant from the get-go, which will make lots of this easier (hopefully).
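The PHP side of the checklist above might look roughly like the sketch below. The sample string is an assumption, and the header and database lines are left as comments because they only make sense in a web request with a live connection ($db is a hypothetical mysqli handle).

```php
<?php
// PHP-side defaults from the checklist.
ini_set('default_charset', 'UTF-8');   // charset PHP uses by default
mb_internal_encoding('UTF-8');         // default encoding for the mb_* functions

// header('Content-Type: text/html; charset=utf-8'); // send before any output in a web request
// $db->set_charset('utf8');                         // mysqli: set the connection charset once per connection

// "Fédération" written as explicit UTF-8 bytes, so the file's own encoding does not matter.
$name = "F\xC3\xA9d\xC3\xA9ration";

$byteCount = strlen($name);             // byte-oriented: counts bytes
$charCount = mb_strlen($name, 'UTF-8'); // multibyte aware: counts characters

// Pass the charset explicitly when escaping for HTML.
$html = htmlspecialchars("<b>$name</b>", ENT_QUOTES, 'UTF-8');

echo $byteCount, ' bytes, ', $charCount, ' characters', PHP_EOL;
echo $html, PHP_EOL;
```

The gap between the two counts is the reason the byte-oriented string functions are unsafe for truncating or slicing UTF-8 text.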

Answer by jsdalton

If you have double-encoded UTF8 characters (various smart quotes, dashes, apostrophe â€™, quotation mark â€œ, etc.), in MySQL you can dump the data, then read it back in to fix the broken encoding.

Like this:

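# A bare -p makes the client prompt for the password; "-p DB_PASSWORD" with a
# space would be read as a database name (use -pDB_PASSWORD to pass it inline).
mysqldump -h DB_HOST -u DB_USER -p --opt --quote-names \
    --skip-set-charset --default-character-set=latin1 DB_NAME > DB_NAME-dump.sql

mysql -h DB_HOST -u DB_USER -p \
    --default-character-set=utf8 DB_NAME < DB_NAME-dump.sql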

This was a 100% fix for my double encoded UTF-8.

Source: http://blog.hno3.org/2010/04/22/fixing-double-encoded-utf-8-data-in-mysql/

Answer by Sebastián Grignoli

If you call utf8_encode() on a string that is already UTF-8, the result looks garbled, and more so each time it is encoded again.
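To see why repeated encoding garbles text, here is a small sketch. utf8_encode() treats its input as ISO-8859-1, and mb_convert_encoding() with 'ISO-8859-1' as the source performs the same conversion; it is used here only because utf8_encode() is deprecated as of PHP 8.2. The variable names are illustrative.

```php
<?php
$once   = "\xC3\xA9";                                         // "é" correctly encoded as UTF-8
$twice  = mb_convert_encoding($once, 'UTF-8', 'ISO-8859-1');  // encoded a second time
$thrice = mb_convert_encoding($twice, 'UTF-8', 'ISO-8859-1'); // and a third time

// Each pass turns every non-ASCII byte into two bytes, yet the result is
// always well-formed UTF-8, so a validity check alone cannot detect the damage.
foreach ([$once, $twice, $thrice] as $s) {
    printf("%-18s valid UTF-8: %s\n", bin2hex($s), mb_check_encoding($s, 'UTF-8') ? 'yes' : 'no');
}
```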

I made a function, toUTF8(), that converts strings into UTF-8.

You don't need to specify what the encoding of your strings is. It can be Latin1 (iso 8859-1), Windows-1252 or UTF8, or a mix of these three.

I used this myself on a feed with mixed encodings in the same string.

Usage:

$utf8_string = Encoding::toUTF8($mixed_string);

$latin1_string = Encoding::toLatin1($mixed_string);

My other function, fixUTF8(), fixes garbled UTF8 strings that were encoded into UTF8 multiple times.

Usage:

$utf8_string = Encoding::fixUTF8($garbled_utf8_string);

Examples:

echo Encoding::fixUTF8("F??d??ration Camerounaise de Football");
echo Encoding::fixUTF8("F???d???ration Camerounaise de Football");
echo Encoding::fixUTF8("F?????d?????ration Camerounaise de Football");
echo Encoding::fixUTF8("F???dération Camerounaise de Football");

will output:

Fédération Camerounaise de Football
Fédération Camerounaise de Football
Fédération Camerounaise de Football
Fédération Camerounaise de Football
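For readers curious how undoing repeated encoding can work in principle, here is a minimal sketch. It is not the library's implementation: the function name and the pass limit are hypothetical, and unlike fixUTF8() it only handles strings that were re-encoded as a whole, not strings that mix correct and garbled characters.

```php
<?php
// Hypothetical helper, not part of the forceutf8 library.
function undo_repeated_utf8_encoding(string $text, int $maxPasses = 5): string
{
    for ($pass = 0; $pass < $maxPasses; $pass++) {
        // Nothing to do for pure ASCII or for input that is not well-formed UTF-8.
        if (!preg_match('/[\x80-\xFF]/', $text) || !mb_check_encoding($text, 'UTF-8')) {
            break;
        }
        // A character above U+00FF cannot be the result of a Latin-1 misread.
        if (preg_match('/[^\x{00}-\x{FF}]/u', $text)) {
            break;
        }
        // Reverse one round: map each character back to the single byte it came from.
        $decoded = mb_convert_encoding($text, 'ISO-8859-1', 'UTF-8');
        // If the bytes no longer form UTF-8, the previous state was the real text.
        if (!mb_check_encoding($decoded, 'UTF-8')) {
            break;
        }
        $text = $decoded;
    }
    return $text;
}

echo undo_repeated_utf8_encoding("F\xC3\x83\xC2\xA9d\xC3\x83\xC2\xA9ration"), PHP_EOL;
```

The heuristic can misfire on text that legitimately contains sequences such as "Ã©", so it is best run on data already known to be double-encoded.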

Download:

https://github.com/neitanod/forceutf8

Answer by Celleb

I had a problem with an XML file that had a broken encoding: it said it was utf-8, but it had characters that were not utf-8.
After some trial and error with mb_convert_encoding(), I managed to fix it with the call below (the second argument is the target encoding, the third is the source encoding):

mb_convert_encoding($text, 'Windows-1252', 'UTF-8')

Answer by blueyed

As Dan pointed out: you need to convert them to binary and then convert/correct the encoding.

E.g., for utf8 stored as latin1 the following SQL will fix it:

UPDATE table
   SET field = CONVERT( CAST(field AS BINARY) USING utf8)
 WHERE $broken_field_condition

Answer by Jayrox

I know this isn't very elegant, but after it was mentioned that the strings may be double encoded, I made this function:

function fix_double_encoding($string)
{
    // Accented characters to repair: the Latin-1 range U+00C0 to U+00F6.
    $utf8_chars = explode(' ', 'À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß à á â ã ä å æ ç è é ê ë ì í î ï ð ñ ò ó ô õ ö');
    // Build the garbled form of each character by encoding it again.
    $utf8_double_encoded = array();
    foreach ($utf8_chars as $utf8_char)
    {
        $utf8_double_encoded[] = utf8_encode(utf8_encode($utf8_char));
    }
    // Replace each garbled sequence with its intended character.
    $string = str_replace($utf8_double_encoded, $utf8_chars, $string);
    return $string;
}

This seems to work perfectly to remove the double encoding I am experiencing. I am probably missing some of the characters that could be an issue to others. However, for my needs it is working perfectly.

Answer by Dan

The way to do it is to convert to binary and then to the correct encoding.

Answer by Luke Madhanga

Another thing to check, which happened to be my solution (found here), is how data is being returned from your server. In my application, I'm using PDO to connect from PHP to MySQL. I needed to add a flag to the connection telling it to return the data in UTF-8.

The answer was

$dbHandle = new PDO("mysql:host=$dbHost;dbname=$dbName;charset=utf8", $dbUser, $dbPass, 
    array(PDO::MYSQL_ATTR_INIT_COMMAND => "SET NAMES 'utf8'"));

Answer by Jordan Daigle

In my case, I found out by using mb_convert_encoding() that the previous encoding was iso-8859-1 (which is latin1). I then fixed my problem with an SQL query:

UPDATE myDB.myTable SET myColumn = CAST(CAST(CONVERT(myColumn USING latin1) AS binary) AS CHAR)

However, the MySQL documentation states that the conversion may be lossy if the column contains characters that are not in both character sets.

Answer by Erik Aronesty

This script had a nice approach. Converting it to the language of your choice should not be too difficult:

http://plasmasturm.org/log/416/

#!/usr/bin/perl
use strict;
use warnings;

use Encode qw( decode FB_QUIET );

binmode STDIN, ':bytes';
binmode STDOUT, ':encoding(UTF-8)';

my $out;

while ( <> ) {
  $out = '';
  while ( length ) {
    # consume input string up to the first UTF-8 decode error
    $out .= decode( "utf-8", $_, FB_QUIET );
    # consume one character; all octets are valid Latin-1
    $out .= decode( "iso-8859-1", substr( $_, 0, 1 ), FB_QUIET ) if length;
  }
  print $out;
}
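
Since the thread is about PHP, here is a rough port of the same idea: keep every well-formed UTF-8 sequence as it is and treat any other byte as Latin-1. This is a sketch rather than a tested translation of the Perl script. The function name is hypothetical and the pattern follows the well-formed byte sequences listed in RFC 3629.

```php
<?php
// Hypothetical PHP port of the approach above.
function utf8_with_latin1_fallback(string $bytes): string
{
    // One well-formed UTF-8 sequence, anchored at the current offset by \G.
    $wellFormed = '/\G(?:[\x00-\x7F]'
        . '|[\xC2-\xDF][\x80-\xBF]'
        . '|\xE0[\xA0-\xBF][\x80-\xBF]'
        . '|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}'
        . '|\xED[\x80-\x9F][\x80-\xBF]'
        . '|\xF0[\x90-\xBF][\x80-\xBF]{2}'
        . '|[\xF1-\xF3][\x80-\xBF]{3}'
        . '|\xF4[\x80-\x8F][\x80-\xBF]{2})/';

    $out = '';
    $length = strlen($bytes);
    $offset = 0;
    while ($offset < $length) {
        if (preg_match($wellFormed, $bytes, $match, 0, $offset) === 1) {
            // Valid UTF-8: copy the sequence through unchanged.
            $out .= $match[0];
            $offset += strlen($match[0]);
        } else {
            // Not valid UTF-8 here: every single byte is valid Latin-1, so convert one byte.
            $out .= mb_convert_encoding($bytes[$offset], 'UTF-8', 'ISO-8859-1');
            $offset++;
        }
    }
    return $out;
}

// A Latin-1 "é" (0xE9) followed by a proper UTF-8 "é" (0xC3 0xA9).
echo utf8_with_latin1_fallback("caf\xE9 \xC3\xA9"), PHP_EOL;
```

Matching one character per iteration keeps the logic easy to follow at the cost of speed; for large inputs the pattern could be extended to consume a whole run of valid sequences at once.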