UTF8 Encoding problem - With good examples
Original question: http://stackoverflow.com/questions/4095899/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Asked by Lizard
I have the following character encoding issue: somehow I have managed to save data with different character encodings into my database (UTF8). The code and outputs below show 2 sample strings and how they are output. One of them needs to be converted to UTF8 and the other already is UTF8.
How should I go about checking whether I need to encode the string or not? I need each string to be output correctly, so how do I check whether it is already UTF8 or needs to be converted?
I am using PHP 5.2 with MySQL MyISAM tables:
CREATE TABLE IF NOT EXISTS `entities` (
....
`title` varchar(255) NOT NULL
....
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
<?php
$text = $entity['Entity']['title'];
echo 'Original : ', $text."<br />";
echo 'UTF8 Encode : ', utf8_encode($text)."<br />";
echo 'UTF8 Decode : ', utf8_decode($text)."<br />";
echo 'TRANSLIT : ', iconv("ISO-8859-1", "UTF-8//TRANSLIT", $text)."<br />";
echo 'IGNORE TRANSLIT : ', iconv("ISO-8859-1", "UTF-8//IGNORE//TRANSLIT", $text)."<br />";
echo 'IGNORE : ', iconv("ISO-8859-1", "UTF-8//IGNORE", $text)."<br />";
echo 'Plain : ', iconv("ISO-8859-1", "UTF-8", $text)."<br />";
?>
Output 1:
Original : France Télécom
UTF8 Encode : France TÃ©lÃ©com
UTF8 Decode : France T?l?com
TRANSLIT : France TÃ©lÃ©com
IGNORE TRANSLIT : France TÃ©lÃ©com
IGNORE : France TÃ©lÃ©com
Plain : France TÃ©lÃ©com
Output 2:
Original : Cond? Nast Publications
UTF8 Encode : Condé Nast Publications
UTF8 Decode : Cond?ast Publications
TRANSLIT : Condé Nast Publications
IGNORE TRANSLIT : Condé Nast Publications
IGNORE : Condé Nast Publications
Plain : Condé Nast Publications
Thanks for your time on this one. Character encoding and I don't get on very well!
UPDATE:
echo strlen($string)."|".strlen(utf8_encode($string))."|";
echo (strlen($string)!==strlen(utf8_encode($string))) ? $string : utf8_encode($string);
echo "<br />";
echo strlen($string)."|".strlen(utf8_decode($string))."|";
echo (strlen($string)!==strlen(utf8_decode($string))) ? $string : utf8_decode($string);
echo "<br />";
23|24|Cond? Nast Publications
23|21|Cond? Nast Publications
16|20|France Télécom
16|14|France Télécom
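A byte-level look may help explain both sets of output. The following is a minimal sketch and not part of the original question: it assumes the first title is stored as UTF-8 and the second as ISO-8859-1, and it uses iconv() to stand in for utf8_encode() (both convert ISO-8859-1 to UTF-8). The variable names are illustrative.

```php
<?php
// "é" is the two bytes C3 A9 in UTF-8 and the single byte E9 in ISO-8859-1.
$utf8Title   = "France T\xC3\xA9l\xC3\xA9com";   // assumed to be UTF-8 already
$latin1Title = "Cond\xE9 Nast Publications";     // assumed to be ISO-8859-1

// Converting a string that is already UTF-8 treats each of its bytes as a
// Latin-1 character, so C3 A9 becomes C3 83 C2 A9 (displayed as "Ã©").
$doubleEncoded = iconv('ISO-8859-1', 'UTF-8', $utf8Title);

// Converting the real Latin-1 string yields the expected UTF-8 bytes.
$fixed = iconv('ISO-8859-1', 'UTF-8', $latin1Title);

echo strlen($utf8Title), '|', strlen($doubleEncoded), "\n";   // 16|20
echo strlen($latin1Title), '|', strlen($fixed), "\n";         // 23|24
echo bin2hex("\xC3\xA9"), ' -> ',
     bin2hex(iconv('ISO-8859-1', 'UTF-8', "\xC3\xA9")), "\n"; // c3a9 -> c383c2a9
```

The byte counts match the 16|20 and 23|24 figures in the update above, which is consistent with the two rows being stored in different encodings.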
Answered by Pekka
This may be a job for the mb_detect_encoding() function.
In my limited experience with it, it's not 100% reliable when used as a generic "encoding sniffer" - it checks for the presence of certain characters and byte values to make an educated guess - but in this narrow case (it only needs to distinguish between UTF-8 and ISO-8859-1) it should work.
<?php
$text = $entity['Entity']['title'];
echo 'Original : ', $text."<br />";
$enc = mb_detect_encoding($text, "UTF-8,ISO-8859-1");
echo 'Detected encoding '.$enc."<br />";
echo 'Fixed result: '.iconv($enc, "UTF-8", $text)."<br />";
?>
You may get incorrect results for strings that do not contain special characters, but that is not a problem, since pure ASCII text is byte-identical in both encodings.
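When the only candidates are UTF-8 and ISO-8859-1, validating the bytes as UTF-8 is one possible alternative to guessing. This is a minimal sketch, assuming the mbstring extension is available; the function name to_utf8_if_needed is hypothetical and not part of the answer above.

```php
<?php
/**
 * Return $text as UTF-8, assuming it is either valid UTF-8 already
 * or ISO-8859-1. (Hypothetical helper, not from the original answer.)
 */
function to_utf8_if_needed($text)
{
    // mb_check_encoding() returns true only if the bytes form valid UTF-8.
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text;
    }
    // Every byte sequence is valid ISO-8859-1, so it serves as the fallback.
    return mb_convert_encoding($text, 'UTF-8', 'ISO-8859-1');
}

echo to_utf8_if_needed("France T\xC3\xA9l\xC3\xA9com"), "\n"; // left unchanged
echo to_utf8_if_needed("Cond\xE9 Nast Publications"), "\n";   // converted
```

One caveat: an ISO-8859-1 string whose bytes happen to form valid UTF-8 sequences would be left untouched, so this remains a heuristic.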
Answered by Sebastián Grignoli
I made a function that addresses all these issues. It's called Encoding::toUTF8().
<?php
$text = $entity['Entity']['title'];
echo 'Original : ', $text."<br />";
echo 'Encoding::toUTF8 : ', Encoding::toUTF8($text)."<br />";
?>
Output:
Original : France Télécom
Encoding::toUTF8 : France Télécom
Original : Cond? Nast Publications
Encoding::toUTF8 : Condé Nast Publications
You don't need to know what the encoding of your strings is, as long as you know it is Latin1 (ISO 8859-1), Windows-1252 or UTF8. The string can contain a mix of them too.
Encoding::toUTF8() will convert everything to UTF8.
I did it because a service was giving me a feed of data all messed up, mixing UTF8 and Latin1 in the same string.
Usage:
$utf8_string = Encoding::toUTF8($utf8_or_latin1_or_mixed_string);
$latin1_string = Encoding::toLatin1($utf8_or_latin1_or_mixed_string);
Download:
http://dl.dropbox.com/u/186012/PHP/forceUTF8.zip
I've included another function, Encoding::fixUTF8(), which will fix every UTF8 string that looks garbled.
Usage:
$utf8_string = Encoding::fixUTF8($garbled_utf8_string);
Examples:
echo Encoding::fixUTF8("F??d??ration Camerounaise de Football");
echo Encoding::fixUTF8("F???d???ration Camerounaise de Football");
echo Encoding::fixUTF8("F?????d?????ration Camerounaise de Football");
echo Encoding::fixUTF8("F???dération Camerounaise de Football");
will output:
Fédération Camerounaise de Football
Fédération Camerounaise de Football
Fédération Camerounaise de Football
Fédération Camerounaise de Football
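To give a rough idea of how a string mixing UTF-8 and Latin1 could be normalised, here is a minimal sketch. It is not the code of the Encoding class described in this answer: the function name mixed_to_utf8 is hypothetical, Windows-1252 is not handled, and the byte pattern is the well-formed UTF-8 pattern published by the W3C.

```php
<?php
/**
 * Convert a string that mixes UTF-8 and ISO-8859-1 bytes to UTF-8.
 * Hypothetical sketch; this is not the Encoding class from the answer.
 */
function mixed_to_utf8($text)
{
    // Group 1: one well-formed UTF-8 sequence (ASCII or multibyte).
    // Group 2: any other high byte, treated here as ISO-8859-1.
    $pattern = '/([\x00-\x7F]'
        . '|[\xC2-\xDF][\x80-\xBF]'
        . '|\xE0[\xA0-\xBF][\x80-\xBF]'
        . '|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}'
        . '|\xED[\x80-\x9F][\x80-\xBF]'
        . '|\xF0[\x90-\xBF][\x80-\xBF]{2}'
        . '|[\xF1-\xF3][\x80-\xBF]{3}'
        . '|\xF4[\x80-\x8F][\x80-\xBF]{2})'
        . '|([\x80-\xFF])/';

    return preg_replace_callback($pattern, function ($m) {
        if (!isset($m[2]) || $m[2] === '') {
            return $m[1]; // already well-formed UTF-8, keep as is
        }
        // Encode the Latin-1 byte (U+0080..U+00FF) as two UTF-8 bytes.
        $code = ord($m[2]);
        return chr(0xC0 | ($code >> 6)) . chr(0x80 | ($code & 0x3F));
    }, $text);
}

echo mixed_to_utf8("Cond\xE9 Nast and T\xC3\xA9l\xC3\xA9com"), "\n";
```

The sketch is ambiguous by nature: Latin1 bytes that happen to line up as a well-formed UTF-8 sequence are kept as UTF-8.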
Answered by Dr.Molle
Another way, maybe faster and less unreliable:
echo (strlen($str)!==strlen(utf8_decode($str)))
? $str //is multibyte, leave as is
: utf8_encode($str); //encode
It compares the byte length of the original string with that of the utf8_decode()d string. A string that contains a multibyte character has a strlen() that differs from that of its single-byte-encoded equivalent.
For example:
strlen('Télécom')
should return 7 in Latin1 and 9 in UTF8
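strlen() counts bytes, not characters, which is what the comparison relies on. The short sketch below writes the bytes out explicitly so the result does not depend on how the source file is saved. The preg_match() check with the u modifier is an alternative way to test for valid UTF-8 and is not part of the original answer.

```php
<?php
$latin1 = "T\xE9l\xE9com";          // "Télécom" in ISO-8859-1
$utf8   = "T\xC3\xA9l\xC3\xA9com";  // "Télécom" in UTF-8

echo strlen($latin1), "\n"; // 7
echo strlen($utf8), "\n";   // 9

// With the u modifier, PCRE rejects subjects that are not valid UTF-8,
// so an empty pattern works as a validity check.
var_dump(preg_match('//u', $utf8) === 1);   // bool(true)
var_dump(preg_match('//u', $latin1) === 1); // bool(false)
```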
Answered by AlexV
I made these 2 little functions that work well for UTF-8 and ISO-8859-1 detection / conversion...
function detect_encoding($string)
{
//http://w3.org/International/questions/qa-forms-utf-8.html
if (preg_match('%^(?: [\x09\x0A\x0D\x20-\x7E] | [\xC2-\xDF][\x80-\xBF] | \xE0[\xA0-\xBF][\x80-\xBF] | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} | \xED[\x80-\x9F][\x80-\xBF] | \xF0[\x90-\xBF][\x80-\xBF]{2} | [\xF1-\xF3][\x80-\xBF]{3} | \xF4[\x80-\x8F][\x80-\xBF]{2} )*$%xs', $string))
return 'UTF-8';
//If you need to distinguish between UTF-8 and ISO-8859-1 encoding, list UTF-8 first in your encoding_list.
//if you list ISO-8859-1 first, mb_detect_encoding() will always return ISO-8859-1.
return mb_detect_encoding($string, array('UTF-8', 'ASCII', 'ISO-8859-1', 'JIS', 'EUC-JP', 'SJIS'));
}
function convert_encoding($string, $to_encoding, $from_encoding = '')
{
if ($from_encoding == '')
$from_encoding = detect_encoding($string);
if ($from_encoding == $to_encoding)
return $string;
return mb_convert_encoding($string, $to_encoding, $from_encoding);
}
If your database contains strings in 2 different charsets, then instead of plaguing all your application code with charset detection / conversion, I would write a "one shot" script that reads all of your tables' records and updates their strings to the correct format (I would pick UTF-8 if I were you). This way your code will be cleaner and simpler to maintain.
Just loop over the records in every table of your database and convert the strings like this:
//if the 3rd param is not specified the "from encoding" is detected automatically
$newString = convert_encoding($oldString, 'UTF-8');
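The "one shot" idea could be sketched roughly as follows. This is an illustration under stated assumptions, not the author's script: it assumes every value is either valid UTF-8 or ISO-8859-1, that each row has an id key, and that the iconv extension is available. The function name plan_utf8_updates and the sample rows are hypothetical.

```php
<?php
/**
 * Build the list of updates needed to bring the given columns to UTF-8.
 * Assumes each value is either valid UTF-8 or ISO-8859-1 and that every
 * row has an 'id' key. All names here are hypothetical.
 */
function plan_utf8_updates(array $rows, array $columns)
{
    $updates = array();
    foreach ($rows as $row) {
        foreach ($columns as $column) {
            $value = $row[$column];
            if (preg_match('//u', $value) === 1) {
                continue; // already valid UTF-8, nothing to do
            }
            $updates[] = array(
                'id'     => $row['id'],
                'column' => $column,
                'value'  => iconv('ISO-8859-1', 'UTF-8', $value),
            );
        }
    }
    return $updates;
}

// Sample rows standing in for the result of a SELECT on the table.
$rows = array(
    array('id' => 1, 'title' => "France T\xC3\xA9l\xC3\xA9com"),
    array('id' => 2, 'title' => "Cond\xE9 Nast Publications"),
);
print_r(plan_utf8_updates($rows, array('title')));
```

Each planned entry could then be fed to a prepared UPDATE statement. Because converted values are valid UTF-8 and get skipped, running the script a second time should not double-encode anything.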
Answered by Dave
I didn't try your samples here, but from past experience there is a quick fix for this. Right after connecting to the database, execute the following query BEFORE running any other queries:
SET NAMES UTF8;
This is SQL Standard compliant, and works well with other databases, like Firebird and PostgreSQL.
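In PHP, a similar effect can usually be achieved by declaring the charset when connecting. This is a sketch assuming PDO with the MySQL driver; the host, database name and credentials are placeholders, and build_mysql_dsn is a hypothetical helper. The charset DSN option is only honoured from PHP 5.3.6 onwards, so on PHP 5.2 the SET NAMES query remains the way to go.

```php
<?php
// Hypothetical helper: build a PDO MySQL DSN that declares the charset.
function build_mysql_dsn($host, $dbname, $charset = 'utf8')
{
    return sprintf('mysql:host=%s;dbname=%s;charset=%s', $host, $dbname, $charset);
}

$dsn = build_mysql_dsn('localhost', 'mydb');
echo $dsn, "\n"; // mysql:host=localhost;dbname=mydb;charset=utf8

// Connecting would then look like this (placeholder credentials):
// $pdo = new PDO($dsn, 'user', 'password');
// With mysqli, $mysqli->set_charset('utf8') does the same job.
```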
But remember, you need to ensure UTF-8 is declared in other places too in order for your application to work properly. Here is a quick checklist:
- All files should be saved as UTF-8 (preferably without a BOM [Byte Order Mark])
- Your HTTP server should send the encoding header UTF-8. Use Firebug or Live HTTP Headers to inspect.
- If your server compresses and/or tokenizes the response, you may see header content as chunked or gzipped. This is not a problem if you save your files as UTF-8 and
- Declare the encoding in the HTML header, using the proper meta tag.
- Throughout the application (sockets, file system, databases...), don't forget to flag UTF-8 whenever you can. Doing this when opening a database connection, for example, means you don't need to encode/decode/debug all the time. Grab 'em by the root.
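The header and meta tag items of the checklist might look like this in PHP. It is a minimal sketch with a hypothetical render_page helper; the short meta charset form assumes an HTML5 page.

```php
<?php
// Hypothetical helper: render a minimal page that declares UTF-8.
function render_page($title)
{
    // Pass the charset explicitly so escaping is done on UTF-8 input.
    $safeTitle = htmlspecialchars($title, ENT_QUOTES, 'UTF-8');
    return "<!DOCTYPE html>\n<html>\n<head>\n"
        . "<meta charset=\"UTF-8\">\n"
        . "<title>{$safeTitle}</title>\n"
        . "</head>\n<body></body>\n</html>\n";
}

// Declare the encoding in the HTTP response header as well.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=UTF-8');
}
echo render_page("France T\xC3\xA9l\xC3\xA9com");
```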
Answered by Dmitri
- What database do you use?
- You need to know the charset of the original string before you convert it to UTF-8. If it is ISO-8859-1 (latin1), then utf8_encode() is the easiest way; otherwise you need to use either the iconv or mbstring library to convert, and both of these need to know the charset of the input in order to convert properly.
- Do you tell your database about the charset when you insert/select data?