PHP: Detect encoding and make everything UTF-8

Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must comply with the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/910793/

Time: 2020-08-25 00:17:59  Source: igfitidea

Detect encoding and make everything UTF-8

Tags: php, encoding, utf-8, character-encoding

Asked by caw

I'm reading out lots of texts from various RSS feeds and inserting them into my database.

Of course, there are several different character encodings used in the feeds, e.g. UTF-8 and ISO 8859-1.

Unfortunately, there are sometimes problems with the encodings of the texts. Example:

  1. The "ß" in "Fußball" should look like this in my database: "ÃŸ". If it is a "ÃŸ", it is displayed correctly.

  2. Sometimes, the "ß" in "Fußball" looks like this in my database: "ÃƒÅ¸". Then it is displayed wrongly, of course.

  3. In other cases, the "ß" is saved as a "ß" - so without any change. Then it is also displayed wrongly.

What can I do to avoid the cases 2 and 3?

How can I make everything the same encoding, preferably UTF-8? When must I use utf8_encode(), when must I use utf8_decode() (it's clear what the effect is, but when must I use the functions?), and when must I do nothing with the input?

How do I make everything the same encoding? Perhaps with the function mb_detect_encoding()? Can I write a function for this? So my problems are:

  1. How do I find out what encoding the text uses?
  2. How do I convert it to UTF-8 - whatever the old encoding is?

Would a function like this work?

function correct_encoding($text) {
    $current_encoding = mb_detect_encoding($text, 'auto');
    $text = iconv($current_encoding, 'UTF-8', $text);
    return $text;
}

I've tested it, but it doesn't work. What's wrong with it?

Answered by Sebastián Grignoli

If you apply utf8_encode() to an already UTF-8 string, it will return garbled UTF-8 output.
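
To make the garbling concrete, here is a minimal sketch. It uses mb_convert_encoding() with ISO-8859-1 as the source encoding, which performs the same Latin-1-to-UTF-8 conversion as utf8_encode(). The helper name latin1ToUtf8Once is hypothetical and not part of the answer; its guard assumes that a string which is already valid UTF-8 should be left alone, which can misjudge a Latin-1 string whose bytes happen to form valid UTF-8.

```php
// Hypothetical helper (not from the answer): convert ISO 8859-1 input to UTF-8,
// but leave strings that are already valid UTF-8 untouched.
function latin1ToUtf8Once(string $text): string
{
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text;
    }
    return mb_convert_encoding($text, 'UTF-8', 'ISO-8859-1');
}

$utf8 = "Fu\xC3\x9Fball"; // "Fußball" as UTF-8 bytes

// Converting "from ISO 8859-1" a second time garbles the string:
$garbled = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');
echo bin2hex($garbled), "\n";                // 4675c383c29f62616c6c

// The guarded helper leaves valid UTF-8 alone:
echo bin2hex(latin1ToUtf8Once($utf8)), "\n"; // 4675c39f62616c6c
```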

I made a function that addresses all these issues. It's called Encoding::toUTF8().

You don't need to know what the encoding of your strings is. It can be Latin1 (ISO 8859-1), Windows-1252 or UTF-8, or the string can have a mix of them. Encoding::toUTF8() will convert everything to UTF-8.

I did it because a service was giving me a feed of data all messed up, mixing UTF-8 and Latin1 in the same string.

Usage:

require_once('Encoding.php');
use \ForceUTF8\Encoding;  // It's namespaced now.

$utf8_string = Encoding::toUTF8($utf8_or_latin1_or_mixed_string);

$latin1_string = Encoding::toLatin1($utf8_or_latin1_or_mixed_string);

Download:

https://github.com/neitanod/forceutf8

I've included another function, Encoding::fixUTF8(), which will fix every UTF-8 string that looks garbled.

Usage:

require_once('Encoding.php');
use \ForceUTF8\Encoding;  // It's namespaced now.

$utf8_string = Encoding::fixUTF8($garbled_utf8_string);

Examples:

echo Encoding::fixUTF8("F??d??ration Camerounaise de Football");
echo Encoding::fixUTF8("F???d???ration Camerounaise de Football");
echo Encoding::fixUTF8("F?????d?????ration Camerounaise de Football");
echo Encoding::fixUTF8("F???dération Camerounaise de Football");

will output:

Fédération Camerounaise de Football
Fédération Camerounaise de Football
Fédération Camerounaise de Football
Fédération Camerounaise de Football

I've transformed the function (forceUTF8) into a family of static functions on a class called Encoding. The new function is Encoding::toUTF8().

Answered by Gumbo

You first have to detect what encoding has been used. As you're parsing RSS feeds (probably via HTTP), you should read the encoding from the charset parameter of the Content-Type HTTP header field. If it is not present, read the encoding from the encoding attribute of the XML processing instruction. If that's missing too, use UTF-8 as defined in the specification.

Edit: Here is what I probably would do:

I'd use cURL to send the request and fetch the response. That allows you to set specific header fields and fetch the response header as well. After fetching the response, you have to parse the HTTP response and split it into header and body. The header should then contain the Content-Type header field that contains the MIME type and (hopefully) the charset parameter with the encoding/charset too. If not, we'll analyse the XML PI for the presence of the encoding attribute and get the encoding from there. If that's also missing, the XML specification defines UTF-8 as the default encoding.

$url = 'http://www.lr-online.de/storage/rss/rss/sport.xml';

$accept = array(
    'type' => array('application/rss+xml', 'application/xml', 'application/rdf+xml', 'text/xml'),
    'charset' => array_diff(mb_list_encodings(), array('pass', 'auto', 'wchar', 'byte2be', 'byte2le', 'byte4be', 'byte4le', 'BASE64', 'UUENCODE', 'HTML-ENTITIES', 'Quoted-Printable', '7bit', '8bit'))
);
$header = array(
    'Accept: '.implode(', ', $accept['type']),
    'Accept-Charset: '.implode(', ', $accept['charset']),
);
$encoding = null;
$curl = curl_init($url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_HEADER, true);
curl_setopt($curl, CURLOPT_HTTPHEADER, $header);
$response = curl_exec($curl);
if (!$response) {
    // error fetching the response
} else {
    $offset = strpos($response, "\r\n\r\n");
    $header = substr($response, 0, $offset);
    if (!$header || !preg_match('/^Content-Type:\s+([^;]+)(?:;\s*charset=(.*))?/im', $header, $match)) {
        // error parsing the response
    } else {
        if (!in_array(strtolower($match[1]), array_map('strtolower', $accept['type']))) {
            // type not accepted
        }
        // The charset parameter is optional, so $match[2] may be missing.
        $encoding = isset($match[2]) ? strtolower(trim($match[2], " \t\r\n\"'")) : null;
    }
    // Extract the body up front; it is needed whichever way the encoding is found.
    $body = substr($response, $offset + 4);
    if (!$encoding) {
        if (preg_match('/^<\?xml\s+version=(?:"[^"]*"|\'[^\']*\')\s+encoding=("[^"]*"|\'[^\']*\')/s', $body, $match)) {
            $encoding = strtolower(trim($match[1], '"\''));
        }
    }
    if (!$encoding) {
        $encoding = 'utf-8';
    } else {
        if (!in_array($encoding, array_map('strtolower', $accept['charset']))) {
            // encoding not accepted
        }
        if ($encoding != 'utf-8') {
            $body = mb_convert_encoding($body, 'utf-8', $encoding);
        }
    }
    $simpleXML = simplexml_load_string($body, null, LIBXML_NOERROR);
    if (!$simpleXML) {
        // parse error
    } else {
        echo $simpleXML->asXML();
    }
}
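
The fallback order described above (Content-Type charset, then the XML declaration, then UTF-8) can also be isolated into a small pure function, which is easier to test than the full cURL flow. This is only a sketch: the function name declaredEncoding and its regular expressions are assumptions of mine, not part of the answer, and they deliberately handle only the common forms of both declarations.

```php
// Hypothetical helper: return the declared encoding label in lower case.
// $contentType is the value of the Content-Type header (or null if absent),
// $body is the raw response body.
function declaredEncoding(?string $contentType, string $body): string
{
    // 1. charset parameter of the Content-Type header value
    if ($contentType !== null
        && preg_match('/;\s*charset\s*=\s*["\']?([^"\';\s]+)/i', $contentType, $m)) {
        return strtolower($m[1]);
    }
    // 2. encoding attribute of the XML declaration (an optional UTF-8 BOM may precede it)
    if (preg_match('/^(?:\xEF\xBB\xBF)?<\?xml\s[^>]*?\bencoding\s*=\s*["\']([^"\']+)["\']/i', $body, $m)) {
        return strtolower($m[1]);
    }
    // 3. nothing declared: fall back to UTF-8
    return 'utf-8';
}

echo declaredEncoding('text/xml; charset=ISO-8859-1', '<?xml version="1.0" encoding="UTF-8"?><rss/>'), "\n"; // iso-8859-1
echo declaredEncoding('application/rss+xml', '<?xml version="1.0" encoding="Windows-1252"?><rss/>'), "\n";  // windows-1252
echo declaredEncoding(null, '<rss/>'), "\n";                                                                 // utf-8
```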

Answered by troelskn

Detecting the encoding is hard.

mb_detect_encoding works by guessing, based on a number of candidates that you pass it. In some encodings, certain byte sequences are invalid, and therefore it can distinguish between various candidates. Unfortunately, there are a lot of encodings where the same bytes are valid (but mean something different). In these cases, there is no way to determine the encoding; you can implement your own logic to make guesses in these cases. For example, data coming from a Japanese site might be more likely to have a Japanese encoding.
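
A short illustration of that guessing, assuming the mbstring extension is loaded. The candidate list and the sample byte strings are my own choices for the example: the second string is rejected as UTF-8 only because its bytes are invalid there, while the first string is valid in both candidates, which is exactly the ambiguity described above.

```php
$candidates = ['UTF-8', 'ISO-8859-1'];

$utf8Bytes   = "Fu\xC3\x9Fball"; // "Fußball" encoded as UTF-8
$latin1Bytes = "Fu\xDFball";     // "Fußball" encoded as ISO 8859-1

// Strict mode (third argument) rejects candidates for which the bytes are invalid.
echo mb_detect_encoding($utf8Bytes, $candidates, true), "\n";   // UTF-8
echo mb_detect_encoding($latin1Bytes, $candidates, true), "\n"; // ISO-8859-1

// The UTF-8 bytes are nevertheless also *valid* ISO 8859-1,
// so validity alone cannot settle which encoding was meant.
var_dump(mb_check_encoding($utf8Bytes, 'ISO-8859-1')); // bool(true)
```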

As long as you only deal with Western European languages, the three major encodings to consider are utf-8, iso-8859-1 and cp-1252. Since these are the defaults on many platforms, they are also the most likely to be reported wrongly. E.g., people who use a different encoding are likely to be frank about it, since otherwise their software would break very often. Therefore, a good strategy is to trust the provider, unless the encoding is reported as one of those three. You should still double-check that it is indeed valid, using mb_check_encoding (note that being valid in an encoding is not the same as being in that encoding: the same input may be valid for many encodings). If it is one of those, you can then use mb_detect_encoding to distinguish between them. Luckily that is fairly deterministic; you just need to use the proper detect sequence, which is UTF-8,ISO-8859-1,WINDOWS-1252.
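
The strategy in the previous paragraph could be sketched roughly as follows. The function name resolveEncoding, the restriction to names returned by mb_list_encodings() (aliases such as "latin1" simply fall through to detection), and the ISO-8859-1 fallback are all assumptions made for this example, not something the answer prescribes.

```php
// Hypothetical helper: decide which encoding to assume for $text,
// given the encoding label the provider declared (may be null).
function resolveEncoding(string $text, ?string $declared): string
{
    $ambiguous = ['utf-8', 'iso-8859-1', 'windows-1252'];
    $label     = strtolower(trim((string) $declared));
    $known     = array_map('strtolower', mb_list_encodings());

    // Trust the provider when it names something other than the three
    // usual defaults, as long as the bytes are at least valid in it.
    if ($label !== ''
        && !in_array($label, $ambiguous, true)
        && in_array($label, $known, true)
        && mb_check_encoding($text, $label)) {
        return $label;
    }

    // Otherwise distinguish between the three with the detect sequence.
    $detected = mb_detect_encoding($text, ['UTF-8', 'ISO-8859-1', 'Windows-1252'], true);

    // Assumption: fall back to ISO-8859-1 if detection fails.
    return $detected === false ? 'ISO-8859-1' : $detected;
}

// Declared as ISO 8859-1, but the bytes are really UTF-8:
echo resolveEncoding("Fu\xC3\x9Fball", 'ISO-8859-1'), "\n";
// Declared as ISO 8859-2, which is not one of the three, so it is trusted:
echo resolveEncoding("Vu\xE8ica", 'ISO-8859-2'), "\n";
```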

Once you've detected the encoding, you need to convert it to your internal representation (UTF-8 is the only sane choice). The function utf8_encode transforms ISO-8859-1 to UTF-8, so it can only be used for that particular input type. For other encodings, use mb_convert_encoding.
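
As a small illustration of why the source encoding matters at this step, the sketch below converts the same bytes once as ISO-8859-1 and once as Windows-1252. The sample bytes are my own; byte 0x80 is the euro sign in Windows-1252 but a control character in ISO-8859-1.

```php
$bytes = "\x80 5, Fu\xDFball";

$fromLatin1 = mb_convert_encoding($bytes, 'UTF-8', 'ISO-8859-1');
$fromCp1252 = mb_convert_encoding($bytes, 'UTF-8', 'Windows-1252');

// 0x80 read as ISO-8859-1 becomes the control character U+0080:
echo bin2hex(substr($fromLatin1, 0, 2)), "\n"; // c280
// 0x80 read as Windows-1252 becomes the euro sign U+20AC:
echo bin2hex(substr($fromCp1252, 0, 3)), "\n"; // e282ac

// Both source encodings agree on 0xDF (ß), so the tail is identical:
var_dump(substr($fromLatin1, -8) === substr($fromCp1252, -8)); // bool(true)
```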

Answered by harpax

A really nice way to implement an isUTF8 function can be found on php.net:

function isUTF8($string) {
    return (utf8_encode(utf8_decode($string)) == $string);
}
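
Two caveats apply to this round trip. utf8_decode() replaces every character that does not exist in ISO 8859-1 with "?", so the check rejects valid UTF-8 such as "€", and both utf8_encode() and utf8_decode() are deprecated as of PHP 8.2. A possible alternative, assuming the mbstring extension is available, is to ask mb_check_encoding() directly; the function name isValidUtf8 is a placeholder of mine.

```php
// Hypothetical replacement: true if $string is well-formed UTF-8.
function isValidUtf8(string $string): bool
{
    return mb_check_encoding($string, 'UTF-8');
}

var_dump(isValidUtf8("Fu\xC3\x9Fball")); // bool(true)  - "Fußball" in UTF-8
var_dump(isValidUtf8("\xE2\x82\xAC"));   // bool(true)  - "€", outside ISO 8859-1
var_dump(isValidUtf8("Fu\xDFball"));     // bool(false) - "Fußball" in ISO 8859-1
```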

Answered by miek

This cheatsheet lists some common caveats related to UTF-8 handling in PHP: http://developer.loftdigital.com/blog/php-utf-8-cheatsheet

This function detecting multibyte characters in a string might also prove helpful (source):

function detectUTF8($string)
{
    return preg_match('%(?:
        [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
        |\xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
        |[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
        |\xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
        |\xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
        |[\xF1-\xF3][\x80-\xBF]{3}         # planes 4-15
        |\xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
        )+%xs', 
    $string);
}

Answered by Krynble

A little heads up. You said that the "ß" should be displayed as "ÃŸ" in your database.

This is probably because you're using a database with Latin-1 character encoding, or possibly your PHP-MySQL connection is set up wrong. That is, PHP believes your MySQL is set to use UTF-8, so it sends data as UTF-8, but your MySQL believes PHP is sending data encoded as ISO 8859-1, so it may once again try to encode your sent data as UTF-8, causing this kind of trouble.

Take a look at mysql_set_charset. It may help you.

Answered by Ivan Vučica

Your encoding looks like you encoded into UTF-8 twice; that is, from some other encoding, into UTF-8, and again into UTF-8. As if you had ISO 8859-1, converted from ISO 8859-1 to UTF-8, and treated the new string as ISO 8859-1 for another conversion into UTF-8.

Here's some pseudocode of what you did:

$inputstring = getFromUser();
$utf8string = iconv($current_encoding, 'utf-8', $inputstring);
$flawedstring = iconv($current_encoding, 'utf-8', $utf8string);

You should try:

  1. detect encoding using mb_detect_encoding() or whatever you like to use
  2. if it's UTF-8, convert into ISO 8859-1, and repeat step 1
  3. finally, convert back into UTF-8

That presumes that you used ISO 8859-1 in the "middle" conversion. If you used Windows-1252, then convert into Windows-1252 (latin1). The original source encoding is not important; the one you used in the flawed second conversion is.
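
The steps above can be sketched as a loop that peels off one layer of UTF-8 encoding at a time. This is an illustrative variant rather than the exact procedure from the list: the function name undoDoubleEncoding, the Windows-1252 default for the "middle" encoding, the round limit, and the safety checks (stop when a step would lose characters or would no longer yield valid UTF-8) are all assumptions of mine. Note that it would also collapse text that legitimately contains sequences such as "Ã©".

```php
// Hypothetical helper: undo repeated "X -> UTF-8" conversions where X is $middle.
// The string is kept as valid UTF-8 throughout.
function undoDoubleEncoding(string $text, string $middle = 'Windows-1252', int $maxRounds = 3): string
{
    for ($i = 0; $i < $maxRounds; $i++) {
        if (!mb_check_encoding($text, 'UTF-8')) {
            break; // not UTF-8 at all, nothing to peel off
        }
        $narrowed = mb_convert_encoding($text, $middle, 'UTF-8');

        // Stop if the step was lossy (characters that do not exist in $middle).
        if (mb_convert_encoding($narrowed, 'UTF-8', $middle) !== $text) {
            break;
        }
        // Stop if nothing changed or the result is no longer valid UTF-8:
        // in both cases there is no further layer to remove.
        if ($narrowed === $text || !mb_check_encoding($narrowed, 'UTF-8')) {
            break;
        }
        $text = $narrowed;
    }
    return $text;
}

$double = "Fu\xC3\x83\xC5\xB8ball";           // "Fußball" encoded to UTF-8 twice via Windows-1252
echo bin2hex(undoDoubleEncoding($double)), "\n"; // 4675c39f62616c6c
```

A correctly encoded string passes through unchanged, because its first narrowing step no longer yields valid UTF-8.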

This is my guess at what happened; there's very little else you could have done to get four bytes in place of one extended ASCII byte.

The German language also uses ISO 8859-2 and Windows-1250 (Latin-2).

Answered by cavila

You need to test the character set on input since responses can come coded with different encodings.

I force all incoming content into UTF-8 by detecting and converting it with the following function:

function fixRequestCharset()
{
  $ref = array(&$_GET, &$_POST, &$_REQUEST);
  foreach ($ref as &$var)
  {
    foreach ($var as $key => $val)
    {
      $encoding = mb_detect_encoding($var[$key], mb_detect_order(), true);
      if (!$encoding)
        continue;
      if (strcasecmp($encoding, 'UTF-8') != 0)
      {
        $encoding = iconv($encoding, 'UTF-8', $var[$key]);
        if ($encoding === false)
          continue;
        $var[$key] = $encoding;
      }
    }
  }
}

That routine will turn all PHP variables that come from the remote host into UTF-8, or leave a value untouched if its encoding could not be detected or converted.


You can customize it to your needs.

Just invoke it before using the variables.

Answered by Halil Özgür

The interesting thing about mb_detect_encoding and mb_convert_encoding is that the order of the encodings you suggest does matter:

// $input is actually UTF-8

mb_detect_encoding($input, "ISO-8859-9, UTF-8");
// ISO-8859-9 (WRONG!)

mb_detect_encoding($input, "UTF-8, ISO-8859-9");
// UTF-8 (OK)

So you might want to use a specific order when specifying expected encodings. Still, keep in mind that this is not foolproof.

Answered by Mauro

I have been looking for solutions to encoding problems for ages, and this page is probably the conclusion of years of searching! I tested some of the suggestions mentioned here, and here are my notes:

This is my test string:

this is a "wròng wrìtten" string bùt I nèed to pù 'sòme' special chàrs to see thèm, convertèd by fùnctìon!! & that's it!

I do an INSERT to save this string in a database field that is set to utf8_general_ci.

The character set of my page is UTF-8.

If I do an INSERT just like that, in my database, I have some characters probably coming from Mars...

So I need to convert them into some "sane" UTF-8. I tried utf8_encode(), but still aliens chars were invading my database...

So I tried to use the function forceUTF8 posted in answer number 8, but the string saved in the database looks like this:

this is a "wrÃ²ng wrÃ¬tten" string bÃ¹t I nÃ¨ed to pÃ¹ 'sÃ²me' special chÃ rs to see thÃ¨m, convertÃ¨d by fÃ¹nctÃ¬on!! & that's it!

So collecting some more information on this page and merging them with other information on other pages I solved my problem with this solution:

$finallyIDidIt = mb_convert_encoding(
  $string,
  mysql_client_encoding($resourceID),
  mb_detect_encoding($string)
);

Now in my database I have my string with correct encoding.

NOTE: The only thing to take care of is the function mysql_client_encoding! You need to be connected to the database, because this function wants a resource ID as a parameter.

But well, I just do that re-encoding before my INSERT so for me it is not a problem.
