Note: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me). Original Stack Overflow question: http://stackoverflow.com/questions/1523460/

Time: 2020-08-25 02:56:01  Source: igfitidea

Ensuring valid UTF-8 in PHP

php encoding utf-8

Asked by Brian

I'm using PHP to handle text from a variety of sources. I don't anticipate it will be anything other than UTF-8, ISO 8859-1, or perhaps Windows-1252. If it's anything other than one of those, I just need to make sure the text gets turned into a valid UTF-8 string, even if characters are lost. Does the //TRANSLIT option of iconv solve this?

For example, would this code ensure that a string is safe to insert into a UTF-8 encoded document (or database)?

function make_safe_for_utf8_use($string) {

    $encoding = mb_detect_encoding($string, "UTF-8,ISO-8859-1,WINDOWS-1252");

    if ($encoding != 'UTF-8') {
        return iconv($encoding, 'UTF-8//TRANSLIT', $string);
    }
    else {
        return $string;
    }
}

Answered by bobince

UTF-8 can store any Unicode character. If your encoding is anything else at all, including ISO-8859-1 or Windows-1252, UTF-8 can store every character in it. So you don't have to worry about losing any characters when you convert a string from any other encoding to UTF-8.

Further, both ISO-8859-1 and Windows-1252 are single-byte encodings where any byte is valid. It is not technically possible to distinguish between them. I would choose Windows-1252 as your default match for non-UTF-8 sequences, as the only bytes that decode differently are in the range 0x80-0x9F. These decode to various characters like smart quotes and the Euro sign in Windows-1252, whereas in ISO-8859-1 they are invisible control characters which are almost never used. Web browsers may sometimes say they are using ISO-8859-1, but often they will really be using Windows-1252.
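
To make the 0x80-0x9F difference concrete, here is a small sketch that converts the same single byte from both encodings with iconv(). The variable names are illustrative only, and it assumes the local iconv build knows the names 'CP1252' and 'ISO-8859-1'.

```php
<?php
// Byte 0x80 is the Euro sign in Windows-1252 but a C1 control character in ISO-8859-1.
$byte = "\x80";

$fromCp1252 = iconv('CP1252', 'UTF-8', $byte);      // U+20AC EURO SIGN
$fromLatin1 = iconv('ISO-8859-1', 'UTF-8', $byte);  // U+0080 (invisible control character)

echo bin2hex($fromCp1252), "\n"; // e282ac
echo bin2hex($fromLatin1), "\n"; // c280
```

Both results are valid UTF-8; only the character the byte is taken to mean differs, which is why defaulting to Windows-1252 usually gives the more useful output.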

"Would this code ensure that a string is safe to insert into a UTF-8 encoded document?"

You would certainly want to set the optional 'strict' parameter of mb_detect_encoding to TRUE for this purpose. But I'm not sure this actually covers all invalid UTF-8 sequences. The function does not claim to check a byte sequence for UTF-8 validity explicitly. There have been known cases where mb_detect_encoding guessed UTF-8 incorrectly, though I don't know if that can still happen in strict mode.

If you want to be sure, do it yourself using the W3-recommended regex:

if (preg_match('%^(?:
      [\x09\x0A\x0D\x20-\x7E]            # ASCII
    | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
    | \xE0[\xA0-\xBF][\x80-\xBF]         # excluding overlongs
    | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
    | \xED[\x80-\x9F][\x80-\xBF]         # excluding surrogates
    | \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
    | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
    | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
)*$%xs', $string))
    return $string;
else
    return iconv('CP1252', 'UTF-8', $string);

Answered by Frosty Z

With the mbstring library, you have mb_check_encoding().

Example of use:

mb_check_encoding($string, 'UTF-8');
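
Building on that call, here is a minimal sketch of how the question's function might look if validation is done with mb_check_encoding() and anything else is treated as Windows-1252, as bobince's answer suggests. The name ensure_utf8() is hypothetical, and the ISO-8859-1 fallback is an assumption added here for the case where iconv() rejects a byte that the local CP1252 table leaves undefined.

```php
<?php
// Hypothetical helper (not from the answers): return $string as valid UTF-8.
function ensure_utf8(string $string): string
{
    // Already well-formed UTF-8: nothing to do.
    if (mb_check_encoding($string, 'UTF-8')) {
        return $string;
    }

    // Otherwise assume Windows-1252. The @ hides the notice iconv() emits
    // when it meets a byte it cannot map; in that case it returns false.
    $converted = @iconv('CP1252', 'UTF-8', $string);
    if ($converted !== false) {
        return $converted;
    }

    // Assumed fallback: ISO-8859-1 defines all 256 byte values,
    // so this conversion should not fail on unmapped bytes.
    return (string) iconv('ISO-8859-1', 'UTF-8', $string);
}

echo bin2hex(ensure_utf8("caf\xE9")), "\n"; // 636166c3a9
```

Valid UTF-8 input is returned unchanged, so calling the helper twice on the same value gives the same result as calling it once.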

With PHP 7.1.9 on a recent Windows 10 system, the regex solution outperforms mb_check_encoding() for any string length (still 20,000 iterations):

  • 10 characters: regex => 4 ms, mb_check_encoding() => 64 ms
  • 10,000 characters: regex => 125 ms, mb_check_encoding() => 2.4 s
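
The figures above depend on the PHP version and the machine, so it may be worth re-running the comparison locally. The following is only a sketch of such a benchmark, not the code the answer used: time_validator(), $validators and the iteration count are assumptions, and it needs PHP 7.4 or later for arrow functions.

```php
<?php
// The W3C-style pattern from bobince's answer, without the inline comments.
$utf8Regex = '%^(?:
      [\x09\x0A\x0D\x20-\x7E]
    | [\xC2-\xDF][\x80-\xBF]
    | \xE0[\xA0-\xBF][\x80-\xBF]
    | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}
    | \xED[\x80-\x9F][\x80-\xBF]
    | \xF0[\x90-\xBF][\x80-\xBF]{2}
    | [\xF1-\xF3][\x80-\xBF]{3}
    | \xF4[\x80-\x8F][\x80-\xBF]{2}
)*$%xs';

// Hypothetical helper: run $validator $iterations times, return elapsed milliseconds.
function time_validator(callable $validator, string $subject, int $iterations): float
{
    $start = microtime(true);
    for ($i = 0; $i < $iterations; $i++) {
        $validator($subject);
    }
    return (microtime(true) - $start) * 1000.0;
}

$validators = [
    'regex'             => fn(string $s): bool => preg_match($utf8Regex, $s) === 1,
    'mb_check_encoding' => fn(string $s): bool => mb_check_encoding($s, 'UTF-8'),
];

// 2,000 iterations keeps the run short; raise it to 20,000 to mirror the answer.
$iterations = 2000;
foreach ([10, 10000] as $length) {
    $subject = str_repeat('a', $length);
    foreach ($validators as $name => $validator) {
        $ms = time_validator($validator, $subject, $iterations);
        printf("%5d chars  %-18s %8.1f ms\n", $length, $name, $ms);
    }
}
```

One caveat: on very long subjects preg_match() can return false because a PCRE limit was hit rather than because the input is invalid, so it is worth checking preg_last_error() before trusting a surprisingly fast regex timing.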

Answered by eyecatchUp

Just a note: Instead of using the often recommended (rather complex) regular expression by W3C, you can simply use the 'u' modifier to test a string for UTF-8 validity:

<?php
  if (preg_match("//u", $string)) {
      // $string is valid UTF-8
  }
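
The check works because, with the 'u' modifier, PCRE validates the subject before matching: on malformed UTF-8, preg_match() returns false and preg_last_error() reports PREG_BAD_UTF8_ERROR. A small sketch that wraps this in a function (the name is_valid_utf8() is made up for illustration):

```php
<?php
// Hypothetical wrapper around the 'u'-modifier check.
function is_valid_utf8(string $string): bool
{
    // The empty pattern matches any valid subject, so the result is 1
    // for well-formed UTF-8 and false when PCRE rejects the input.
    return preg_match('//u', $string) === 1;
}

var_dump(is_valid_utf8("caf\xC3\xA9")); // bool(true)
var_dump(is_valid_utf8("caf\xE9"));     // bool(false)
```

Comparing against 1 rather than relying on truthiness makes the failure case explicit, since preg_match() can return both 0 and false.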

Answered by Martijn

Have a look at http://www.phpwact.org/php/i18n/charsets for a guide about character sets. This page links to a page specifically for UTF-8.

Answered by Nadir

Answer to "iconv is idempotent":

iconv is not idempotent either.

A big difference between utf8_encode() and iconv() is that iconv() may raise errors such as "Detected an incomplete multibyte character in input string", even with:

iconv('ISO-8859-1', 'UTF-8'.'//IGNORE', $str)

In the code from the question:

$encoding = mb_detect_encoding($string, "UTF-8,ISO-8859-1,WINDOWS-1252");

You have to know how mb_detect_encoding behaves: it can answer UTF-8 even for invalid (badly formed) UTF-8 strings.

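
One way to guard against that is to treat a "UTF-8" answer from mb_detect_encoding() as a guess and confirm it with mb_check_encoding(). The sketch below is an assumption-laden illustration, not code from the thread: detect_encoding_checked() is a hypothetical name, the candidate list drops ISO-8859-1 because it cannot be told apart from Windows-1252, and Windows-1252 is used as the default whenever detection is not confirmed.

```php
<?php
// Hypothetical helper: detect an encoding, but never trust "UTF-8" without validating it.
function detect_encoding_checked(string $string): string
{
    // Third argument true enables strict mode.
    $detected = mb_detect_encoding($string, ['UTF-8', 'Windows-1252'], true);

    // Detection failed, or claimed UTF-8 for bytes that are not well-formed UTF-8.
    if ($detected === false
        || ($detected === 'UTF-8' && !mb_check_encoding($string, 'UTF-8'))) {
        return 'Windows-1252';
    }

    return $detected;
}

echo detect_encoding_checked("caf\xE9"), "\n"; // Windows-1252
```

The returned name can then be passed as the source encoding to iconv() or mb_convert_encoding().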