PHP: How do I detect whether a string needs UTF-8 decoding or encoding?

Notice: this page reproduces a popular Stack Overflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me). Original Stack Overflow question: http://stackoverflow.com/questions/4407854/

How do I detect if I have to apply UTF-8 decode or encode on a string?

Tags: php, encoding, utf-8

Asked by Pentium10

I have a feed taken from third-party sites, and sometimes I have to apply utf8_decode and other times utf8_encode to get the desired visible output.

If the same function is applied twice by mistake, or the wrong one is used, the output gets even more garbled. This is what I want to fix.

How can I detect which of the two, if any, has to be applied to a given string?

Actually, the content is returned as UTF-8, but some parts inside it are not UTF-8.

Answered by bisko

I can't say I can rely on mb_detect_encoding(). I had some freaky false positives a while back.

The most universal way I found to work well in every case was:

if (preg_match('!!u', $string))
{
   // This is UTF-8
}
else
{
   // Definitely not UTF-8
}
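
The check above can be wrapped in a small helper. This is only a sketch: the function name is_valid_utf8() and the sample byte strings are made up for illustration and are not part of the original answer. It relies on the fact that, with the u modifier, PCRE validates the subject as UTF-8 before matching, so preg_match() returns 1 for a valid string and false for an invalid one.

```php
<?php
// Hypothetical helper (not from the original answer): wraps the regex check.
function is_valid_utf8(string $s): bool
{
    // The empty pattern matches any string; the /u modifier makes PCRE
    // reject the subject first if it is not well-formed UTF-8.
    return preg_match('//u', $s) === 1;
}

$valid   = "Chr\xC3\xA9tiens"; // "é" as the two-byte UTF-8 sequence C3 A9
$invalid = "Chr\xE9tiens";     // "é" as the single ISO-8859-1 byte E9

var_dump(is_valid_utf8($valid));
var_dump(is_valid_utf8($invalid));
```

Note that this only answers "is it well-formed UTF-8?". A double-encoded string is still well-formed UTF-8, so this check alone cannot tell that utf8_decode is needed.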

Answered by Gordon

You can use mb_detect_encoding() to detect the character encoding.

The character set might also be available in the HTTP response headers or in the response data itself.

Example:

var_dump(
    mb_detect_encoding(
        file_get_contents('http://stackoverflow.com/questions/4407854')
    ),
    $http_response_header
);

Output (codepad):

string(5) "UTF-8"
array(9) {
  [0]=>
  string(15) "HTTP/1.1 200 OK"
  [1]=>
  string(33) "Cache-Control: public, max-age=11"
  [2]=>
  string(38) "Content-Type: text/html; charset=utf-8"
  [3]=>
  string(38) "Expires: Fri, 10 Dec 2010 10:40:07 GMT"
  [4]=>
  string(44) "Last-Modified: Fri, 10 Dec 2010 10:39:07 GMT"
  [5]=>
  string(7) "Vary: *"
  [6]=>
  string(35) "Date: Fri, 10 Dec 2010 10:39:55 GMT"
  [7]=>
  string(17) "Connection: close"
  [8]=>
  string(21) "Content-Length: 34119"
}
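
To act on the charset shown in the Content-Type header above, the header lines can be scanned with a regular expression. The following is a minimal sketch; the helper name charset_from_headers() is hypothetical, and it assumes the headers come as an array of raw lines in the same shape as $http_response_header.

```php
<?php
// Hypothetical helper: returns the charset named in the last Content-Type
// header line, lowercased, or null if none declares a charset.
function charset_from_headers(array $headers): ?string
{
    $charset = null;
    foreach ($headers as $line) {
        // Keep overwriting so that, after redirects, the final response wins.
        if (preg_match('/^Content-Type:.*?charset\s*=\s*"?([^\s";]+)/i', $line, $m)) {
            $charset = strtolower($m[1]);
        }
    }
    return $charset;
}

$headers = [
    'HTTP/1.1 200 OK',
    'Content-Type: text/html; charset=utf-8',
    'Connection: close',
];
var_dump(charset_from_headers($headers)); // string(5) "utf-8"
```

A declared charset is only a claim by the server. As the question notes, a feed labelled UTF-8 can still contain parts that are not, so it may be worth validating the body as well.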

Answered by George SEDRA

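function str_to_utf8($str) {
    // Undo one UTF-8 encoding pass (note: utf8_decode() is deprecated as of PHP 8.2).
    $decoded = utf8_decode($str);
    // If the result is no longer valid UTF-8, the input was not double-encoded: keep it.
    if (mb_detect_encoding($decoded, 'UTF-8', true) === false)
        return $str;
    // Otherwise the input was encoded twice; return the repaired string.
    return $decoded;
}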

var_dump(str_to_utf8("« Chrétiens d'Orient » : la RATP fait marche arrière"));
//string '« Chrétiens d'Orient » : la RATP fait marche arrière' (length=56)
var_dump(str_to_utf8("Â« ChrÃ©tiens d'Orient Â» : la RATP fait marche arriÃ¨re"));
//string '« Chrétiens d'Orient » : la RATP fait marche arrière' (length=56)
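
Since utf8_decode() is deprecated as of PHP 8.2, the same idea can be sketched with mb_convert_encoding(). This is an illustrative variant, not the answer author's code: the name undo_double_utf8() is hypothetical, and the round-trip check is an added assumption meant to avoid damaging strings that contain characters outside ISO-8859-1 (which would otherwise be replaced by "?").

```php
<?php
// Hypothetical variant of str_to_utf8() that avoids the deprecated utf8_decode().
function undo_double_utf8(string $str): string
{
    // Input that is not valid UTF-8 at all is left untouched.
    if (!mb_check_encoding($str, 'UTF-8')) {
        return $str;
    }
    // Reverse one encoding pass: UTF-8 text -> ISO-8859-1 bytes.
    $decoded = mb_convert_encoding($str, 'ISO-8859-1', 'UTF-8');
    // Accept the result only if it is itself valid UTF-8 and re-encoding it
    // reproduces the input exactly (no character was lost on the way).
    if (mb_check_encoding($decoded, 'UTF-8')
        && mb_convert_encoding($decoded, 'UTF-8', 'ISO-8859-1') === $str) {
        return $decoded;
    }
    return $str;
}

$once  = "arri\xC3\xA8re";         // "arrière" encoded once
$twice = "arri\xC3\x83\xC2\xA8re"; // the same word after a second, mistaken pass

var_dump(undo_double_utf8($once) === $once);
var_dump(undo_double_utf8($twice) === $once);
```

This remains a heuristic: text that legitimately contains a sequence such as "Ã¨" would be "repaired" as well, so it is best applied only to feeds known to suffer from double encoding.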

Answered by Álvaro González

Encoding autodetection is not bullet-proof, but you can try mb_detect_encoding(). See also mb_check_encoding().
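
A short sketch of how the two functions differ, assuming a candidate list of just UTF-8 and ISO-8859-1 (the list and the sample strings are illustrative choices, not from the answer). mb_check_encoding() validates a string against one named encoding, while mb_detect_encoding() picks among candidates and can only guess when several of them fit.

```php
<?php
$utf8   = "Chr\xC3\xA9tiens"; // UTF-8 bytes
$latin1 = "Chr\xE9tiens";     // ISO-8859-1 bytes

// Validation: is the string well-formed in the given encoding?
var_dump(mb_check_encoding($utf8, 'UTF-8'));
var_dump(mb_check_encoding($latin1, 'UTF-8'));

// Detection in strict mode with an explicit candidate list.
// Every byte sequence is valid ISO-8859-1, so that candidate never fails;
// this is one reason detection cannot be fully reliable.
var_dump(mb_detect_encoding($utf8, ['UTF-8', 'ISO-8859-1'], true));
var_dump(mb_detect_encoding($latin1, ['UTF-8', 'ISO-8859-1'], true));
```

When the only real question is "is this valid UTF-8 or not?", mb_check_encoding() gives a definite answer and avoids the guessing step.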

Answered by Femaref

The feed (I guess you mean some kind of XML-based feed) should have an attribute in the header telling you what the encoding is. If not, you are out of luck as you don't have a reliable means of identifying the encoding.
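
For an XML-based feed, the declared encoding sits in the XML declaration on the first line. Below is a minimal sketch of reading it; the helper name xml_declared_encoding() and the sample feed are hypothetical, and the regular expression only covers the common form of the declaration.

```php
<?php
// Hypothetical helper: returns the encoding named in the XML declaration
// (uppercased), or null if the document has no encoding declaration.
function xml_declared_encoding(string $xml): ?string
{
    // Allow an optional UTF-8 byte order mark before "<?xml".
    $pattern = '/^(?:\xEF\xBB\xBF)?<\?xml[^>]*\bencoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']/';
    if (preg_match($pattern, $xml, $m)) {
        return strtoupper($m[1]);
    }
    return null;
}

$feed = '<?xml version="1.0" encoding="ISO-8859-1"?><rss version="2.0"></rss>';
var_dump(xml_declared_encoding($feed));
var_dump(xml_declared_encoding('<rss version="2.0"></rss>'));
```

When no encoding is declared and there is no byte order mark, the XML specification requires the document to be UTF-8. Whether a given third-party feed actually honours its own declaration is a separate matter.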