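PHP: How do I detect whether UTF-8 decoding or encoding must be applied to a string?
Notice: this page is a Chinese-English side-by-side translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use it, you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/4407854/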
How do I detect if I have to apply UTF-8 decode or encode on a string?
Asked by Pentium10
I have a feed taken from third-party sites, and sometimes I have to apply utf8_decode and other times utf8_encode to get the desired visible output.
If by mistake the same function is applied twice, or the wrong one is used, I get even uglier output. This is what I want to fix.
How can I detect which of the two has to be applied to a given string?
Actually the content is returned as UTF-8, but some parts inside it are not.
Answered by bisko
I can't say I can rely on mb_detect_encoding(). I had some freaky false positives a while back.
The most universal way I found to work well in every case was:
if (preg_match('!!u', $string))
{
// This is UTF-8
}
else
{
// Definitely not UTF-8
}
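For reuse, the check above can be wrapped in a small function. This is only a sketch: the name looks_like_utf8 is made up for illustration, and the empty pattern '//u' is used in place of '!!u' (same effect, different delimiter). It relies on preg_match() returning false rather than 0 when the subject is not valid UTF-8 under the /u modifier.

```php
<?php
// Hypothetical helper name; wraps the preg_match() check from the answer above.
function looks_like_utf8(string $s): bool
{
    // With /u, PCRE validates the subject as UTF-8 before matching.
    // An empty pattern always matches a valid subject, so the result is 1;
    // a malformed subject makes preg_match() return false.
    return preg_match('//u', $s) === 1;
}

$utf8   = "caf\xC3\xA9"; // "café" encoded as UTF-8
$latin1 = "caf\xE9";     // "café" encoded as ISO-8859-1

var_dump(looks_like_utf8($utf8));   // bool(true)
var_dump(looks_like_utf8($latin1)); // bool(false)
```

Keep in mind that this only says whether the bytes are well-formed UTF-8. Plain ASCII passes, and so does text that was UTF-8-encoded twice, so a true result does not by itself mean the string will display correctly.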
Answered by Gordon
You can use mb_detect_encoding (Detect character encoding).
The character set might also be available in the HTTP response headers or in the response data itself.
Example:
var_dump(
mb_detect_encoding(
file_get_contents('http://stackoverflow.com/questions/4407854')
),
$http_response_header
);
Output (codepad):
string(5) "UTF-8"
array(9) {
[0]=>
string(15) "HTTP/1.1 200 OK"
[1]=>
string(33) "Cache-Control: public, max-age=11"
[2]=>
string(38) "Content-Type: text/html; charset=utf-8"
[3]=>
string(38) "Expires: Fri, 10 Dec 2010 10:40:07 GMT"
[4]=>
string(44) "Last-Modified: Fri, 10 Dec 2010 10:39:07 GMT"
[5]=>
string(7) "Vary: *"
[6]=>
string(35) "Date: Fri, 10 Dec 2010 10:39:55 GMT"
[7]=>
string(17) "Connection: close"
[8]=>
string(21) "Content-Length: 34119"
}
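To use the header rather than just dump it, the charset can be pulled out of the Content-Type line. A minimal sketch, assuming the headers arrive as an array of raw header lines like the one above; charset_from_headers is a hypothetical name, and the regular expression is a simplification that does not cover every legal header form.

```php
<?php
// Hypothetical helper: returns the charset named in a Content-Type header
// line (lowercased), or null when no such line carries a charset.
function charset_from_headers(array $headers): ?string
{
    $charset = null;
    foreach ($headers as $line) {
        if (preg_match('/^Content-Type:.*?charset\s*=\s*"?([^";\s]+)/i', $line, $m)) {
            // Keep the last match: after redirects the array can hold
            // the headers of several responses in a row.
            $charset = strtolower($m[1]);
        }
    }
    return $charset;
}

$headers = [
    'HTTP/1.1 200 OK',
    'Content-Type: text/html; charset=utf-8',
];
var_dump(charset_from_headers($headers)); // string(5) "utf-8"
```

After a file_get_contents() call on an HTTP URL, the same function can be given $http_response_header. The declared charset is only what the server claims, which, as the question notes, may not hold for every part of the body.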
Answered by George SEDRA
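function str_to_utf8 ($str) {
    // Try to strip one layer of UTF-8 encoding (utf8_decode() is deprecated as of PHP 8.2).
    $decoded = utf8_decode($str);
    // If the result is not valid UTF-8, the input was not double-encoded: keep it as it was.
    if (mb_detect_encoding($decoded, 'UTF-8', true) === false)
        return $str;
    return $decoded;
}
// Correctly encoded input is returned unchanged.
var_dump(str_to_utf8("« Chrétiens d'Orient » : la RATP fait marche arrière"));
//string '« Chrétiens d'Orient » : la RATP fait marche arrière' (length=56)
// Double-encoded input is decoded once.
var_dump(str_to_utf8("Â« ChrÃ©tiens d'Orient Â» : la RATP fait marche arriÃ¨re"));
//string '« Chrétiens d'Orient » : la RATP fait marche arrière' (length=56)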
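One caveat with the function above: utf8_decode() turns every character that does not exist in ISO-8859-1 into "?", and a run of "?" is valid UTF-8, so text in other scripts would come back damaged. Below is a sketch of the same idea with a round-trip guard, written with mbstring functions instead of utf8_decode(). The name undo_double_utf8 is hypothetical, and the sketch assumes the mbstring extension is loaded and that the extra encoding layer treated the bytes as ISO-8859-1 (not Windows-1252).

```php
<?php
// Hypothetical helper: removes one layer of accidental UTF-8 encoding,
// and returns the input unchanged whenever that cannot be done losslessly.
function undo_double_utf8(string $str): string
{
    // Not valid UTF-8 at all, so it cannot be double-encoded UTF-8.
    if (!mb_check_encoding($str, 'UTF-8')) {
        return $str;
    }
    // Map every code point back to a single ISO-8859-1 byte.
    $decoded = mb_convert_encoding($str, 'ISO-8859-1', 'UTF-8');
    // Round-trip guard: if re-encoding does not reproduce the input,
    // some characters did not fit into ISO-8859-1 and were replaced.
    if (mb_convert_encoding($decoded, 'UTF-8', 'ISO-8859-1') !== $str) {
        return $str;
    }
    // Only accept the decoded bytes if they are themselves valid UTF-8.
    return mb_check_encoding($decoded, 'UTF-8') ? $decoded : $str;
}

$once  = "caf\xC3\xA9";         // "café", encoded once
$twice = "caf\xC3\x83\xC2\xA9"; // "café", encoded twice

var_dump(undo_double_utf8($once) === $once);  // bool(true)
var_dump(undo_double_utf8($twice) === $once); // bool(true)
```

Like the original, this is a heuristic: a correctly encoded string that happens to consist of sequences such as "Ã©" would be decoded too.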
Answered by Álvaro González
Encoding auto-detection is not bullet-proof, but you can try mb_detect_encoding(). See also mb_check_encoding().
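A sketch of how the two functions might be combined so that converting twice does no harm: validate first, and only guess when validation fails. The function name to_utf8 and the default candidate list are assumptions for illustration. ISO-8859-1 as the fallback is a guess about the feed, since every byte sequence is valid ISO-8859-1.

```php
<?php
// Hypothetical helper: returns $s as UTF-8, converting only when needed.
function to_utf8(string $s, array $candidates = ['UTF-8', 'ISO-8859-1']): string
{
    // Already well-formed UTF-8: leave it alone, so repeated calls are safe.
    if (mb_check_encoding($s, 'UTF-8')) {
        return $s;
    }
    // Strict detection against an explicit candidate list.
    $from = mb_detect_encoding($s, $candidates, true);
    if ($from === false) {
        return $s; // nothing plausible found; hand the bytes back untouched
    }
    return mb_convert_encoding($s, 'UTF-8', $from);
}

$latin1 = "caf\xE9"; // "café" encoded as ISO-8859-1
$fixed  = to_utf8($latin1);

var_dump($fixed === "caf\xC3\xA9");  // bool(true)
var_dump(to_utf8($fixed) === $fixed); // bool(true), second pass changes nothing
```

Because the feed in the question mixes encodings, a helper like this would have to be applied per item or per field rather than to the whole document at once.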
Answered by Femaref
The feed (I guess you mean some kind of XML-based feed) should have an attribute in the header telling you what the encoding is. If not, you are out of luck as you don't have a reliable means of identifying the encoding.
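If the feed is XML, the attribute in question is presumably the encoding in the XML declaration on the first line. A minimal sketch for reading it, with a hypothetical function name and a deliberately simple pattern; it assumes the declaration sits at the very start of the data, with no byte order mark in front of it.

```php
<?php
// Hypothetical helper: returns the encoding named in the XML declaration
// (uppercased), or null when the document does not declare one.
function xml_declared_encoding(string $xml): ?string
{
    $pattern = '/^\s*<\?xml[^>]*\bencoding\s*=\s*["\']([A-Za-z0-9._-]+)["\']/i';
    if (preg_match($pattern, $xml, $m)) {
        return strtoupper($m[1]);
    }
    return null;
}

$feed = '<?xml version="1.0" encoding="iso-8859-1"?><rss version="2.0"></rss>';
var_dump(xml_declared_encoding($feed)); // string(10) "ISO-8859-1"
```

An XML document without a declared encoding is normally treated as UTF-8 or UTF-16, so a null result here is not necessarily an error.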