PHP: how to detect whether UTF-8 decoding or encoding must be applied to a string?

Question

Asked by Pentium10

I have a feed taken from third-party sites, and sometimes I have to apply utf8_decode and other times utf8_encode to get the desired visible output.

If the same function is applied twice by mistake, or the wrong one is used, the output gets even uglier. This is what I want to fix.

How can I detect which function, if any, has to be applied to a given string?

Actually, the content is returned as UTF-8, but some parts inside it are not UTF-8.

Answer 1

Answered by bisko

I can't say I can rely on mb_detect_encoding(). I had some freaky false positives a while back.

The most universal way I found to work well in every case was:

if (preg_match('!!u', $string))
{
   // This is UTF-8
}
else
{
   // Definitely not UTF-8
}
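
Building on that check, here is a minimal sketch of how it could drive the "convert or leave alone" decision. The helper name ensure_utf8() and the ISO-8859-1 fallback are my assumptions, not part of the answer above; if the feed actually mixes in some other encoding, the fallback has to change.

```php
<?php
// Hypothetical helper: returns $s unchanged when it is already valid UTF-8,
// otherwise converts it from an ASSUMED fallback encoding.
function ensure_utf8(string $s, string $fallback = 'ISO-8859-1'): string
{
    // With the /u modifier PCRE validates the subject as UTF-8;
    // preg_match() returns false (not 0) when that validation fails.
    if (preg_match('//u', $s) === 1) {
        return $s; // already UTF-8: converting again would double-encode it
    }
    return mb_convert_encoding($s, 'UTF-8', $fallback);
}

var_dump(ensure_utf8("caf\xE9") === "caf\xC3\xA9");     // bool(true): converted
var_dump(ensure_utf8("caf\xC3\xA9") === "caf\xC3\xA9"); // bool(true): left alone
```

Because valid UTF-8 input is returned untouched, calling the helper twice on the same string should be harmless, which is the "applied twice" problem from the question.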

Answer 2

Answered by Gordon

You can use mb_detect_encoding(), which detects the character encoding of a string.

The character set might also be available in the HTTP response headers or in the response data itself.

Example:

var_dump(
    mb_detect_encoding(
        file_get_contents('http://stackoverflow.com/questions/4407854')
    ),
    $http_response_header
);

Output (codepad):

string(5) "UTF-8"
array(9) {
  [0]=>
  string(15) "HTTP/1.1 200 OK"
  [1]=>
  string(33) "Cache-Control: public, max-age=11"
  [2]=>
  string(38) "Content-Type: text/html; charset=utf-8"
  [3]=>
  string(38) "Expires: Fri, 10 Dec 2010 10:40:07 GMT"
  [4]=>
  string(44) "Last-Modified: Fri, 10 Dec 2010 10:39:07 GMT"
  [5]=>
  string(7) "Vary: *"
  [6]=>
  string(35) "Date: Fri, 10 Dec 2010 10:39:55 GMT"
  [7]=>
  string(17) "Connection: close"
  [8]=>
  string(21) "Content-Length: 34119"
}
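
To actually use the header shown in the dump, the charset has to be pulled out of the Content-Type line. A rough sketch follows; charset_from_headers() is a hypothetical helper of mine, and it assumes the input is an array of raw header lines shaped like $http_response_header. If several Content-Type lines are present, it keeps the last match.

```php
<?php
// Hypothetical helper: scans raw header lines and returns the charset named
// in the last Content-Type header, upper-cased, or null when none is declared.
function charset_from_headers(array $headers): ?string
{
    $charset = null;
    foreach ($headers as $line) {
        // Accepts both charset=utf-8 and charset="utf-8".
        if (preg_match('/^Content-Type:.*?;\s*charset\s*=\s*"?([^\s;"]+)/i', $line, $m)) {
            $charset = strtoupper($m[1]);
        }
    }
    return $charset;
}

$headers = [
    'HTTP/1.1 200 OK',
    'Content-Type: text/html; charset=utf-8',
];
var_dump(charset_from_headers($headers)); // string(5) "UTF-8"
```

Keep in mind that the declared charset is only a claim by the server. The question describes a feed that says UTF-8 but contains other bytes, so the result is probably best treated as a hint to combine with a validity check.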

Answer 3

Answered by George SEDRA

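function str_to_utf8 ($str) {
    // Undo one level of UTF-8 encoding (UTF-8 -> ISO-8859-1).
    // Note: utf8_decode() is deprecated as of PHP 8.2.
    $decoded = utf8_decode($str);
    // If the result is not valid UTF-8, the input was not double-encoded: keep it.
    if (mb_detect_encoding($decoded , 'UTF-8', true) === false)
        return $str;
    // Otherwise the input was encoded twice: return the repaired string.
    return $decoded;
}

var_dump(str_to_utf8("« Chrétiens d'Orient » : la RATP fait marche arrière"));
//string '« Chrétiens d'Orient » : la RATP fait marche arrière' (length=56)
var_dump(str_to_utf8("Â« ChrÃ©tiens d'Orient Â» : la RATP fait marche arriÃ¨re"));
//string '« Chrétiens d'Orient » : la RATP fait marche arrière' (length=56)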
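
For newer PHP versions, the same idea can be sketched with mb_convert_encoding() instead of utf8_decode(). This is my own variation, not the answerer's code: undo_double_utf8() is a hypothetical name, it assumes the extra encoding layer came from ISO-8859-1, and it adds a round-trip check so that characters outside ISO-8859-1 are not silently turned into "?".

```php
<?php
// Hypothetical helper: removes one extra UTF-8 layer when the input looks
// double-encoded, and returns the input unchanged otherwise.
function undo_double_utf8(string $str): string
{
    // Input that is not valid UTF-8 cannot be double-encoded UTF-8.
    if (!mb_check_encoding($str, 'UTF-8')) {
        return $str;
    }
    // Reverse one layer, ASSUMING the extra layer was ISO-8859-1 -> UTF-8.
    $decoded = mb_convert_encoding($str, 'ISO-8859-1', 'UTF-8');
    // Re-encoding must reproduce the input; otherwise characters were lost.
    $lossless = mb_convert_encoding($decoded, 'UTF-8', 'ISO-8859-1') === $str;
    // Accept the decoded form only if it is itself valid UTF-8.
    return ($lossless && mb_check_encoding($decoded, 'UTF-8')) ? $decoded : $str;
}

// "é" encoded twice (bytes C3 83 C2 A9) comes back as a single "é" (C3 A9).
var_dump(undo_double_utf8("\xC3\x83\xC2\xA9") === "\xC3\xA9"); // bool(true)
// Correctly encoded input is left alone.
var_dump(undo_double_utf8("\xC3\xA9") === "\xC3\xA9");         // bool(true)
```

This is still a heuristic: a string that legitimately contains a sequence such as "Ã©" is indistinguishable from a double-encoded "é", so it would be "repaired" as well.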

Answer 4

Answered by Álvaro González

Encoding autodetection is not bulletproof, but you can try mb_detect_encoding(). See also mb_check_encoding().
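
Since the question mentions content that is mostly UTF-8 with some parts that are not, one way to apply mb_check_encoding() is line by line. The sketch below is only an illustration: normalize_lines_to_utf8() is a hypothetical helper, and it assumes the stray parts are ISO-8859-1 and that they fall on line boundaries.

```php
<?php
// Hypothetical helper: converts only the lines that are not valid UTF-8.
// ASSUMPTION: invalid lines are ISO-8859-1; pass another $fallback if not.
function normalize_lines_to_utf8(string $text, string $fallback = 'ISO-8859-1'): string
{
    // Splitting on "\n" is safe here: the byte 0x0A never occurs inside
    // a multi-byte UTF-8 sequence.
    $lines = explode("\n", $text);
    foreach ($lines as $i => $line) {
        if (!mb_check_encoding($line, 'UTF-8')) {
            $lines[$i] = mb_convert_encoding($line, 'UTF-8', $fallback);
        }
    }
    return implode("\n", $lines);
}

$mixed = "caf\xC3\xA9\ncaf\xE9"; // line 1 is UTF-8, line 2 is ISO-8859-1
var_dump(normalize_lines_to_utf8($mixed) === "caf\xC3\xA9\ncaf\xC3\xA9"); // bool(true)
```

A single line that mixes both encodings would be converted as a whole, which mangles its valid UTF-8 part, so a finer split (per field or per XML element) may be needed for real feeds.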

Answer 5

Answered by Femaref

The feed (I guess you mean some kind of XML-based feed) should have an attribute in the header telling you what the encoding is. If not, you are out of luck as you don't have a reliable means of identifying the encoding.

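
For an XML feed, that attribute is the encoding pseudo-attribute of the XML declaration on the first line. A small sketch of reading it without parsing the whole document; xml_declared_encoding() is a hypothetical helper, and it does not handle a leading byte order mark.

```php
<?php
// Hypothetical helper: returns the encoding named in the XML declaration,
// upper-cased, or null when the document does not declare one.
function xml_declared_encoding(string $xml): ?string
{
    // [^>]*? keeps the search inside the <?xml ... ?> declaration itself.
    $pattern = '/^\s*<\?xml\b[^>]*?\bencoding\s*=\s*(["\'])([A-Za-z][A-Za-z0-9._-]*)\1/';
    return preg_match($pattern, $xml, $m) === 1 ? strtoupper($m[2]) : null;
}

$feed = '<?xml version="1.0" encoding="ISO-8859-1"?><rss></rss>';
var_dump(xml_declared_encoding($feed)); // string(10) "ISO-8859-1"
```

When no encoding is declared and there is no byte order mark, XML defaults to UTF-8, so a null result can reasonably be read as "treat it as UTF-8 and validate".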
