Original question: http://stackoverflow.com/questions/6559822/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors on Stack Overflow.
How to convert any character encoding to UTF8 on PHP
Asked by rafaschutz
I'm working on a web crawler that grabs data from sites all over the world and has to deal with many different languages and encodings.
Currently I'm using the following function, and it works in 99% of the cases. But there is this 1% that is giving me headaches.
function convertEncoding($str) {
    return iconv(mb_detect_encoding($str), "UTF-8", $str);
}
Answered by sagi
Rather than blindly trying to detect the encoding, first check whether the page you downloaded declares a character set. It may be set in the HTTP response header, for example:
Content-Type: text/html; charset=utf-8
Or in the HTML as a meta tag, for example:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
Only if neither is available should you try to guess the encoding with mb_detect_encoding() or other methods.
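The lookup order described above could be sketched roughly as follows. This is a minimal illustration, not a full implementation of browser-style charset sniffing: the function name findDeclaredCharset(), its parameters, and the 4096-byte scan limit are assumptions made for this example.

```php
<?php
// Hypothetical helper: returns the charset a page declares, or null if none is found.
// $contentType is the raw Content-Type header value (null if the header was absent);
// $html is the raw, not yet converted, response body.
function findDeclaredCharset(?string $contentType, string $html): ?string
{
    // 1. HTTP response header, e.g. "text/html; charset=utf-8".
    if ($contentType !== null
        && preg_match('/charset\s*=\s*["\']?([\w.:-]+)/i', $contentType, $m)) {
        return strtoupper($m[1]);
    }

    // 2. <meta charset="..."> or <meta http-equiv="Content-Type" content="...; charset=...">.
    // Only the start of the document is scanned (4096 bytes is an arbitrary limit).
    $head = substr($html, 0, 4096);
    if (preg_match('/<meta[^>]+charset\s*=\s*["\']?([\w.:-]+)/i', $head, $m)) {
        return strtoupper($m[1]);
    }

    // 3. Nothing declared: the caller has to fall back to guessing.
    return null;
}
```

A crawler would call this with the header it received and the downloaded body, and only reach for mb_detect_encoding() when the result is null.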
Answered by Emre Yazici
It's not possible to detect the character set of a string with 100% accuracy, since some character sets are subsets of others. Try to set the character set explicitly where possible, and avoid mixing iconv and mbstring functions. I recommend using a function like this and supplying the $from charset whenever possible:
function convertEncoding($str, $from = 'auto', $to = "UTF-8") {
    if ($from == 'auto') $from = mb_detect_encoding($str);
    return mb_convert_encoding($str, $to, $from);
}
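One possible variation on the function above, shown only as a sketch: when no source charset is supplied, it first checks whether the bytes are already valid UTF-8 and otherwise falls back to a fixed default instead of guessing. The name convertToUtf8() and the $fallback parameter (ISO-8859-1 here) are assumptions for illustration, not part of the answer.

```php
<?php
// Sketch: prefer an explicitly supplied charset; guess as little as possible.
// Uses mbstring only, so iconv and mbstring are not mixed.
function convertToUtf8(string $str, ?string $from = null, string $fallback = 'ISO-8859-1'): string
{
    if ($from === null) {
        // Valid UTF-8 (which includes plain ASCII) is returned untouched.
        if (mb_check_encoding($str, 'UTF-8')) {
            return $str;
        }
        // Not UTF-8 and nothing declared: use the assumed default charset.
        $from = $fallback;
    }

    return mb_convert_encoding($str, 'UTF-8', $from);
}

// The byte 0xE9 is "é" in ISO-8859-1; in UTF-8 it becomes the two bytes 0xC3 0xA9.
$converted = convertToUtf8("caf\xE9", 'ISO-8859-1'); // "caf\xC3\xA9"
```

The fallback is a deliberate simplification: a crawler that mostly visits, say, Japanese or Russian sites would want a different default.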
Answered by Kulin Choksi
You can try utf8_encode($str) (note that it assumes the input is ISO-8859-1).
http://www.php.net/manual/en/function.utf8-encode.php#89789
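As a rough illustration of what a Latin-1 to UTF-8 conversion does, and why it should only be applied to text that really is ISO-8859-1, the sketch below uses mb_convert_encoding() with an explicit source charset (utf8_encode() itself is deprecated as of PHP 8.2). The sample strings are made up for the example.

```php
<?php
// "café" encoded as ISO-8859-1: the "é" is the single byte 0xE9.
$latin1 = "caf\xE9";

// Correct use: Latin-1 input becomes UTF-8 ("é" is now 0xC3 0xA9).
$utf8 = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');
echo bin2hex($utf8), "\n";   // 636166c3a9

// Incorrect use: the input is already UTF-8, but is treated as Latin-1 again.
// Each byte of "é" is re-encoded separately, which yields mojibake ("cafÃ©").
$twice = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');
echo bin2hex($twice), "\n";  // 636166c383c2a9
```

This is why a blanket conversion is risky for a crawler: pages that are already UTF-8 end up double-encoded.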
Or you can replace the content type meta tag in the head of the crawled content with:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
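A regex-based sketch of that replacement is shown below. It assumes the markup has already been converted to UTF-8, so that the rewritten declaration matches the actual bytes. The helper name declareUtf8() is hypothetical, and a regular expression is only an approximation of real HTML parsing.

```php
<?php
// Hypothetical helper: rewrites any charset declared in a <meta> tag to utf-8.
// Assumes $html itself has already been converted to UTF-8.
function declareUtf8(string $html): string
{
    $result = preg_replace(
        '/(<meta[^>]+charset\s*=\s*["\']?)[\w.:-]+/i',
        '${1}utf-8',
        $html
    );

    // preg_replace() returns null on error; keep the input unchanged in that case.
    return $result ?? $html;
}

echo declareUtf8('<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />'), "\n";
// <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
```

Keeping the declaration in sync with the converted bytes matters mainly when the stored HTML is later handed to a parser that honours the meta tag.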