How to convert any character encoding to UTF-8 in PHP

Notice: this page reproduces a popular StackOverflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/6559822/

Time: 2020-08-26 00:43:39  Source: igfitidea

Tags: php, encoding, utf-8

Asked by rafaschutz

I'm working on a web crawler that grabs data from sites all over the world, and is dealing with distinct languages and encodings.

Currently I'm using the following function, and it works in 99% of the cases. But there is this 1% that is giving me headaches.

function convertEncoding($str) {
    return iconv(mb_detect_encoding($str), "UTF-8", $str);
}

Answered by sagi

Rather than blindly trying to detect the encoding, you should first check if the page that you downloaded has a listed character set. The character set may be set in the HTTP response header, for example:

Content-Type: text/html; charset=utf-8

Or in the HTML as a meta tag, for example:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> 

Only if neither is available should you try to guess the encoding with mb_detect_encoding() or other methods.
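
The lookup order described above (HTTP header first, then the meta tag) could be sketched roughly as follows. The function name findDeclaredCharset and the regular expressions are illustrative assumptions, not something from the answer, and a regex is only a rough stand-in for a real HTML parser.

```php
<?php
// Hypothetical helper: returns the charset declared by the server or the page,
// upper-cased, or null when neither declares one.
function findDeclaredCharset(?string $contentTypeHeader, string $html): ?string
{
    // 1. HTTP response header value, e.g. "text/html; charset=utf-8".
    if ($contentTypeHeader !== null
        && preg_match('/charset\s*=\s*["\']?([\w.:-]+)/i', $contentTypeHeader, $m)) {
        return strtoupper($m[1]);
    }
    // 2. <meta> tag: matches both the http-equiv="Content-Type" form
    //    and the shorter <meta charset="..."> form.
    if (preg_match('/<meta[^>]+charset\s*=\s*["\']?([\w.:-]+)/i', $html, $m)) {
        return strtoupper($m[1]);
    }
    // 3. Nothing declared: the caller has to fall back to guessing.
    return null;
}
```

When this returns a name, it can be passed as the source encoding to the conversion function; only a null result would call for mb_detect_encoding(). A declared name is not guaranteed to be one that mbstring or iconv recognizes, so the conversion call may still need its own error handling.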

Answered by Emre Yazici

It's not possible to detect the character set of a string with 100% accuracy, since some character sets are subsets of others. Try to set the character set explicitly whenever possible, and avoid mixing iconv and mbstring functions. I recommend using a function like this and supplying the source charset ($from) whenever possible:

function convertEncoding($str, $from = 'auto', $to = 'UTF-8') {
    // Detect the source encoding only when the caller did not supply one.
    if ($from === 'auto') {
        $from = mb_detect_encoding($str);
    }
    return mb_convert_encoding($str, $to, $from);
}
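
As a variation on the function above, here is a minimal sketch that avoids open-ended detection altogether: a supplied charset always wins, input that is already valid UTF-8 is left alone, and anything else is assumed to be in a fallback encoding. The name convertEncodingOrFallback and the ISO-8859-1 default are assumptions for illustration; a crawler would pick the fallback per site or per language.

```php
<?php
// Sketch only: the function name and the 'ISO-8859-1' default are assumptions.
function convertEncodingOrFallback(string $str, ?string $from = null, string $fallback = 'ISO-8859-1'): string
{
    if ($from !== null) {
        // An explicitly supplied charset always wins over guessing.
        return mb_convert_encoding($str, 'UTF-8', $from);
    }
    if (mb_check_encoding($str, 'UTF-8')) {
        // Already valid UTF-8 (this includes plain ASCII): leave it untouched.
        return $str;
    }
    // Not valid UTF-8 and nothing declared: assume the fallback encoding.
    return mb_convert_encoding($str, 'UTF-8', $fallback);
}
```

The trade-off is that text in some other single-byte encoding is silently treated as the fallback, which is why passing $from is still the better option whenever the page declares a charset.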

Answered by Kulin Choksi

You can try utf8_encode($str), which converts an ISO-8859-1 string to UTF-8.

http://www.php.net/manual/en/function.utf8-encode.php#89789

Or you can replace the content type meta tag in the header of the crawled content with

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
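
A rough sketch of that replacement, assuming a regex is acceptable for the job; the function name declareUtf8 is hypothetical. Rewriting the tag only changes what the page declares, not its bytes, so this would normally be applied after the content itself has been converted to UTF-8.

```php
<?php
// Hypothetical helper: rewrites the charset named in <meta> tags to utf-8.
function declareUtf8(string $html): string
{
    // Capture everything up to and including "charset=" inside a <meta ...> tag,
    // then swap only the charset name that follows it.
    $result = preg_replace(
        '/(<meta[^>]+charset\s*=\s*["\']?)[\w.:-]+/i',
        '${1}utf-8',
        $html
    );
    // preg_replace() returns null on a regex error; keep the input in that case.
    return $result ?? $html;
}
```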
