PHP: transliterate any convertible UTF-8 character into its ASCII equivalent
Original question: http://stackoverflow.com/questions/13614622/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Transliterate any convertible utf8 char into ascii equivalent
Asked by Ivan Hu?njak
Is there any good solution out there that does this transliteration in a good manner?
I've tried using iconv(), but it is very annoying and it does not behave as one might expect.
- Using //TRANSLIT will try to replace what it can, leaving everything nonconvertible as "?"
- Using //IGNORE will not leave "?" in the text, but it will also not transliterate, and it raises E_NOTICE when a nonconvertible char is found, so you have to use iconv with the @ error suppressor
- Using //IGNORE//TRANSLIT (as some people suggested on the PHP forum) is actually the same as //IGNORE (tried it myself on PHP versions 5.3.2 and 5.3.13)
- Also, using //TRANSLIT//IGNORE is the same as //TRANSLIT
It also uses current locale settings to transliterate.
WARNING - a lot of text and code follows!
Here are some examples:
$text = 'Regular ascii text + ????? + ??ü? + é?ě??? + ?? + $ + ? + @';
echo '<br />original: ' . $text;
echo '<br />regular: ' . iconv("UTF-8", "ASCII//TRANSLIT", $text);
//> regular: Regular ascii text + ????? + ???ss + ?????? + ae?EUR + $ + ? + @
setlocale(LC_ALL, 'en_GB');
echo '<br />en_GB: ' . iconv("UTF-8", "ASCII//TRANSLIT", $text);
//> en_GB: Regular ascii text + cczs? + aouss + eeeeee + ae?EUR + $ + ? + @
setlocale(LC_ALL, 'en_GB.UTF8'); // will this work?
echo '<br />en_GB.UTF8: ' . iconv("UTF-8", "ASCII//TRANSLIT", $text);
//> en_GB.UTF8: Regular ascii text + cczs? + aouss + eeeeee + ae?EUR + $ + ? + @
OK, that did convert most of the accented characters (ü, é, ě and the rest), but why not đ and ø?
// now specific locales
setlocale(LC_ALL, 'hr_Hr'); // this should fix croatian đ, right?
echo '<br />hr_Hr: ' . iconv("UTF-8", "ASCII//TRANSLIT", $text);
// wrong > hr_Hr: Regular ascii text + cczs? + aouss + eeeeee + ae?EUR + $ + ? + @
setlocale(LC_ALL, 'sv_SE'); // so this will fix swedish ø?
echo '<br />sv_SE: ' . iconv("UTF-8", "ASCII//TRANSLIT", $text);
// will not > sv_SE: Regular ascii text + cczs? + aouss + eeeeee + ae?EUR + $ + ? + @
//this is interesting
setlocale(LC_ALL, 'de_DE');
echo '<br />de_DE: ' . iconv("UTF-8", "ASCII//TRANSLIT", $text);
//> de_DE: Regular ascii text + cczs? + aeoeuess + eeeeee + ae?EUR + $ + ? + @
// actually this is what any german would expect since ä ö ü really is same as ae oe ue
Let's try with //IGNORE:
echo '<br />ignore: ' . iconv("UTF-8", "ASCII//IGNORE", $text);
//> ignore: Regular ascii text + + + + + $ + + @
//+ E_NOTICE: "Notice: iconv(): Detected an illegal character in input string in /var/www/test.server.web/index.php on line 49"
// with translit?
echo '<br />ignore/translit: ' . iconv("UTF-8", "ASCII//IGNORE//TRANSLIT", $text);
//same as ignore only> ignore/translit: Regular ascii text + + + + + $ + + @
//+ E_NOTICE: "Notice: iconv(): Detected an illegal character in input string in /var/www/test.server.web/index.php on line 54"
// translit/ignore?
echo '<br />translit/ignore: ' . iconv("UTF-8", "ASCII//TRANSLIT//IGNORE", $text);
//same as translit only> translit/ignore: Regular ascii text + cczs? + aouss + eeeeee + ae?EUR + $ + ? + @
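The experiments above suggest that the result depends on whichever locale happens to be active when iconv() is called. A minimal sketch of a wrapper that makes that dependency explicit could look like the following. The function name translitWithLocale and the returned array shape are made up for illustration, and it assumes the locale-sensitive behaviour seen in the question (typical of glibc iconv); other iconv implementations may behave differently.

```php
<?php
// Hypothetical helper (not part of PHP): run iconv //TRANSLIT under a chosen
// LC_CTYPE locale and restore the previous locale afterwards.
function translitWithLocale($text, $locale)
{
    // Passing "0" only queries the current setting, it does not change it.
    $previous = setlocale(LC_CTYPE, '0');

    // setlocale() returns false when the requested locale is not installed.
    $applied = setlocale(LC_CTYPE, $locale);

    // @ hides the E_NOTICE that iconv raises for nonconvertible input.
    $result = @iconv('UTF-8', 'ASCII//TRANSLIT', $text);

    if ($previous !== false) {
        setlocale(LC_CTYPE, $previous);
    }

    return array('locale' => $applied, 'text' => $result);
}
```

Checking the 'locale' entry for false tells you whether the locale was actually applied, which is easy to miss when calling setlocale() and iconv() back to back as in the examples above.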
Using this guy's solution also does not work as wanted: Regular ascii text + YYYYY + aous + eYYYeY + aoY + $ + ? + @
Even using the PECL intl Normalizer class (which is not always available even if you have PHP > 5.3.0, since the ICU package that intl uses may not be available to PHP, e.g. on certain hosting servers) produces the wrong result:
echo '<br />normalize: ' .preg_replace('/\p{Mn}/u', '', Normalizer::normalize($text, Normalizer::FORM_KD));
//>normalize: Regular ascii text + cczs? + aou? + eeeeee + ?? + $ + ? + @
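A likely reason the Normalizer attempt leaves some characters behind: NFKD only splits characters that have a Unicode decomposition. Letters such as đ, ø, æ and ß have none, so there is no combining mark for the regex to strip. Here is the same idea as a small sketch with a guard for servers where intl is missing. The name stripCombiningMarks is hypothetical, and falling back to the unchanged input is just one possible choice.

```php
<?php
// Hypothetical helper: decompose with NFKD, then drop combining marks.
// Returns the input unchanged when intl (Normalizer) is not available.
function stripCombiningMarks($text)
{
    if (!class_exists('Normalizer')) {
        return $text; // assumption: no intl means "leave the text alone"
    }

    $decomposed = Normalizer::normalize($text, Normalizer::FORM_KD);
    if ($decomposed === false) {
        return $text; // input was not valid UTF-8
    }

    // \p{Mn} matches nonspacing marks such as the combining acute accent.
    return preg_replace('/\p{Mn}/u', '', $decomposed);
}
```

Characters without a decomposition pass through untouched, so this step still needs a lookup table or iconv afterwards if pure ASCII is required.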
So is there any other way of doing this right, or is the only proper thing to do to use preg_replace() or str_replace() and define the transliteration tables yourself?
// appendix: I found a 2008 debate on the ZF wiki about a proposal for Zend_Filter_Transliterate, but the project was dropped since for some languages conversion is not possible (e.g. Chinese). Still, for any Latin- and Cyrillic-based language, IMO this option should exist.
Accepted answer by Nicolas Grekas
The toAscii() function of Patchwork\Utf8 does exactly this, see:
https://github.com/nicolas-grekas/Patchwork-UTF8/blob/master/src/Patchwork/Utf8.php
It leverages iconv and intl's Normalizer to remove accents, split ligatures and do many other generic transliterations.
Answer by Alain Tiemblo
On this website, I found something that might help you:
function removeAccents($str)
{
$a = array('à', 'á', '?', '?', '?', '?', '?', '?', 'è', 'é', 'ê', '?', 'ì', 'í', '?', '?', 'D', '?', 'ò', 'ó', '?', '?', '?', '?', 'ù', 'ú', '?', 'ü', 'Y', '?', 'à', 'á', 'a', '?', '?', '?', '?', '?', 'è', 'é', 'ê', '?', 'ì', 'í', '?', '?', '?', 'ò', 'ó', '?', '?', '?', '?', 'ù', 'ú', '?', 'ü', 'y', '?', 'ā', 'ā', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', 'ē', 'ē', '?', '?', '?', '?', '?', '?', 'ě', 'ě', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', 'ī', 'ī', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', 'ń', '?', '?', '?', 'ň', '?', 'ō', 'ō', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', 'ū', 'ū', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', 'ǎ', 'ǎ', 'ǐ', 'ǐ', 'ǒ', 'ǒ', 'ǔ', 'ǔ', 'ǖ', 'ǖ', 'ǘ', 'ǘ', 'ǚ', 'ǚ', 'ǜ', 'ǜ', '?', '?', '?', '?', '?', '?');
$b = array('A', 'A', 'A', 'A', 'A', 'A', 'AE', 'C', 'E', 'E', 'E', 'E', 'I', 'I', 'I', 'I', 'D', 'N', 'O', 'O', 'O', 'O', 'O', 'O', 'U', 'U', 'U', 'U', 'Y', 's', 'a', 'a', 'a', 'a', 'a', 'a', 'ae', 'c', 'e', 'e', 'e', 'e', 'i', 'i', 'i', 'i', 'n', 'o', 'o', 'o', 'o', 'o', 'o', 'u', 'u', 'u', 'u', 'y', 'y', 'A', 'a', 'A', 'a', 'A', 'a', 'C', 'c', 'C', 'c', 'C', 'c', 'C', 'c', 'D', 'd', 'D', 'd', 'E', 'e', 'E', 'e', 'E', 'e', 'E', 'e', 'E', 'e', 'G', 'g', 'G', 'g', 'G', 'g', 'G', 'g', 'H', 'h', 'H', 'h', 'I', 'i', 'I', 'i', 'I', 'i', 'I', 'i', 'I', 'i', 'IJ', 'ij', 'J', 'j', 'K', 'k', 'L', 'l', 'L', 'l', 'L', 'l', 'L', 'l', 'l', 'l', 'N', 'n', 'N', 'n', 'N', 'n', 'n', 'O', 'o', 'O', 'o', 'O', 'o', 'OE', 'oe', 'R', 'r', 'R', 'r', 'R', 'r', 'S', 's', 'S', 's', 'S', 's', 'S', 's', 'T', 't', 'T', 't', 'T', 't', 'U', 'u', 'U', 'u', 'U', 'u', 'U', 'u', 'U', 'u', 'U', 'u', 'W', 'w', 'Y', 'y', 'Y', 'Z', 'z', 'Z', 'z', 'Z', 'z', 's', 'f', 'O', 'o', 'U', 'u', 'A', 'a', 'I', 'i', 'O', 'o', 'U', 'u', 'U', 'u', 'U', 'u', 'U', 'u', 'U', 'u', 'A', 'a', 'AE', 'ae', 'O', 'o');
return str_replace($a, $b, $str);
}
Usage example:
$text = 'Regular ascii text + ????? + ??ü? + é?ě??? + ?? + $ + ? + @';
echo removeAccents($text);
Displays:
Regular ascii text + cczsd + aous + eeeee? + aeo + $ + ? + @
You'll need to improve it, but you get the idea... If there is a direct way to do this, I don't know of it.
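One possible improvement to the table approach, as a sketch: keep the table as a single associative array and apply it with strtr() instead of two parallel arrays passed to str_replace(). With parallel arrays it is easy for the two lists to drift out of step, and str_replace() applies its replacements one after another, so an earlier replacement can be rewritten by a later one. strtr() with an array makes a single pass. The function name translitTable and the handful of entries below are only illustrative, and the source file is assumed to be saved as UTF-8.

```php
<?php
// Hypothetical helper: table-driven transliteration for characters that
// neither iconv nor NFKD handled in the examples above.
function translitTable($text)
{
    // Illustrative subset only; extend it for the languages you need.
    static $map = array(
        'đ' => 'd',  'Đ' => 'D',
        'ø' => 'o',  'Ø' => 'O',
        'æ' => 'ae', 'Æ' => 'AE',
        'ß' => 'ss',
        '€' => 'EUR',
    );

    // strtr() with an array scans the string once and never re-replaces
    // text that it has already substituted.
    return strtr($text, $map);
}

echo translitTable('đ + æø€ + ß'); // d + aeoEUR + ss
```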
Answer by user3914203
As none of the solutions above worked for me (I needed to transliterate many European character sets to ASCII), I finally found this old PECL package, which just seemed to work: http://derickrethans.nl/projects.html#translit. I had problems especially with Cyrillic character sets, and this seems to handle them perfectly.
Answer by Luke Madhanga
If I have understood you correctly, I may have an answer for you: I've written a basic PHP class that allows you to convert most characters into their ASCII equivalents.
Below is a screenshot of its output converting various composer names with accents in their name.
You can fork it from GitHub here: https://github.com/LukeMadhanga/transliterator.
NB: It is as yet undocumented, but it should be p*** easy to get to grips with.
Answer by Alex
I think setting the right locale is the way to go. Be aware that the specific locale must also be available on the system; check it using locale -a. If you only have de_DE.utf8, then you also have to use setlocale(LC_ALL, 'de_DE.utf8').
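Since the exact spelling of an installed locale varies between systems, one way to avoid guessing is to hand setlocale() a list of candidates and see which one it accepts. This is a minimal sketch: the name firstAvailableLocale is hypothetical, and it uses LC_CTYPE rather than LC_ALL on the assumption that only character handling matters for transliteration.

```php
<?php
// Hypothetical helper: return the first locale from $candidates that the
// system accepts, or false if none is installed. The previously active
// locale is restored before returning.
function firstAvailableLocale(array $candidates)
{
    // "0" queries the current locale without changing it.
    $previous = setlocale(LC_CTYPE, '0');

    // Given an array, setlocale() tries each entry until one succeeds.
    $chosen = setlocale(LC_CTYPE, $candidates);

    if ($previous !== false) {
        setlocale(LC_CTYPE, $previous);
    }

    return $chosen;
}

// Example: var_dump(firstAvailableLocale(array('de_DE.utf8', 'de_DE.UTF-8', 'de_DE')));
```

A false result means none of the spellings is installed, which matches the advice above to check locale -a first.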


