PHP: remove diacritics from a string
Notice: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): Stack Overflow
Original: http://stackoverflow.com/questions/3635511/
Remove diacritics from a string
Asked by Richard Knop
Is it possible? This is my input string:
ľ š č ť ž ý á í é Č Á Ž Ý
This is the output I want:
l s c t z y a i e C A Z Y
Answered by gabo
If you have the intl extension (http://php.net/manual/en/book.intl.php) available, you can use this:
$string = "Fóø Bår";
$transliterator = Transliterator::createFromRules(':: Any-Latin; :: Latin-ASCII; :: NFD; :: [:Nonspacing Mark:] Remove; :: Lower(); :: NFC;', Transliterator::FORWARD);
echo $normalized = $transliterator->transliterate($string);
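If the lowercasing step is not wanted, a shorter rule chain may be enough. The sketch below is an illustration rather than part of the answer above: it assumes the intl extension is loaded, the helper name strip_diacritics_intl is hypothetical, and it returns the input unchanged when intl (or the needed ICU transforms) is missing.

```php
<?php
// Hypothetical helper (not from the original answer): a case-preserving variant.
function strip_diacritics_intl(string $text): string
{
    // Without the intl extension there is no Transliterator class; return the input untouched.
    if (!class_exists('Transliterator')) {
        return $text;
    }
    // Any-Latin converts other scripts to Latin; Latin-ASCII then folds accented letters to ASCII.
    $transliterator = Transliterator::create('Any-Latin; Latin-ASCII');
    if ($transliterator === null) {
        return $text; // assumption: this ICU build lacks the requested transforms
    }
    $result = $transliterator->transliterate($text);
    return $result === false ? $text : $result;
}

echo strip_diacritics_intl("ľ š č ť ž ý á í é Č Á Ž Ý"), "\n";
```

With intl available, this should print the target string from the question, with upper and lower case preserved.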
Answered by shamittomar
WordPress uses a function for this that works nicely. Here's the working code with output.
<?php
function seems_utf8($str)
{
$length = strlen($str);
for ($i=0; $i < $length; $i++) {
$c = ord($str[$i]);
if ($c < 0x80) $n = 0; # 0bbbbbbb
elseif (($c & 0xE0) == 0xC0) $n=1; # 110bbbbb
elseif (($c & 0xF0) == 0xE0) $n=2; # 1110bbbb
elseif (($c & 0xF8) == 0xF0) $n=3; # 11110bbb
elseif (($c & 0xFC) == 0xF8) $n=4; # 111110bb
elseif (($c & 0xFE) == 0xFC) $n=5; # 1111110b
else return false; # Does not match any model
for ($j=0; $j<$n; $j++) { # n bytes matching 10bbbbbb follow ?
if ((++$i == $length) || ((ord($str[$i]) & 0xC0) != 0x80))
return false;
}
}
return true;
}
/**
* Converts all accent characters to ASCII characters.
*
* If there are no accent characters, then the string given is just returned.
*
* @param string $string Text that might have accent characters
* @return string Filtered string with replaced "nice" characters.
*/
function remove_accents($string) {
if ( !preg_match('/[\x80-\xff]/', $string) )
return $string;
if (seems_utf8($string)) {
$chars = array(
// Decompositions for Latin-1 Supplement
chr(195).chr(128) => 'A', chr(195).chr(129) => 'A',
chr(195).chr(130) => 'A', chr(195).chr(131) => 'A',
chr(195).chr(132) => 'A', chr(195).chr(133) => 'A',
chr(195).chr(135) => 'C', chr(195).chr(136) => 'E',
chr(195).chr(137) => 'E', chr(195).chr(138) => 'E',
chr(195).chr(139) => 'E', chr(195).chr(140) => 'I',
chr(195).chr(141) => 'I', chr(195).chr(142) => 'I',
chr(195).chr(143) => 'I', chr(195).chr(145) => 'N',
chr(195).chr(146) => 'O', chr(195).chr(147) => 'O',
chr(195).chr(148) => 'O', chr(195).chr(149) => 'O',
chr(195).chr(150) => 'O', chr(195).chr(153) => 'U',
chr(195).chr(154) => 'U', chr(195).chr(155) => 'U',
chr(195).chr(156) => 'U', chr(195).chr(157) => 'Y',
chr(195).chr(159) => 's', chr(195).chr(160) => 'a',
chr(195).chr(161) => 'a', chr(195).chr(162) => 'a',
chr(195).chr(163) => 'a', chr(195).chr(164) => 'a',
chr(195).chr(165) => 'a', chr(195).chr(167) => 'c',
chr(195).chr(168) => 'e', chr(195).chr(169) => 'e',
chr(195).chr(170) => 'e', chr(195).chr(171) => 'e',
chr(195).chr(172) => 'i', chr(195).chr(173) => 'i',
chr(195).chr(174) => 'i', chr(195).chr(175) => 'i',
chr(195).chr(177) => 'n', chr(195).chr(178) => 'o',
chr(195).chr(179) => 'o', chr(195).chr(180) => 'o',
chr(195).chr(181) => 'o', chr(195).chr(182) => 'o',
chr(195).chr(182) => 'o', chr(195).chr(185) => 'u',
chr(195).chr(186) => 'u', chr(195).chr(187) => 'u',
chr(195).chr(188) => 'u', chr(195).chr(189) => 'y',
chr(195).chr(191) => 'y',
// Decompositions for Latin Extended-A
chr(196).chr(128) => 'A', chr(196).chr(129) => 'a',
chr(196).chr(130) => 'A', chr(196).chr(131) => 'a',
chr(196).chr(132) => 'A', chr(196).chr(133) => 'a',
chr(196).chr(134) => 'C', chr(196).chr(135) => 'c',
chr(196).chr(136) => 'C', chr(196).chr(137) => 'c',
chr(196).chr(138) => 'C', chr(196).chr(139) => 'c',
chr(196).chr(140) => 'C', chr(196).chr(141) => 'c',
chr(196).chr(142) => 'D', chr(196).chr(143) => 'd',
chr(196).chr(144) => 'D', chr(196).chr(145) => 'd',
chr(196).chr(146) => 'E', chr(196).chr(147) => 'e',
chr(196).chr(148) => 'E', chr(196).chr(149) => 'e',
chr(196).chr(150) => 'E', chr(196).chr(151) => 'e',
chr(196).chr(152) => 'E', chr(196).chr(153) => 'e',
chr(196).chr(154) => 'E', chr(196).chr(155) => 'e',
chr(196).chr(156) => 'G', chr(196).chr(157) => 'g',
chr(196).chr(158) => 'G', chr(196).chr(159) => 'g',
chr(196).chr(160) => 'G', chr(196).chr(161) => 'g',
chr(196).chr(162) => 'G', chr(196).chr(163) => 'g',
chr(196).chr(164) => 'H', chr(196).chr(165) => 'h',
chr(196).chr(166) => 'H', chr(196).chr(167) => 'h',
chr(196).chr(168) => 'I', chr(196).chr(169) => 'i',
chr(196).chr(170) => 'I', chr(196).chr(171) => 'i',
chr(196).chr(172) => 'I', chr(196).chr(173) => 'i',
chr(196).chr(174) => 'I', chr(196).chr(175) => 'i',
chr(196).chr(176) => 'I', chr(196).chr(177) => 'i',
chr(196).chr(178) => 'IJ',chr(196).chr(179) => 'ij',
chr(196).chr(180) => 'J', chr(196).chr(181) => 'j',
chr(196).chr(182) => 'K', chr(196).chr(183) => 'k',
chr(196).chr(184) => 'k', chr(196).chr(185) => 'L',
chr(196).chr(186) => 'l', chr(196).chr(187) => 'L',
chr(196).chr(188) => 'l', chr(196).chr(189) => 'L',
chr(196).chr(190) => 'l', chr(196).chr(191) => 'L',
chr(197).chr(128) => 'l', chr(197).chr(129) => 'L',
chr(197).chr(130) => 'l', chr(197).chr(131) => 'N',
chr(197).chr(132) => 'n', chr(197).chr(133) => 'N',
chr(197).chr(134) => 'n', chr(197).chr(135) => 'N',
chr(197).chr(136) => 'n', chr(197).chr(137) => 'N',
chr(197).chr(138) => 'n', chr(197).chr(139) => 'N',
chr(197).chr(140) => 'O', chr(197).chr(141) => 'o',
chr(197).chr(142) => 'O', chr(197).chr(143) => 'o',
chr(197).chr(144) => 'O', chr(197).chr(145) => 'o',
chr(197).chr(146) => 'OE',chr(197).chr(147) => 'oe',
chr(197).chr(148) => 'R',chr(197).chr(149) => 'r',
chr(197).chr(150) => 'R',chr(197).chr(151) => 'r',
chr(197).chr(152) => 'R',chr(197).chr(153) => 'r',
chr(197).chr(154) => 'S',chr(197).chr(155) => 's',
chr(197).chr(156) => 'S',chr(197).chr(157) => 's',
chr(197).chr(158) => 'S',chr(197).chr(159) => 's',
chr(197).chr(160) => 'S', chr(197).chr(161) => 's',
chr(197).chr(162) => 'T', chr(197).chr(163) => 't',
chr(197).chr(164) => 'T', chr(197).chr(165) => 't',
chr(197).chr(166) => 'T', chr(197).chr(167) => 't',
chr(197).chr(168) => 'U', chr(197).chr(169) => 'u',
chr(197).chr(170) => 'U', chr(197).chr(171) => 'u',
chr(197).chr(172) => 'U', chr(197).chr(173) => 'u',
chr(197).chr(174) => 'U', chr(197).chr(175) => 'u',
chr(197).chr(176) => 'U', chr(197).chr(177) => 'u',
chr(197).chr(178) => 'U', chr(197).chr(179) => 'u',
chr(197).chr(180) => 'W', chr(197).chr(181) => 'w',
chr(197).chr(182) => 'Y', chr(197).chr(183) => 'y',
chr(197).chr(184) => 'Y', chr(197).chr(185) => 'Z',
chr(197).chr(186) => 'z', chr(197).chr(187) => 'Z',
chr(197).chr(188) => 'z', chr(197).chr(189) => 'Z',
chr(197).chr(190) => 'z', chr(197).chr(191) => 's',
// Euro Sign
chr(226).chr(130).chr(172) => 'E',
// GBP (Pound) Sign
chr(194).chr(163) => '');
$string = strtr($string, $chars);
} else {
// Assume ISO-8859-1 if not UTF-8
$chars['in'] = chr(128).chr(131).chr(138).chr(142).chr(154).chr(158)
.chr(159).chr(162).chr(165).chr(181).chr(192).chr(193).chr(194)
.chr(195).chr(196).chr(197).chr(199).chr(200).chr(201).chr(202)
.chr(203).chr(204).chr(205).chr(206).chr(207).chr(209).chr(210)
.chr(211).chr(212).chr(213).chr(214).chr(216).chr(217).chr(218)
.chr(219).chr(220).chr(221).chr(224).chr(225).chr(226).chr(227)
.chr(228).chr(229).chr(231).chr(232).chr(233).chr(234).chr(235)
.chr(236).chr(237).chr(238).chr(239).chr(241).chr(242).chr(243)
.chr(244).chr(245).chr(246).chr(248).chr(249).chr(250).chr(251)
.chr(252).chr(253).chr(255);
$chars['out'] = "EfSZszYcYuAAAAAACEEEEIIIINOOOOOOUUUUYaaaaaaceeeeiiiinoooooouuuuyy";
$string = strtr($string, $chars['in'], $chars['out']);
$double_chars['in'] = array(chr(140), chr(156), chr(198), chr(208), chr(222), chr(223), chr(230), chr(240), chr(254));
$double_chars['out'] = array('OE', 'oe', 'AE', 'DH', 'TH', 'ss', 'ae', 'dh', 'th');
$string = str_replace($double_chars['in'], $double_chars['out'], $string);
}
return $string;
}
$str = "ľ š č ť ž ý á í é Č Á Ž Ý";
echo remove_accents($str); // Output: l s c t z y a i e C A Z Y
?>
Answered by raugfer
preg_replace('/&([a-z]{1,2})(acute|cedil|circ|grave|lig|orn|ring|slash|th|tilde|uml|caron);/i','$1',htmlentities($value));
Answered by Dominique
$table = array(
' '=>'-', 'Š'=>'S', 'š'=>'s', 'Đ'=>'Dj', 'Ž'=>'Z', 'ž'=>'z', 'Č'=>'C', 'č'=>'c', 'Ć'=>'C', 'ć'=>'c',
'À'=>'A', 'Á'=>'A', 'Â'=>'A', 'Ã'=>'A', 'Ä'=>'A', 'Å'=>'A', 'Æ'=>'A', 'Ç'=>'C', 'È'=>'E', 'É'=>'E',
'Ê'=>'E', 'Ë'=>'E', 'Ì'=>'I', 'Í'=>'I', 'Î'=>'I', 'Ï'=>'I', 'Ñ'=>'N', 'Ò'=>'O', 'Ó'=>'O', 'Ô'=>'O',
'Õ'=>'O', 'Ö'=>'O', 'Ø'=>'O', 'Ù'=>'U', 'Ú'=>'U', 'Û'=>'U', 'Ü'=>'U', 'Ý'=>'Y', 'Þ'=>'B', 'ß'=>'Ss',
'à'=>'a', 'á'=>'a', 'â'=>'a', 'ã'=>'a', 'ä'=>'a', 'å'=>'a', 'æ'=>'a', 'ç'=>'c', 'è'=>'e', 'é'=>'e',
'ê'=>'e', 'ë'=>'e', 'ì'=>'i', 'í'=>'i', 'î'=>'i', 'ï'=>'i', 'ð'=>'o', 'ñ'=>'n', 'ò'=>'o', 'ó'=>'o',
'ô'=>'o', 'õ'=>'o', 'ö'=>'o', 'ø'=>'o', 'ù'=>'u', 'ú'=>'u', 'û'=>'u', 'ý'=>'y', 'þ'=>'b',
'ÿ'=>'y', 'Ŕ'=>'R', 'ŕ'=>'r', "'"=>'-', '"'=>'-'
);
$string = strtr($url, $table);
Answered by Cequiel
Based on raugfer's reply:
// removes diacritics from a string
$regexp = '/&([a-z]{1,2})(acute|cedil|circ|grave|lig|orn|ring|slash|th|tilde|uml|caron);/i';
echo html_entity_decode(preg_replace($regexp, '$1', htmlentities($str)));
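To make the entity trick concrete, here is a small worked sketch. It assumes UTF-8 input and wraps the three steps in a hypothetical function, strip_diacritics_via_entities. The flags and charset are passed explicitly because the defaults of htmlentities() have changed across PHP versions, and the replacement string '$1' keeps the base letter captured from each entity name.

```php
<?php
// Hypothetical wrapper, assuming UTF-8 input.
function strip_diacritics_via_entities(string $text): string
{
    $flags = ENT_QUOTES | ENT_HTML401;
    // Step 1: "é" becomes "&eacute;", "û" becomes "&ucirc;", and so on.
    $encoded = htmlentities($text, $flags, 'UTF-8');
    // Step 2: keep only the base letter(s) captured from each accent entity name.
    $regexp = '/&([a-z]{1,2})(acute|cedil|circ|grave|lig|orn|ring|slash|th|tilde|uml|caron);/i';
    $stripped = preg_replace($regexp, '$1', $encoded);
    // Step 3: turn the remaining entities (such as "&amp;") back into characters.
    return html_entity_decode($stripped, $flags, 'UTF-8');
}

echo strip_diacritics_via_entities("crème brûlée & café"), "\n"; // prints "creme brulee & cafe"
```

One limitation of this approach: letters that have no named entity in the HTML 4.01 table (for example č) are not encoded by htmlentities() and so pass through unchanged.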
Answered by Gian
Answered by Gras Double
You might want to look at CodeIgniter's convert_accented_characters function. It is defined in system/helpers/text_helper.php and uses the array located at application/config/foreign_chars.php.
Here is what it does, in a nutshell:
$foreign_characters = array(
'/ä|æ|ǽ/' => 'ae',
'/ö|œ/' => 'oe',
'/ü/' => 'ue',
// etc.
);
$array_from = array_keys($foreign_characters);
$array_to = array_values($foreign_characters);
$result = preg_replace($array_from, $array_to, $str);
However, it is a bit slow because it runs a series of regex replacements. Maybe I should give iconv and WordPress's remove_accents a try.
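One way to avoid the regex overhead mentioned above is to run the same kind of map through strtr(), which replaces all keys in a single pass over the string. The sketch below is only an illustration: the function name and the tiny map are hypothetical stand-ins for CodeIgniter's full table, and the keys are written as \u{...} escapes (PHP 7+) so the file encoding does not matter.

```php
<?php
// Hypothetical helper; the map is a small illustrative subset, not CodeIgniter's real table.
function convert_accented_characters_fast(string $str): string
{
    $map = [
        "\u{00E4}" => 'ae', // ä
        "\u{00E6}" => 'ae', // æ
        "\u{00F6}" => 'oe', // ö
        "\u{0153}" => 'oe', // œ
        "\u{00FC}" => 'ue', // ü
    ];
    // strtr() with an array scans the string once and never re-replaces its own output.
    return strtr($str, $map);
}

echo convert_accented_characters_fast("M\u{00FC}ller f\u{00E4}hrt sch\u{00F6}n"), "\n"; // prints "Mueller faehrt schoen"
```

Because the keys are plain strings rather than patterns, extending the map does not add extra passes over the input.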
Answered by user3054345
I had the same problem with Romanian characters. If you want to keep the characters as they are and save them into a database such as MySQL, use the HTML special character codes. Otherwise, if your goal is the second string from the question, use this code:
function diac($text)
{
$diac = array('ľ','š','č','ť','ž','ý','á','í','é','Č','Á','Ž','Ý');
$cor = array('l','s','c','t','z','y','a','i','e','C','A','Z','Y');
$text = str_replace($diac,$cor,$text);
return $text;
}
Answered by infralabs
I hope this will be useful for anybody: https://github.com/infralabs/DiacriticsRemovePHP
The full post with usage example and source/result comparison is in this thread: Change foreign characters to normal equivalent