Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me): StackOverFlow. Original URL: http://stackoverflow.com/questions/6837148/

Date: 2020-08-26 01:26:04  Source: igfitidea

Change foreign characters to their roman equivalent

Tags: php, regex, special-characters, string-search

Asked by ThomasReggi

I am using PHP and was wondering whether there is a predefined way to convert foreign characters to their non-foreign alternatives.

Characters such as ê, ë, and é should all become 'e'.
I'm looking for a function that would take a string and return it without the special characters.
Any ideas would be greatly appreciated!

Answered by Edgar Zagórski

After failing to find suitable converters, I created my own collection that suits my needs, including my favorite Cyrillic conversion, which by default has numerous variations.

function transliterateString($txt) {
    // Maps accented Latin and Cyrillic characters to ASCII replacements.
    $transliterationTable = array(
        'á' => 'a', 'Á' => 'A', 'à' => 'a', 'À' => 'A', 'ă' => 'a', 'Ă' => 'A', 'â' => 'a', 'Â' => 'A', 'å' => 'a', 'Å' => 'A', 'ã' => 'a', 'Ã' => 'A', 'ą' => 'a', 'Ą' => 'A', 'ā' => 'a', 'Ā' => 'A',
        'ä' => 'ae', 'Ä' => 'AE', 'æ' => 'ae', 'Æ' => 'AE',
        'ḃ' => 'b', 'Ḃ' => 'B',
        'ć' => 'c', 'Ć' => 'C', 'ĉ' => 'c', 'Ĉ' => 'C', 'č' => 'c', 'Č' => 'C', 'ċ' => 'c', 'Ċ' => 'C', 'ç' => 'c', 'Ç' => 'C',
        'ď' => 'd', 'Ď' => 'D', 'ḋ' => 'd', 'Ḋ' => 'D', 'đ' => 'd', 'Đ' => 'D', 'ð' => 'dh', 'Ð' => 'Dh',
        'é' => 'e', 'É' => 'E', 'è' => 'e', 'È' => 'E', 'ĕ' => 'e', 'Ĕ' => 'E', 'ê' => 'e', 'Ê' => 'E', 'ě' => 'e', 'Ě' => 'E', 'ë' => 'e', 'Ë' => 'E', 'ė' => 'e', 'Ė' => 'E', 'ę' => 'e', 'Ę' => 'E', 'ē' => 'e', 'Ē' => 'E',
        'ḟ' => 'f', 'Ḟ' => 'F', 'ƒ' => 'f', 'Ƒ' => 'F',
        'ğ' => 'g', 'Ğ' => 'G', 'ĝ' => 'g', 'Ĝ' => 'G', 'ġ' => 'g', 'Ġ' => 'G', 'ģ' => 'g', 'Ģ' => 'G',
        'ĥ' => 'h', 'Ĥ' => 'H', 'ħ' => 'h', 'Ħ' => 'H',
        'í' => 'i', 'Í' => 'I', 'ì' => 'i', 'Ì' => 'I', 'î' => 'i', 'Î' => 'I', 'ï' => 'i', 'Ï' => 'I', 'ĩ' => 'i', 'Ĩ' => 'I', 'į' => 'i', 'Į' => 'I', 'ī' => 'i', 'Ī' => 'I',
        'ĵ' => 'j', 'Ĵ' => 'J',
        'ķ' => 'k', 'Ķ' => 'K',
        'ĺ' => 'l', 'Ĺ' => 'L', 'ľ' => 'l', 'Ľ' => 'L', 'ļ' => 'l', 'Ļ' => 'L', 'ł' => 'l', 'Ł' => 'L',
        'ṁ' => 'm', 'Ṁ' => 'M',
        'ń' => 'n', 'Ń' => 'N', 'ň' => 'n', 'Ň' => 'N', 'ñ' => 'n', 'Ñ' => 'N', 'ņ' => 'n', 'Ņ' => 'N',
        'ó' => 'o', 'Ó' => 'O', 'ò' => 'o', 'Ò' => 'O', 'ô' => 'o', 'Ô' => 'O', 'ő' => 'o', 'Ő' => 'O', 'õ' => 'o', 'Õ' => 'O', 'ø' => 'oe', 'Ø' => 'OE', 'ō' => 'o', 'Ō' => 'O', 'ơ' => 'o', 'Ơ' => 'O', 'ö' => 'oe', 'Ö' => 'OE',
        'ṗ' => 'p', 'Ṗ' => 'P',
        'ŕ' => 'r', 'Ŕ' => 'R', 'ř' => 'r', 'Ř' => 'R', 'ŗ' => 'r', 'Ŗ' => 'R',
        'ś' => 's', 'Ś' => 'S', 'ŝ' => 's', 'Ŝ' => 'S', 'š' => 's', 'Š' => 'S', 'ṡ' => 's', 'Ṡ' => 'S', 'ş' => 's', 'Ş' => 'S', 'ș' => 's', 'Ș' => 'S', 'ß' => 'SS',
        'ť' => 't', 'Ť' => 'T', 'ṫ' => 't', 'Ṫ' => 'T', 'ţ' => 't', 'Ţ' => 'T', 'ț' => 't', 'Ț' => 'T', 'ŧ' => 't', 'Ŧ' => 'T',
        'ú' => 'u', 'Ú' => 'U', 'ù' => 'u', 'Ù' => 'U', 'ŭ' => 'u', 'Ŭ' => 'U', 'û' => 'u', 'Û' => 'U', 'ů' => 'u', 'Ů' => 'U', 'ű' => 'u', 'Ű' => 'U', 'ũ' => 'u', 'Ũ' => 'U', 'ų' => 'u', 'Ų' => 'U', 'ū' => 'u', 'Ū' => 'U', 'ư' => 'u', 'Ư' => 'U', 'ü' => 'ue', 'Ü' => 'UE',
        'ẃ' => 'w', 'Ẃ' => 'W', 'ẁ' => 'w', 'Ẁ' => 'W', 'ŵ' => 'w', 'Ŵ' => 'W', 'ẅ' => 'w', 'Ẅ' => 'W',
        'ý' => 'y', 'Ý' => 'Y', 'ỳ' => 'y', 'Ỳ' => 'Y', 'ŷ' => 'y', 'Ŷ' => 'Y', 'ÿ' => 'y', 'Ÿ' => 'Y',
        'ź' => 'z', 'Ź' => 'Z', 'ž' => 'z', 'Ž' => 'Z', 'ż' => 'z', 'Ż' => 'Z',
        'þ' => 'th', 'Þ' => 'Th', 'μ' => 'u',
        'а' => 'a', 'А' => 'a', 'б' => 'b', 'Б' => 'b', 'в' => 'v', 'В' => 'v', 'г' => 'g', 'Г' => 'g', 'д' => 'd', 'Д' => 'd', 'е' => 'e', 'Е' => 'E', 'ё' => 'e', 'Ё' => 'E', 'ж' => 'zh', 'Ж' => 'zh', 'з' => 'z', 'З' => 'z', 'и' => 'i', 'И' => 'i', 'й' => 'j', 'Й' => 'j', 'к' => 'k', 'К' => 'k', 'л' => 'l', 'Л' => 'l', 'м' => 'm', 'М' => 'm', 'н' => 'n', 'Н' => 'n', 'о' => 'o', 'О' => 'o', 'п' => 'p', 'П' => 'p', 'р' => 'r', 'Р' => 'r', 'с' => 's', 'С' => 's', 'т' => 't', 'Т' => 't', 'у' => 'u', 'У' => 'u', 'ф' => 'f', 'Ф' => 'f', 'х' => 'h', 'Х' => 'h', 'ц' => 'c', 'Ц' => 'c', 'ч' => 'ch', 'Ч' => 'ch', 'ш' => 'sh', 'Ш' => 'sh', 'щ' => 'sch', 'Щ' => 'sch', 'ъ' => '', 'Ъ' => '', 'ы' => 'y', 'Ы' => 'y', 'ь' => '', 'Ь' => '', 'э' => 'e', 'Э' => 'e', 'ю' => 'ju', 'Ю' => 'ju', 'я' => 'ja', 'Я' => 'ja'
    );
    // Replace every key in the table with its mapped value.
    return str_replace(array_keys($transliterationTable), array_values($transliterationTable), $txt);
}

Answered by PuReWebDev

My first recommendation is the iconv function, mainly because it is built into PHP and so requires no external or third-party libraries. It is also designed to do precisely what you are trying to accomplish: accept one character set as input and output another, in this case going from UTF-8 to ASCII. Below is an example of how to call this function:

$clean_ascii_output = iconv('UTF-8', 'ASCII//TRANSLIT', $utf8_input);

More information about the specifics of this PHP function can be found here: http://php.net/manual/en/function.iconv.php

Note: The iconv function accepts a single string, so if your data is an array or another structure you will need to iterate over it and pass in each string in turn.
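
As a rough sketch of the note above, the helper below wraps the iconv call and applies it to each element of an array with array_map. The name toAscii and the sample data are assumptions made for illustration, not part of the answer. What //TRANSLIT produces for accented letters depends on the system's iconv implementation and locale, so no exact output is claimed here; the sketch keeps the original string when iconv returns false.

```php
<?php
// Hypothetical helper: transliterate one UTF-8 string to ASCII with iconv.
// The output for accented letters depends on the system's iconv library
// and locale, so treat the exact result as platform-specific.
function toAscii(string $utf8Input): string
{
    // @ suppresses the diagnostic iconv raises when it cannot convert.
    $ascii = @iconv('UTF-8', 'ASCII//TRANSLIT', $utf8Input);

    // iconv returns false on failure; keep the original text in that case.
    return $ascii === false ? $utf8Input : $ascii;
}

// iconv works on one string at a time, so map it over a list of values.
$names = array('Zoe', 'Crème brûlée', 'Ångström');
$clean = array_map('toAscii', $names);

print_r($clean);
```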

Answered by Alix Axel

I coded this function, which uses the HTML entities translation table built into PHP to romanize chars:

function Unaccent($string)
{
    if (strpos($string = htmlentities($string, ENT_QUOTES, 'UTF-8'), '&') !== false)
    {
        $string = html_entity_decode(preg_replace('~&([a-z]{1,2})(?:acute|cedil|circ|grave|lig|orn|ring|slash|tilde|uml);~i', '$1', $string), ENT_QUOTES, 'UTF-8');
    }

    return $string;
}

It works by applying htmlentities() and then removing the common entity suffixes. A simple example:

 - ã = &atilde; -> a
 - Ã = &Atilde; -> A
 - õ = &otilde; -> o
 - Õ = &Otilde; -> O
 - æ = &aelig; -> ae
 - Æ = &AElig; -> AE

Beware that for this to work properly your files need to be encoded in UTF-8 (no BOM obviously).
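
To make the two steps visible, here is a small sketch that returns the intermediate entity string alongside the final result. The helper name showUnaccentSteps is made up for illustration; it assumes valid UTF-8 input and uses '$1' as the replacement so that the captured base letter is kept.

```php
<?php
// Hypothetical helper that returns both stages of the entity-based approach.
function showUnaccentSteps(string $text): array
{
    // Step 1: accented letters become named entities, e.g. "é" -> "&eacute;".
    $entities = htmlentities($text, ENT_QUOTES, 'UTF-8');

    // Step 2: keep the one or two base letters and drop the entity suffix.
    $pattern = '~&([a-z]{1,2})(?:acute|cedil|circ|grave|lig|orn|ring|slash|tilde|uml);~i';
    $stripped = preg_replace($pattern, '$1', $entities);

    // Step 3: decode any remaining entities (quotes, ampersands, and so on).
    $plain = html_entity_decode($stripped, ENT_QUOTES, 'UTF-8');

    return array($entities, $plain);
}

list($entities, $plain) = showUnaccentSteps('São Tomé');
echo $entities, PHP_EOL; // S&atilde;o Tom&eacute;
echo $plain, PHP_EOL;    // Sao Tome
```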

See also my other answer for another example.

Answered by sanette

Saw this old question and still don't know what the best answer is. In case it can help others, here is an array I generated automatically from

http://www.fileformat.info/info/charset/UTF-8/list.htm

array ("À" => "A",
"Á" => "A",
"Â" => "A",
"Ã" => "A",
"Ä" => "A",
"Å" => "A",
"Æ" => "AE",
"Ç" => "C",
"È" => "E",
"É" => "E",
"Ê" => "E",
"Ë" => "E",
"Ì" => "I",
"Í" => "I",
"Î" => "I",
"Ï" => "I",
"Ð" => "ETH",
"Ñ" => "N",
"Ò" => "O",
"Ó" => "O",
"Ô" => "O",
"Õ" => "O",
"Ö" => "O",
"Ø" => "O",
"Ù" => "U",
"Ú" => "U",
"Û" => "U",
"Ü" => "U",
"Ý" => "Y",
"Þ" => "THORN",
"ß" => "s",
"à" => "a",
"á" => "a",
"â" => "a",
"ã" => "a",
"ä" => "a",
"å" => "a",
"æ" => "ae",
"ç" => "c",
"è" => "e",
"é" => "e",
"ê" => "e",
"ë" => "e",
"ì" => "i",
"í" => "i",
"î" => "i",
"ï" => "i",
"ð" => "eth",
"ñ" => "n",
"ò" => "o",
"ó" => "o",
"ô" => "o",
"õ" => "o",
"ö" => "o",
"ø" => "o",
"ù" => "u",
"ú" => "u",
"û" => "u",
"ü" => "u",
"ý" => "y",
"þ" => "thorn",
"ÿ" => "y",
"Ā" => "A",
"ā" => "a",
"Ă" => "A",
"ă" => "a",
"Ą" => "A",
"ą" => "a",
"Ć" => "C",
"ć" => "c",
"Ĉ" => "C",
"ĉ" => "c",
"Ċ" => "C",
"ċ" => "c",
"Č" => "C",
"č" => "c",
"Ď" => "D",
"ď" => "d",
"Đ" => "D",
"đ" => "d",
"Ē" => "E",
"ē" => "e",
"Ĕ" => "E",
"ĕ" => "e",
"Ė" => "E",
"ė" => "e",
"Ę" => "E",
"ę" => "e",
"Ě" => "E",
"ě" => "e",
"Ĝ" => "G",
"ĝ" => "g",
"Ğ" => "G",
"ğ" => "g",
"Ġ" => "G",
"ġ" => "g",
"Ģ" => "G",
"ģ" => "g",
"Ĥ" => "H",
"ĥ" => "h",
"Ħ" => "H",
"ħ" => "h",
"Ĩ" => "I",
"ĩ" => "i",
"Ī" => "I",
"ī" => "i",
"Ĭ" => "I",
"ĭ" => "i",
"Į" => "I",
"į" => "i",
"İ" => "I",
"ı" => "i",
"Ĵ" => "J",
"ĵ" => "j",
"Ķ" => "K",
"ķ" => "k",
"ĸ" => "kra",
"Ĺ" => "L",
"ĺ" => "l",
"Ļ" => "L",
"ļ" => "l",
"Ľ" => "L",
"ľ" => "l",
"Ŀ" => "L",
"ŀ" => "l",
"Ł" => "L",
"ł" => "l",
"Ń" => "N",
"ń" => "n",
"Ņ" => "N",
"ņ" => "n",
"Ň" => "N",
"ň" => "n",
"ŉ" => "n",
"Ŋ" => "ENG",
"ŋ" => "eng",
"Ō" => "O",
"ō" => "o",
"Ŏ" => "O",
"ŏ" => "o",
"Ő" => "O",
"ő" => "o",
"Ŕ" => "R",
"ŕ" => "r",
"Ŗ" => "R",
"ŗ" => "r",
"Ř" => "R",
"ř" => "r",
"Ś" => "S",
"ś" => "s",
"Ŝ" => "S",
"ŝ" => "s",
"Ş" => "S",
"ş" => "s",
"Š" => "S",
"š" => "s",
"Ţ" => "T",
"ţ" => "t",
"Ť" => "T",
"ť" => "t",
"Ŧ" => "T",
"ŧ" => "t",
"Ũ" => "U",
"ũ" => "u",
"Ū" => "U",
"ū" => "u",
"Ŭ" => "U",
"ŭ" => "u",
"Ů" => "U",
"ů" => "u",
"Ű" => "U",
"ű" => "u",
"Ų" => "U",
"ų" => "u",
"Ŵ" => "W",
"ŵ" => "w",
"Ŷ" => "Y",
"ŷ" => "y",
"Ÿ" => "Y",
"Ź" => "Z",
"ź" => "z",
"Ż" => "Z",
"ż" => "z",
"Ž" => "Z",
"ž" => "z",
"ſ" => "s",
"ƀ" => "b",
"Ɓ" => "B",
"Ƃ" => "B",
"ƃ" => "b",
"Ƅ" => "SIX",
"ƅ" => "six",
"Ɔ" => "O",
"Ƈ" => "C",
"ƈ" => "c",
"Ɖ" => "D",
"Ɗ" => "D",
"Ƌ" => "D",
"ƌ" => "d",
"ƍ" => "delta",
"Ǝ" => "E",
"Ə" => "SCHWA",
"Ɛ" => "E",
"Ƒ" => "F",
"ƒ" => "f",
"Ɠ" => "G",
"Ɣ" => "GAMMA",
"ƕ" => "hv",
"Ɩ" => "IOTA",
"Ɨ" => "I",
"Ƙ" => "K",
"ƙ" => "k",
"ƚ" => "l",
"ƛ" => "lambda",
"Ɯ" => "M",
"Ɲ" => "N",
"ƞ" => "n",
"Ɵ" => "O",
"Ơ" => "O",
"ơ" => "o",
"Ƣ" => "OI",
"ƣ" => "oi",
"Ƥ" => "P",
"ƥ" => "p",
"Ƨ" => "TWO",
"ƨ" => "two",
"Ʃ" => "ESH",
"ƫ" => "t",
"Ƭ" => "T",
"ƭ" => "t",
"Ʈ" => "T",
"Ư" => "U",
"ư" => "u",
"Ʊ" => "UPSILON",
"Ʋ" => "V",
"Ƴ" => "Y",
"ƴ" => "y",
"Ƶ" => "Z",
"ƶ" => "z",
"Ʒ" => "EZH",
"Ƹ" => "EZH",
"ƹ" => "ezh",
"ƺ" => "ezh",
"Ƽ" => "FIVE",
"ƽ" => "five",
"Ǆ" => "DZ",
"ǅ" => "D",
"ǆ" => "dz",
"Ǉ" => "LJ",
"ǈ" => "L",
"ǉ" => "lj",
"Ǌ" => "NJ",
"ǋ" => "N",
"ǌ" => "nj",
"Ǎ" => "A",
"ǎ" => "a",
"Ǐ" => "I",
"ǐ" => "i",
"Ǒ" => "O",
"ǒ" => "o",
"Ǔ" => "U",
"ǔ" => "u",
"Ǖ" => "U",
"ǖ" => "u",
"Ǘ" => "U",
"ǘ" => "u",
"Ǚ" => "U",
"ǚ" => "u",
"Ǜ" => "U",
"ǜ" => "u",
"ǝ" => "e",
"Ǟ" => "A",
"ǟ" => "a",
"Ǡ" => "A",
"ǡ" => "a",
"Ǣ" => "AE",
"ǣ" => "ae",
"Ǥ" => "G",
"ǥ" => "g",
"Ǧ" => "G",
"ǧ" => "g",
"Ǩ" => "K",
"ǩ" => "k",
"Ǫ" => "O",
"ǫ" => "o",
"Ǭ" => "O",
"ǭ" => "o",
"Ǯ" => "EZH",
"ǯ" => "ezh",
"ǰ" => "j",
"Ǳ" => "DZ",
"ǲ" => "D",
"ǳ" => "dz",
"Ǵ" => "G",
"ǵ" => "g",
"Ƕ" => "HWAIR",
"Ƿ" => "WYNN",
"Ǹ" => "N",
"ǹ" => "n",
"Ǻ" => "A",
"ǻ" => "a",
"Ǽ" => "AE",
"ǽ" => "ae",
"Ǿ" => "O",
"ǿ" => "o",
"Ȁ" => "A",
"ȁ" => "a",
"Ȃ" => "A",
"ȃ" => "a",
"Ȅ" => "E",
"ȅ" => "e",
"Ȇ" => "E",
"ȇ" => "e",
"Ȉ" => "I",
"ȉ" => "i",
"Ȋ" => "I",
"ȋ" => "i",
"Ȍ" => "O",
"ȍ" => "o",
"Ȏ" => "O",
"ȏ" => "o",
"Ȑ" => "R",
"ȑ" => "r",
"Ȓ" => "R",
"ȓ" => "r",
"Ȕ" => "U",
"ȕ" => "u",
"Ȗ" => "U",
"ȗ" => "u",
"Ș" => "S",
"ș" => "s",
"Ț" => "T",
"ț" => "t",
"Ȝ" => "YOGH",
"ȝ" => "yogh",
"Ȟ" => "H",
"ȟ" => "h",
"Ƞ" => "N",
"ȡ" => "d",
"Ȣ" => "OU",
"ȣ" => "ou",
"Ȥ" => "Z",
"ȥ" => "z",
"Ȧ" => "A",
"ȧ" => "a",
"Ȩ" => "E",
"ȩ" => "e",
"Ȫ" => "O",
"ȫ" => "o",
"Ȭ" => "O",
"ȭ" => "o",
"Ȯ" => "O",
"ȯ" => "o",
"Ȱ" => "O",
"ȱ" => "o",
"Ȳ" => "Y",
"ȳ" => "y",
"ȴ" => "l",
"ȵ" => "n",
"ȶ" => "t",
"ȷ" => "j",
"ȸ" => "db",
"ȹ" => "qp",
"Ⱥ" => "A",
"Ȼ" => "C",
"ȼ" => "c",
"Ƚ" => "L",
"Ⱦ" => "T",
"ȿ" => "s",
"ɀ" => "z",
"Ɂ" => "STOP",
"ɂ" => "stop",
"Ƀ" => "B",
"Ʉ" => "U",
"Ʌ" => "V",
"Ɇ" => "E",
"ɇ" => "e",
"Ɉ" => "J",
"ɉ" => "j",
"Ɋ" => "Q",
"ɋ" => "q",
"Ɍ" => "R",
"ɍ" => "r",
"Ɏ" => "Y",
"ɏ" => "y",
"ɐ" => "a",
"ɑ" => "alpha",
"ɒ" => "alpha",
"ɓ" => "b",
"ɔ" => "o",
"ɕ" => "c",
"ɖ" => "d",
"ɗ" => "d",
"ɘ" => "e",
"ə" => "schwa",
"ɚ" => "schwa",
"ɛ" => "e",
"ɜ" => "e",
"ɝ" => "e",
"ɞ" => "e",
"ɟ" => "j",
"ɠ" => "g",
"ɡ" => "script",
"ɣ" => "gamma",
"ɤ" => "rams",
"ɥ" => "h",
"ɦ" => "h",
"ɧ" => "heng",
"ɨ" => "i",
"ɩ" => "iota",
"ɫ" => "l",
"ɬ" => "l",
"ɭ" => "l",
"ɮ" => "lezh",
"ɯ" => "m",
"ɰ" => "m",
"ɱ" => "m",
"ɲ" => "n",
"ɳ" => "n",
"ɵ" => "barred",
"ɷ" => "omega",
"ɸ" => "phi",
"ɹ" => "r",
"ɺ" => "r",
"ɻ" => "r",
"ɼ" => "r",
"ɽ" => "r",
"ɾ" => "r",
"ɿ" => "r",
"ʂ" => "s",
"ʃ" => "esh",
"ʄ" => "j",
"ʅ" => "squat",
"ʆ" => "esh",
"ʇ" => "t",
"ʈ" => "t",
"ʉ" => "u",
"ʊ" => "upsilon",
"ʋ" => "v",
"ʌ" => "v",
"ʍ" => "w",
"ʎ" => "y",
"ʐ" => "z",
"ʑ" => "z",
"ʒ" => "ezh",
"ʓ" => "ezh",
"ʚ" => "e",
"ʞ" => "k",
"ʠ" => "q",
"ʣ" => "dz",
"ʤ" => "dezh",
"ʥ" => "dz",
"ʦ" => "ts",
"ʧ" => "tesh",
"ʨ" => "tc",
"ʩ" => "feng",
"ʪ" => "ls",
"ʫ" => "lz",
"ʮ" => "h",
"ʯ" => "h")
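
One possible way to apply a map like this is strtr, which replaces every key with its value in a single pass over the string. The sketch below uses a three-entry excerpt and a variable name, $map, chosen only for illustration; it is not part of the answer.

```php
<?php
// Small excerpt of a character map; the variable name is an assumption.
$map = array(
    "Æ" => "AE",
    "é" => "e",
    "ñ" => "n",
);

// strtr() scans the string once, so replaced text is never replaced again.
echo strtr("Æon café, señor", $map), PHP_EOL; // AEon cafe, senor
```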

Answered by infralabs

I hope this will be useful for anybody: https://github.com/infralabs/DiacriticsRemovePHP

This class removes diacritics from strings containing Latin-1 Supplement, Latin Extended-A and Latin Extended-B special characters.

Usage:

$specialCharacters = "";
$specialCharacters .= "Latin-1 Supplement".PHP_EOL;
$specialCharacters .= "ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ".PHP_EOL;
$specialCharacters .= "Latin Extended-A".PHP_EOL;
$specialCharacters .= "ĀāĂăĄąĆćĈĉĊċČčĎďĐđĒēĔĕĖėĘęĚěĜĝĞğĠġĢģĤĥĦħĨĩĪīĬĭĮįİıĲĳĴĵĶķĸĹĺĻļĽľĿŀŁłŃńŅņŇňŉŊŋŌōŎŏŐőŒœŔŕŖŗŘřŚśŜŝŞşŠšŢţŤťŦŧŨũŪūŬŭŮůŰűŲųŴŵŶŷŸŹźŻżŽžſ".PHP_EOL;
$specialCharacters .= "Latin Extended-B".PHP_EOL;
$specialCharacters .= "ƒǺǻǼǽǾǿ".PHP_EOL;
$specialCharacters .= "Latin Extended Additional".PHP_EOL;
$specialCharacters .= "ẀẁẂẃẄẅỲỳ".PHP_EOL;

print "<pre>";
print removeDiacritics($specialCharacters).PHP_EOL;
print "</pre>";

Source:

Latin-1 Supplement

ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ

Latin Extended-A

ĀāĂăĄąĆćĈĉĊċČčĎďĐđĒēĔĕĖėĘęĚěĜĝĞğĠġĢģĤĥĦħĨĩĪīĬĭĮįİıĲĳĴĵĶķĸĹĺĻļĽľĿŀŁłŃńŅņŇňŉŊŋŌōŎŏŐőŒœŔŕŖŗŘřŚśŜŝŞşŠšŢţŤťŦŧŨũŪūŬŭŮůŰűŲųŴŵŶŷŸŹźŻżŽžſ

Latin Extended-B

ƒǺǻǼǽǾǿ

Latin Extended Additional

ẀẁẂẃẄẅỲỳ

Result:

Latin-1 Supplement

AAAAAAAECEEEEIIIIDNOOOOO×OUUUUYTHssaaaaaaaeceeeeiiiidnooooo÷ouuuuythy

Latin Extended-A

AaAaAaCcCcCcCcDdDdEeEeEeEeEeGgGgGgGgHhHhIiIiIiIiIi?ijJjKk?LlLlLlLlLlNnNnNnnNnOoOoOoOEoeRrRrRrSsSsSsSsTtTtTtUuUuUuUuUuUuWwYyYZzZzZzs

Latin Extended-B

fAaAEaeOo

Latin Extended Additional

WwWwWwYy

Answered by Michał Kosmulski

The most generic way to solve this is to use Unicode Normalization, as it works automatically on all accents, so you don't have to prepare the list up front. I don't know if it's easily available in PHP; I have used it in C and Java. Essentially, you first transform the string so that every accented character is represented by a regular character plus a so-called combining diacritical mark (a built-in or external library should provide this function), and then remove the combining diacritics (using a specialized library, character properties the language provides, or some regular expression extensions).
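
In PHP, this approach can be sketched with the Normalizer class from the intl extension and a PCRE Unicode property class. The function name stripDiacritics is hypothetical, and the sketch assumes the intl extension is loaded and the input is valid UTF-8. Letters that have no canonical decomposition, such as ł or ø, pass through unchanged.

```php
<?php
// Hypothetical helper; requires the intl extension for the Normalizer class.
function stripDiacritics(string $text): string
{
    // Step 1: canonical decomposition (NFD) splits "é" into "e" + U+0301.
    $decomposed = Normalizer::normalize($text, Normalizer::FORM_D);
    if ($decomposed === false) {
        return $text; // normalization failed: leave the input untouched
    }

    // Step 2: remove nonspacing combining marks (Unicode category Mn).
    $stripped = preg_replace('/\p{Mn}+/u', '', $decomposed);

    return $stripped === null ? $text : $stripped;
}

// Only run the demo when the intl extension is available.
if (class_exists('Normalizer')) {
    echo stripDiacritics('Zagórski, crème brûlée'), PHP_EOL; // Zagorski, creme brulee
}
```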
