Note: this page reproduces a popular Stack Overflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Stack Overflow original: http://stackoverflow.com/questions/3542717/

Date: 2020-08-25 10:10:11  Source: igfitidea

How to remove accents and turn letters into "plain" ASCII characters?

Tags: php, regex, string, ascii

Asked by Mark Lalor

What is the most efficient way to remove accents from a string, e.g. "ÈâuÑ" becomes "Eaun"?

Is there a simple, built-in way that I'm missing, or a regular expression?

Answered by Piskvor left the building

If you have iconv installed, try this (the example assumes your input string is in UTF-8):

echo iconv('UTF-8', 'ASCII//TRANSLIT', $string);

(iconv is a library to convert between all kinds of encodings; it's efficient and included with many PHP distributions by default. Most of all, it's definitely easier and more error-proof than trying to roll your own solution (did you know that there's a "Latin letter N with a curl"? Me neither.))
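The one-liner above can be wrapped so that failures are visible. This is only a minimal sketch: the helper name `iconv_to_ascii` is hypothetical, and the exact replacement characters that `//TRANSLIT` produces depend on the iconv implementation and, on some platforms (glibc, for instance), on the current `LC_CTYPE` locale.

```php
<?php
// Hypothetical helper (not part of the original answer): transliterate
// UTF-8 text to ASCII with iconv and report failure as null.
function iconv_to_ascii(string $text): ?string
{
    if (!function_exists('iconv')) {
        return null; // iconv extension is not available
    }

    // iconv() returns false (and raises a notice or warning) when the input
    // is not valid UTF-8 or the target charset string is not supported.
    // The @ operator only silences that message; the false is still checked.
    $result = @iconv('UTF-8', 'ASCII//TRANSLIT', $text);

    return $result === false ? null : $result;
}

var_dump(iconv_to_ascii('ÈâuÑ'));
```

Because the output varies between systems, it is worth checking the result on the platform you deploy to rather than relying on what a development machine prints.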

Answered by SimonSimCity

I found a solution that worked in all my test cases (copied from http://php.net/manual/en/transliterator.transliterate.php):

var_dump(transliterator_transliterate('Any-Latin; Latin-ASCII; [\u0080-\u7fff] remove',
    "A æ Übérmensch på høyeste nivå! И я люблю PHP! есть. ﬁ ¦"));
// string(50) "A ae Ubermensch pa hoyeste niva! I a lublu PHP! est. fi "

see: http://www.php.net/normalizer

EDIT: This solution is independent of the locale set using setlocale(). Another benefit over iconv() is that even non-Latin characters are not ignored.

EDIT2: I discovered that some characters are not covered by the transliteration I originally posted. Any-Latin translates the Cyrillic character ь to a character that doesn't fit into a Latin character set: ʹ (http://en.wikipedia.org/wiki/Prime_%28symbol%29). I've added [\u0100-\u7fff] remove to remove all these non-Latin characters. I also added a test to the text ;)

I suppose that by Latin they mean the Latin alphabet here, not one of the Latin character sets. But anyway, in my opinion Latin-ASCII should then transliterate it to something ASCII...

EDIT3: Sorry for another change here. I had to take the lower bound of the removed range down to \u0080 instead of \u0100 to get only ASCII characters as output. The test above is updated.
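If the same rule chain is applied to many strings, it may be convenient to build the transliterator once and reuse it. The sketch below assumes the intl extension is loaded; the helper name `translit_to_ascii` is hypothetical and the rule string is the one from this answer.

```php
<?php
// Hypothetical helper: reuse one Transliterator instance for the rule
// chain shown in the answer. Returns null when intl is missing, the
// rules cannot be compiled, or the transliteration itself fails.
function translit_to_ascii(string $text): ?string
{
    static $translit = null;

    if (!class_exists('Transliterator')) {
        return null; // intl extension is not available
    }

    if ($translit === null) {
        // Single quotes keep \u0080 literal so that ICU parses the escape.
        $translit = Transliterator::create(
            'Any-Latin; Latin-ASCII; [\u0080-\u7fff] remove'
        );
        if ($translit === null) {
            return null;
        }
    }

    $result = $translit->transliterate($text);

    return $result === false ? null : $result;
}

var_dump(translit_to_ascii('Übérmensch'));
```

With intl available this should give "Ubermensch". Note that the `remove` rule only covers code points up to \u7fff, so characters above that range that Any-Latin leaves untouched would pass through.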

Answered by neokio

Reposting this on request of @palantir ...

I find iconv completely unreliable, and I dislike preg_replace solutions and big arrays ... so my favorite way (and the only reliable method I've found) is ...

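function toASCII( $str )
{
    // utf8_decode() converts UTF-8 to ISO-8859-1, so strtr() can map one byte to one byte.
    // Note: utf8_decode() is deprecated as of PHP 8.2.
    return strtr(utf8_decode($str),
        utf8_decode(
        'ŠŒŽšœžŸ¥µÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýÿ'),
        'SOZsozYYuAAAAAAACEEEEIIIIDNOOOOOOUUUUYsaaaaaaaceeeeiiiionoooooouuuuyy');
}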
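A variant of the same idea can work on UTF-8 directly, which avoids `utf8_decode()` (deprecated as of PHP 8.2) and the loss of characters that do not exist in ISO-8859-1. The following is a sketch, not the answer author's code: the function name `strip_accents_utf8` is hypothetical, the source file is assumed to be saved as UTF-8, and the character table is the one from the answer, split into groups so that each group can be checked against its replacements.

```php
<?php
// Hypothetical helper: the answer's character table, applied to UTF-8
// text through strtr()'s array form instead of utf8_decode().
function strip_accents_utf8(string $text): string
{
    static $map = null;

    if ($map === null) {
        // Each key lists accented characters; the value lists their ASCII
        // replacements in the same order (one ASCII byte per character).
        $groups = [
            'ŠŒŽšœžŸ¥µ' => 'SOZsozYYu',
            'ÀÁÂÃÄÅÆ'   => 'AAAAAAA',
            'ÇÈÉÊËÌÍÎÏ' => 'CEEEEIIII',
            'ÐÑÒÓÔÕÖØ'  => 'DNOOOOOO',
            'ÙÚÛÜÝß'    => 'UUUUYs',
            'àáâãäåæ'   => 'aaaaaaa',
            'çèéêëìíîï' => 'ceeeeiiii',
            'ðñòóôõöø'  => 'onoooooo',
            'ùúûüýÿ'    => 'uuuuyy',
        ];

        $map = [];
        foreach ($groups as $from => $to) {
            // Split the UTF-8 key into characters and pair each one with
            // the ASCII byte at the same position.
            $chars = preg_split('//u', $from, -1, PREG_SPLIT_NO_EMPTY);
            $map += array_combine($chars, str_split($to));
        }
    }

    // The array form of strtr() replaces whole keys, so multi-byte
    // characters are matched as units.
    return strtr($text, $map);
}

var_dump(strip_accents_utf8('ÈâuÑ'));
```

Like the original table, this only covers the listed characters and maps each of them to a single letter (Æ becomes A, ß becomes s), so anything outside the table passes through unchanged.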

Answered by Gumbo

You can use iconv to transliterate the characters to plain US-ASCII and then use a regular expression to remove non-alphabetic characters:

preg_replace('/[^a-z]/i', '', iconv("UTF-8", "US-ASCII//TRANSLIT", $text))

Another way would be using the Normalizer to normalize to Normalization Form KD (NFKD) and then remove the mark characters:

preg_replace('/\p{Mn}/u', '', Normalizer::normalize($text, Normalizer::FORM_KD))
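The NFKD approach can be sketched as a small function with its failure cases handled. The name `strip_marks` is hypothetical and the intl extension is assumed for the `Normalizer` class. One limitation to keep in mind: letters that have no Unicode decomposition, such as ø, æ, or ß, are left as they are, so the result is not guaranteed to be pure ASCII.

```php
<?php
// Hypothetical helper: decompose to NFKD, then drop non-spacing marks
// (\p{Mn}), which is where the detached accents end up.
function strip_marks(string $text): ?string
{
    if (!class_exists('Normalizer')) {
        return null; // intl extension is not available
    }

    // normalize() returns false when the input is not valid UTF-8.
    $decomposed = Normalizer::normalize($text, Normalizer::FORM_KD);
    if ($decomposed === false) {
        return null;
    }

    // preg_replace() returns null on a PCRE error, which is passed through.
    return preg_replace('/\p{Mn}/u', '', $decomposed);
}

var_dump(strip_marks('ÈâuÑ'));
var_dump(strip_marks('høyeste'));
```

The second call illustrates the limitation: ø has no decomposition, so it survives the mark removal.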

Answered by Johnny Broadway

Note: I'm reposting this from another similar question in the hope that it's helpful to others.

I ended up writing a PHP library based on URLify.js from the Django project, since I found iconv() to be too incomplete. You can find it here:

https://github.com/jbroadway/urlify

Handles Latin characters as well as Greek, Turkish, Russian, Ukrainian, Czech, Polish, and Latvian.
