How do I remove diacritics (accents) from a string in .NET?
Original question: http://stackoverflow.com/questions/249087/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors on Stack Overflow.
Asked by James Hall
I'm trying to convert some strings that are in French Canadian and, basically, I'd like to be able to take out the French accent marks in the letters while keeping the letter (e.g. convert é to e, so crème brûlée would become creme brulee).
What is the best method for achieving this?
Answered by Blair Conrad
I've not used this method, but Michael Kaplan describes a method for doing so in his blog post (with a confusing title) that talks about stripping diacritics: Stripping is an interesting job (aka On the meaning of meaningless, aka All Mn characters are non-spacing, but some are more non-spacing than others)
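using System.Globalization;
using System.Text;

static string RemoveDiacritics(string text)
{
    // FormD splits each accented character into a base character plus combining marks.
    var normalizedString = text.Normalize(NormalizationForm.FormD);
    var stringBuilder = new StringBuilder();
    foreach (var c in normalizedString)
    {
        var unicodeCategory = CharUnicodeInfo.GetUnicodeCategory(c);
        // Keep everything except the combining (non-spacing) marks.
        if (unicodeCategory != UnicodeCategory.NonSpacingMark)
        {
            stringBuilder.Append(c);
        }
    }
    // Recompose whatever is left.
    return stringBuilder.ToString().Normalize(NormalizationForm.FormC);
}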
Note that this is a followup to his earlier post: Stripping diacritics....
The approach uses String.Normalize to split the input string into its constituent glyphs (basically separating the "base" characters from the diacritics), then scans the result and retains only the base characters. It's a little complicated, but really you're looking at a complicated problem.
Of course, if you're limiting yourself to French, you could probably get away with the simple table-based approach in How to remove accents and tilde in a C++ std::string, as recommended by @David Dibben.
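To make the decomposition step concrete, here is a small sketch (not from Kaplan's post; the class and method names are made up for illustration) that prints what FormD normalization produces for a single accented letter. It only uses String.Normalize and CharUnicodeInfo.GetUnicodeCategory, the same calls as the method above.

```csharp
using System;
using System.Globalization;
using System.Text;

// Hypothetical helper, for illustration only.
public static class DecompositionDemo
{
    // Lists each UTF-16 code unit of the FormD-normalized input with its Unicode category.
    public static string Describe(string text)
    {
        var sb = new StringBuilder();
        foreach (char c in text.Normalize(NormalizationForm.FormD))
        {
            sb.AppendFormat("U+{0:X4}:{1};", (int)c, CharUnicodeInfo.GetUnicodeCategory(c));
        }
        return sb.ToString();
    }

    public static void Main()
    {
        // "\u00E9" is é; FormD splits it into e (U+0065) and the combining acute accent (U+0301).
        Console.WriteLine(Describe("\u00E9"));
        // prints: U+0065:LowercaseLetter;U+0301:NonSpacingMark;
    }
}
```

The second entry is the NonSpacingMark that the filtering loop drops, which is why only the plain e survives.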
Answered by azrafe7
This did the trick for me...
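string accentedStr = "crème brûlée"; // sample input; the original snippet left this unassigned
// Encoding to a code page that lacks these letters makes the encoder fall back to the closest base letters.
byte[] tempBytes = System.Text.Encoding.GetEncoding("ISO-8859-8").GetBytes(accentedStr);
string asciiStr = System.Text.Encoding.UTF8.GetString(tempBytes);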
Quick and short!
Answered by Luk
In case someone is interested: I was looking for something similar and ended up writing the following:
public static string NormalizeStringForUrl(string name)
{
    String normalizedString = name.Normalize(NormalizationForm.FormD);
    StringBuilder stringBuilder = new StringBuilder();
    foreach (char c in normalizedString)
    {
        switch (CharUnicodeInfo.GetUnicodeCategory(c))
        {
            case UnicodeCategory.LowercaseLetter:
            case UnicodeCategory.UppercaseLetter:
            case UnicodeCategory.DecimalDigitNumber:
                stringBuilder.Append(c);
                break;
            case UnicodeCategory.SpaceSeparator:
            case UnicodeCategory.ConnectorPunctuation:
            case UnicodeCategory.DashPunctuation:
                stringBuilder.Append('_');
                break;
        }
    }
    string result = stringBuilder.ToString();
    return String.Join("_", result.Split(new char[] { '_' },
        StringSplitOptions.RemoveEmptyEntries)); // remove duplicate underscores
}
Answered by CIRCLE
I needed something that converts all major Unicode characters, and the top-voted answer left a few out, so I ported CodeIgniter's convert_accented_characters($str) to C# in a form that is easy to customise:
using System;
using System.Text;
using System.Collections.Generic;
public static class Strings
{
static Dictionary<string, string> foreign_characters = new Dictionary<string, string>
{
{ "äæǽ", "ae" },
{ "öœ", "oe" },
{ "ü", "ue" },
{ "Ä", "Ae" },
{ "Ü", "Ue" },
{ "Ö", "Oe" },
{ "àá?????ā??ǎΑ????????????А", "A" },
{ "àáa???ā??ǎaα?????????????а", "a" },
{ "Б", "B" },
{ "б", "b" },
{ "?????", "C" },
{ "?????", "c" },
{ "Д", "D" },
{ "д", "d" },
{ "D??Δ", "Dj" },
{ "e??δ", "dj" },
{ "èéê?ē???ěΕ?????????ЕЭ", "E" },
{ "èéê?ē???ě?ε????????еэ", "e" },
{ "Ф", "F" },
{ "ф", "f" },
{ "????ΓГ?", "G" },
{ "????γг?", "g" },
{ "??", "H" },
{ "??", "h" },
{ "ìí???ī?ǐ??Η??Ι???ИЫ", "I" },
{ "ìí???ī?ǐ??η??ι???иы?", "i" },
{ "?", "J" },
{ "?", "j" },
{ "?ΚК", "K" },
{ "?κк", "k" },
{ "?????ΛЛ", "L" },
{ "?????λл", "l" },
{ "М", "M" },
{ "м", "m" },
{ "????ΝН", "N" },
{ "?ń?ň?νн", "n" },
{ "òó??ō?ǒ????Ο?Ω?????????????О", "O" },
{ "òó??ō?ǒ????oο?ω?????????????о", "o" },
{ "П", "P" },
{ "п", "p" },
{ "???ΡР", "R" },
{ "???ρр", "r" },
{ "?????ΣС", "S" },
{ "??????σ?с", "s" },
{ "????τТ", "T" },
{ "????т", "t" },
{ "ùú??ū?????ǔǖǘǚǜ????????У", "U" },
{ "ùú??ū?????ǔǖǘǚǜυ?????????у", "u" },
{ "Y??Υ??????Й", "Y" },
{ "y??????й", "y" },
{ "В", "V" },
{ "в", "v" },
{ "?", "W" },
{ "?", "w" },
{ "???ΖЗ", "Z" },
{ "???ζз", "z" },
{ "ÆǼ", "AE" },
{ "ß", "ss" },
{ "Ĳ", "IJ" },
{ "ĳ", "ij" },
{ "Œ", "OE" },
{ "ƒ", "f" },
{ "ξ", "ks" },
{ "π", "p" },
{ "β", "v" },
{ "μ", "m" },
{ "ψ", "ps" },
{ "Ё", "Yo" },
{ "ё", "yo" },
{ "?", "Ye" },
{ "?", "ye" },
{ "?", "Yi" },
{ "Ж", "Zh" },
{ "ж", "zh" },
{ "Х", "Kh" },
{ "х", "kh" },
{ "Ц", "Ts" },
{ "ц", "ts" },
{ "Ч", "Ch" },
{ "ч", "ch" },
{ "Ш", "Sh" },
{ "ш", "sh" },
{ "Щ", "Shch" },
{ "щ", "shch" },
{ "ЪъЬь", "" },
{ "Ю", "Yu" },
{ "ю", "yu" },
{ "Я", "Ya" },
{ "я", "ya" },
};
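    // Returns the first character of the replacement, or the input char if none applies.
    public static char RemoveDiacritics(this char c)
    {
        foreach (KeyValuePair<string, string> entry in foreign_characters)
        {
            // Skip empty replacements (e.g. "ЪъЬь") so that Value[0] cannot throw.
            if (entry.Key.IndexOf(c) != -1 && entry.Value.Length > 0)
            {
                return entry.Value[0];
            }
        }
        return c;
    }
    public static string RemoveDiacritics(this string s)
    {
        // StringBuilder avoids reallocating the string on every appended character.
        var sb = new StringBuilder(s.Length);
        foreach (char c in s)
        {
            bool replaced = false;
            foreach (KeyValuePair<string, string> entry in foreign_characters)
            {
                if (entry.Key.IndexOf(c) != -1)
                {
                    // An empty replacement drops the character, as in the table entry for "ЪъЬь".
                    sb.Append(entry.Value);
                    replaced = true;
                    break;
                }
            }
            if (!replaced)
            {
                sb.Append(c);
            }
        }
        return sb.ToString();
    }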
}
Usage
// for strings
"crème brûlée".RemoveDiacritics(); // creme brulee
// for chars
"?"[0].RemoveDiacritics (); // A
Answered by KenE
In case anyone's interested, here is the Java equivalent:
import java.text.Normalizer;
public class MyClass
{
    public static String removeDiacritics(String input)
    {
        String nrml = Normalizer.normalize(input, Normalizer.Form.NFD);
        StringBuilder stripped = new StringBuilder();
        for (int i = 0; i < nrml.length(); ++i)
        {
            if (Character.getType(nrml.charAt(i)) != Character.NON_SPACING_MARK)
            {
                stripped.append(nrml.charAt(i));
            }
        }
        return stripped.toString();
    }
}
Answered by realbart
I often use an extension method based on another version I found here (see Replacing characters in C# (ascii)). A quick explanation:
- Normalizing to form D splits characters like è into an e and a non-spacing `
- From this, the non-spacing characters are removed
- The result is normalized back to form C (I'm not sure if this is necessary)
Code:
using System.Linq;
using System.Text;
using System.Globalization;
// namespace here
public static class Utility
{
    public static string RemoveDiacritics(this string str)
    {
        if (null == str) return null;
        var chars =
            from c in str.Normalize(NormalizationForm.FormD).ToCharArray()
            let uc = CharUnicodeInfo.GetUnicodeCategory(c)
            where uc != UnicodeCategory.NonSpacingMark
            select c;
        var cleanStr = new string(chars.ToArray()).Normalize(NormalizationForm.FormC);
        return cleanStr;
    }
    // or, alternatively
    public static string RemoveDiacritics2(this string str)
    {
        if (null == str) return null;
        var chars = str
            .Normalize(NormalizationForm.FormD)
            .ToCharArray()
            .Where(c => CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
            .ToArray();
        return new string(chars).Normalize(NormalizationForm.FormC);
    }
}
Answered by Sergio Cabral
The Greek (ISO) code page can do it
Information about this code page is available from System.Text.Encoding.GetEncodings(). Learn more at: https://msdn.microsoft.com/pt-br/library/system.text.encodinginfo.getencoding(v=vs.110).aspx
Greek (ISO) has code page 28597 and the name iso-8859-7.
On to the code... \o/
string text = "Você está numa situação lamentável";
string textEncode = System.Web.HttpUtility.UrlEncode(text, Encoding.GetEncoding("iso-8859-7"));
//result: "Voce+esta+numa+situacao+lamentavel"
string textDecode = System.Web.HttpUtility.UrlDecode(textEncode);
//result: "Voce esta numa situacao lamentavel"
So, write this function...
public string RemoveAcentuation(string text)
{
    return
        System.Web.HttpUtility.UrlDecode(
            System.Web.HttpUtility.UrlEncode(
                text, Encoding.GetEncoding("iso-8859-7")));
}
Note that Encoding.GetEncoding("iso-8859-7") is equivalent to Encoding.GetEncoding(28597): the first takes the encoding's name, the second its code page.
Answered by EricBDev
It's funny that such a question can get so many answers, and yet none fit my requirements :) There are so many languages around that a fully language-agnostic solution is, AFAIK, not really possible; as others have mentioned, FormC and FormD cause issues.
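One concrete way the normalization forms fall short, as a rough sketch (the class name is made up; the method is the same Normalize-and-filter technique from the other answers): letters such as ø, ł, ß and æ have no canonical decomposition, so there is no combining mark to strip and they pass through untouched.

```csharp
using System;
using System.Globalization;
using System.Linq;
using System.Text;

// Hypothetical demo class, for illustration only.
public static class DecompositionLimits
{
    // Decompose, drop non-spacing marks, recompose.
    public static string StripMarks(string s)
    {
        var kept = s.Normalize(NormalizationForm.FormD)
                    .Where(c => CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
                    .ToArray();
        return new string(kept).Normalize(NormalizationForm.FormC);
    }

    public static void Main()
    {
        // é (U+00E9) decomposes into e + combining acute, so the accent is removed.
        Console.WriteLine(StripMarks("\u00E9") == "e");
        // ø, ł, ß and æ do not decompose, so the string comes back unchanged.
        string stubborn = "\u00F8\u0142\u00DF\u00E6";
        Console.WriteLine(StripMarks(stubborn) == stubborn);
    }
}
```

Characters like these need an explicit mapping, which is where table-based approaches come in.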
Since the original question was related to French, the simplest working answer is indeed:
public static string ConvertWesternEuropeanToASCII(this string str)
{
    return Encoding.ASCII.GetString(Encoding.GetEncoding(1251).GetBytes(str));
}
1251 should be replaced by the encoding code of the input language.
This, however, only replaces one character with one character. Since I am also working with German input, I wrote a manual conversion:
public static string LatinizeGermanCharacters(this string str)
{
    StringBuilder sb = new StringBuilder(str.Length);
    foreach (char c in str)
    {
        switch (c)
        {
            case 'ä':
                sb.Append("ae");
                break;
            case 'ö':
                sb.Append("oe");
                break;
            case 'ü':
                sb.Append("ue");
                break;
            case 'Ä':
                sb.Append("Ae");
                break;
            case 'Ö':
                sb.Append("Oe");
                break;
            case 'Ü':
                sb.Append("Ue");
                break;
            case 'ß':
                sb.Append("ss");
                break;
            default:
                sb.Append(c);
                break;
        }
    }
    return sb.ToString();
}
It might not deliver the best performance, but at least it is very easy to read and extend. Regex is a NO GO, much slower than any char/string stuff.
I also have a very simple method to remove spaces:
public static string RemoveSpace(this string str)
{
    return str.Replace(" ", string.Empty);
}
In the end, I use a combination of all three extensions above:
public static string LatinizeAndConvertToASCII(this string str, bool keepSpace = false)
{
    str = str.LatinizeGermanCharacters().ConvertWesternEuropeanToASCII();
    return keepSpace ? str : str.RemoveSpace();
}
And a small (non-exhaustive) unit test for it, which passes:
[TestMethod()]
public void LatinizeAndConvertToASCIITest()
{
    string europeanStr = "Bonjour ça va? C'est l'été! Ich möchte ? ? á à a ê é è ? ? é ? ? ? í ì ó ò ? ? ? ü ü ù ú ? ? y Y ? ? ? ?";
    string expected = "Bonjourcava?C'estl'ete!IchmoechteaeAeaaaeeeeEEiIiiiooooeOeUeueuuuUyYcCnN";
    string actual = europeanStr.LatinizeAndConvertToASCII();
    Assert.AreEqual(expected, actual);
}
Answered by hashable
This works fine in Java.
It basically converts all accented characters into their unaccented counterparts followed by their combining diacritics. You can then use a regex to strip off the diacritics.
import java.text.Normalizer;
import java.util.regex.Pattern;
public String deAccent(String str) {
    String nfdNormalizedString = Normalizer.normalize(str, Normalizer.Form.NFD);
    // The backslash must be doubled inside a Java string literal.
    Pattern pattern = Pattern.compile("\\p{InCombiningDiacriticalMarks}+");
    return pattern.matcher(nfdNormalizedString).replaceAll("");
}
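For the .NET readers, the same normalize-then-regex idea might look roughly like this in C#. This is a sketch, not a port that hashable posted: the class and method names are made up, and it matches the \p{Mn} (non-spacing mark) category rather than the combining-diacritics block the Java pattern uses, to stay in line with the other C# answers here.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical class name, for illustration only.
public static class RegexDiacritics
{
    // \p{Mn} matches the Unicode "Mark, Nonspacing" general category in .NET regexes.
    private static readonly Regex NonSpacingMarks = new Regex(@"\p{Mn}+", RegexOptions.Compiled);

    public static string DeAccent(string str)
    {
        if (str == null) return null;
        // Decompose first so that accents become separate combining characters.
        string decomposed = str.Normalize(NormalizationForm.FormD);
        return NonSpacingMarks.Replace(decomposed, string.Empty)
                              .Normalize(NormalizationForm.FormC);
    }

    public static void Main()
    {
        // The escapes spell "crème brûlée" without depending on the source file encoding.
        Console.WriteLine(DeAccent("cr\u00E8me br\u00FBl\u00E9e")); // prints: creme brulee
    }
}
```

Whether the regex is faster or slower than the character loop is something to measure for your own input.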
Answered by Andy Raddatz
TL;DR - C# string extension method
I think the best solution for preserving the meaning of the string is to convert the characters instead of stripping them, which is well illustrated by the example crème brûlée: crme brle vs. creme brulee.
I checked out Alexander's comment above and saw that the Lucene.Net code is Apache 2.0 licensed, so I've modified the class into a simple string extension method. You can use it like this:
var originalString = "crème brûlée";
var maxLength = originalString.Length; // limit output length as necessary
var foldedString = originalString.FoldToASCII(maxLength);
// "creme brulee"
The function is too long to post in a StackOverflow answer (~139k characters of 30k allowed lol) so I made a gist and attributed the authors:
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/// <summary>
/// This class converts alphabetic, numeric, and symbolic Unicode characters
/// which are not in the first 127 ASCII characters (the "Basic Latin" Unicode
/// block) into their ASCII equivalents, if one exists.
/// <para/>
/// Characters from the following Unicode blocks are converted; however, only
/// those characters with reasonable ASCII alternatives are converted:
///
/// <ul>
/// <item><description>C1 Controls and Latin-1 Supplement: <a href="http://www.unicode.org/charts/PDF/U0080.pdf">http://www.unicode.org/charts/PDF/U0080.pdf</a></description></item>
/// <item><description>Latin Extended-A: <a href="http://www.unicode.org/charts/PDF/U0100.pdf">http://www.unicode.org/charts/PDF/U0100.pdf</a></description></item>
/// <item><description>Latin Extended-B: <a href="http://www.unicode.org/charts/PDF/U0180.pdf">http://www.unicode.org/charts/PDF/U0180.pdf</a></description></item>
/// <item><description>Latin Extended Additional: <a href="http://www.unicode.org/charts/PDF/U1E00.pdf">http://www.unicode.org/charts/PDF/U1E00.pdf</a></description></item>
/// <item><description>Latin Extended-C: <a href="http://www.unicode.org/charts/PDF/U2C60.pdf">http://www.unicode.org/charts/PDF/U2C60.pdf</a></description></item>
/// <item><description>Latin Extended-D: <a href="http://www.unicode.org/charts/PDF/UA720.pdf">http://www.unicode.org/charts/PDF/UA720.pdf</a></description></item>
/// <item><description>IPA Extensions: <a href="http://www.unicode.org/charts/PDF/U0250.pdf">http://www.unicode.org/charts/PDF/U0250.pdf</a></description></item>
/// <item><description>Phonetic Extensions: <a href="http://www.unicode.org/charts/PDF/U1D00.pdf">http://www.unicode.org/charts/PDF/U1D00.pdf</a></description></item>
/// <item><description>Phonetic Extensions Supplement: <a href="http://www.unicode.org/charts/PDF/U1D80.pdf">http://www.unicode.org/charts/PDF/U1D80.pdf</a></description></item>
/// <item><description>General Punctuation: <a href="http://www.unicode.org/charts/PDF/U2000.pdf">http://www.unicode.org/charts/PDF/U2000.pdf</a></description></item>
/// <item><description>Superscripts and Subscripts: <a href="http://www.unicode.org/charts/PDF/U2070.pdf">http://www.unicode.org/charts/PDF/U2070.pdf</a></description></item>
/// <item><description>Enclosed Alphanumerics: <a href="http://www.unicode.org/charts/PDF/U2460.pdf">http://www.unicode.org/charts/PDF/U2460.pdf</a></description></item>
/// <item><description>Dingbats: <a href="http://www.unicode.org/charts/PDF/U2700.pdf">http://www.unicode.org/charts/PDF/U2700.pdf</a></description></item>
/// <item><description>Supplemental Punctuation: <a href="http://www.unicode.org/charts/PDF/U2E00.pdf">http://www.unicode.org/charts/PDF/U2E00.pdf</a></description></item>
/// <item><description>Alphabetic Presentation Forms: <a href="http://www.unicode.org/charts/PDF/UFB00.pdf">http://www.unicode.org/charts/PDF/UFB00.pdf</a></description></item>
/// <item><description>Halfwidth and Fullwidth Forms: <a href="http://www.unicode.org/charts/PDF/UFF00.pdf">http://www.unicode.org/charts/PDF/UFF00.pdf</a></description></item>
/// </ul>
/// <para/>
/// See: <a href="http://en.wikipedia.org/wiki/Latin_characters_in_Unicode">http://en.wikipedia.org/wiki/Latin_characters_in_Unicode</a>
/// <para/>
/// For example, '&agrave;' will be replaced by 'a'.
/// </summary>
public static partial class StringExtensions
{
/// <summary>
/// Converts characters above ASCII to their ASCII equivalents. For example,
/// accents are removed from accented characters.
/// </summary>
/// <param name="input"> The string of characters to fold </param>
/// <param name="length"> The length of the folded return string </param>
/// <returns> length of output </returns>
public static string FoldToASCII(this string input, int? length = null)
{
// See https://gist.github.com/andyraddatz/e6a396fb91856174d4e3f1bf2e10951c
}
}
Hope that helps someone else, this is the most robust solution I've found!

