How can you strip non-ASCII characters from a string? (in C#)

Question

Asked by philcruz

How can you strip non-ASCII characters from a string? (in C#)

Answer 1

Accepted answer by philcruz

string s = "søme string";
s = Regex.Replace(s, @"[^\u0000-\u007F]+", string.Empty);

Answer 2

Answered by bzlm

Here is a pure .NET solution that doesn't use regular expressions:

string inputString = "Räksmörgås";
string asAscii = Encoding.ASCII.GetString(
    Encoding.Convert(
        Encoding.UTF8,
        Encoding.GetEncoding(
            Encoding.ASCII.EncodingName,
            new EncoderReplacementFallback(string.Empty),
            new DecoderExceptionFallback()
            ),
        Encoding.UTF8.GetBytes(inputString)
    )
);

It may look cumbersome, but it should be intuitive. It uses the .NET ASCII encoding to convert a string. UTF-8 is used during the conversion because it can represent any of the original characters. An EncoderReplacementFallback converts any non-ASCII character to an empty string.
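
The same idea can be sketched more compactly by encoding straight from the string and skipping the UTF-8 round trip. This is a minimal sketch rather than bzlm's code: the names `AsciiStripper` and `StripNonAscii` are hypothetical, and it assumes the "us-ascii" encoding name resolves on your runtime.

```csharp
using System.Text;

public static class AsciiStripper
{
    // ASCII encoding whose encoder fallback replaces every unencodable
    // character with the empty string, which effectively drops it.
    private static readonly Encoding StrippingAscii = Encoding.GetEncoding(
        "us-ascii",
        new EncoderReplacementFallback(string.Empty),
        new DecoderExceptionFallback());

    // Hypothetical helper name, introduced for illustration only.
    public static string StripNonAscii(string input)
    {
        byte[] asciiBytes = StrippingAscii.GetBytes(input);
        return StrippingAscii.GetString(asciiBytes);
    }
}
```

With this sketch, `AsciiStripper.StripNonAscii("Räksmörgås")` should return `"Rksmrgs"`.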

Answer 3

Answered by Bent Rasmussen

Inspired by philcruz's regular expression solution, I've made a pure LINQ solution:

public static string PureAscii(this string source, char nil = ' ')
{
    var min = '\u0000';
    var max = '\u007F';
    return source.Select(c => c < min ? nil : c > max ? nil : c).ToText();
}

public static string ToText(this IEnumerable<char> source)
{
    var buffer = new StringBuilder();
    foreach (var c in source)
        buffer.Append(c);
    return buffer.ToString();
}

This is untested code.

Answer 4

Answered by sinelaw

If you want to convert accented Latin characters to their unaccented equivalents rather than strip them, take a look at this question: How do I translate 8bit characters into 7bit characters? (i.e. ü to U)

Answer 5

Answered by Anonymous coward

I used this regular expression:

    string s = "søme string";
    Regex regex = new Regex(@"[^a-zA-Z0-9\s]", RegexOptions.None);
    return regex.Replace(s, "");

Answer 6

Answered by MonsCamus

I found the following slightly altered range useful for parsing comment blocks out of a database. With it, you won't have to contend with tab and escape characters, which would upset a CSV field.

parsememo = Regex.Replace(parsememo, @"[^\u001F-\u007F]", string.Empty);

If you want to avoid other special characters or particular punctuation, check the ASCII table.

Answer 7

Answered by rjp

No need for regex; just use encoding:

sOutput = System.Text.Encoding.ASCII.GetString(System.Text.Encoding.ASCII.GetBytes(sInput));
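
One caveat worth checking before relying on this one-liner: with its default settings, `Encoding.ASCII` substitutes a question mark for each character it cannot encode rather than removing it. The sketch below contrasts the two behaviours. The class and method names are illustrative and not from the answer, and the stripping variant assumes a writable clone of `Encoding.ASCII` with an empty replacement fallback.

```csharp
using System.Text;

public static class AsciiRoundTrip
{
    // Default ASCII encoding: each unencodable character becomes '?'.
    public static string ReplaceWithQuestionMarks(string input) =>
        Encoding.ASCII.GetString(Encoding.ASCII.GetBytes(input));

    // Same round trip, but on a writable clone whose encoder fallback
    // is the empty string, so unencodable characters are dropped.
    public static string Strip(string input)
    {
        var ascii = (Encoding)Encoding.ASCII.Clone();
        ascii.EncoderFallback = new EncoderReplacementFallback(string.Empty);
        return ascii.GetString(ascii.GetBytes(input));
    }
}
```

For the input `"søme string"`, `ReplaceWithQuestionMarks` should give `"s?me string"` and `Strip` should give `"sme string"`.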

Answer 8

Answered by Josh

I believe MonsCamus meant:

parsememo = Regex.Replace(parsememo, @"[^\u0020-\u007E]", string.Empty);
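
To see why the range bounds matter: `\u0020-\u007E` is exactly the printable ASCII range (space through tilde), whereas `\u001F-\u007F` also lets through the Unit Separator (U+001F) and DEL (U+007F) control characters. A minimal sketch of the printable-only filter, using a hypothetical helper name:

```csharp
using System.Text.RegularExpressions;

public static class PrintableAscii
{
    // Matches any character outside space (U+0020) through tilde (U+007E).
    private static readonly Regex NonPrintable = new Regex(@"[^\u0020-\u007E]");

    // Hypothetical helper: removes every character the pattern matches.
    public static string Keep(string input) =>
        NonPrintable.Replace(input, string.Empty);
}
```

With this sketch, `PrintableAscii.Keep("a\tb\u007Fc\u00E9")` should return `"abc"`: the tab, DEL, and é are all removed. Note that newlines are removed as well, which may or may not be what you want for multi-line fields.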

Answer 9

Answered by Jeppe Stig Nielsen

This is not optimal performance-wise, but it is a pretty straightforward LINQ approach:

string strippedString = new string(
    yourString.Where(c => c <= sbyte.MaxValue).ToArray()
    );

The downside is that all the "surviving" characters are first put into an array of type char[], which is then thrown away after the string constructor no longer uses it.
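
If that intermediate array matters, one alternative is to append the surviving characters directly to a `StringBuilder`. This is a sketch under that assumption, with hypothetical names and no benchmarking behind it. It skips the LINQ iterator and the separate char[] copy, although `StringBuilder` still keeps its own internal buffer.

```csharp
using System.Text;

public static class AsciiFilter
{
    // Hypothetical helper name, introduced for illustration only.
    public static string StripNonAscii(string input)
    {
        // The input length is an upper bound on the result length.
        var buffer = new StringBuilder(input.Length);
        foreach (char c in input)
        {
            if (c <= '\u007F') // same test as c <= sbyte.MaxValue
                buffer.Append(c);
        }
        return buffer.ToString();
    }
}
```

With this sketch, `AsciiFilter.StripNonAscii("søme string")` should return `"sme string"`.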

Answer 10

Answered by Polynomial Proton

I came here looking for a solution for extended ASCII characters but couldn't find one. The closest I found is bzlm's solution, but that works only for ASCII codes up to 127. (You can obviously replace the encoding type in his code, but I found it a bit complex to understand, hence this version.) Here's a solution that works for extended ASCII codes, i.e. up to 255, which is ISO 8859-1:

It finds and strips out characters outside that range (code points greater than 255).

Dim str1 As String = "a, ??? or ?u? n?i?++$-?!??4?od;/?'?;?:?)///1!@#"

Dim extendedAscii As Encoding = Encoding.GetEncoding("ISO-8859-1",
                                                New EncoderReplacementFallback(String.Empty),
                                                New DecoderReplacementFallback())

Dim extendedAsciiBytes() As Byte = extendedAscii.GetBytes(str1)

Dim str2 As String = extendedAscii.GetString(extendedAsciiBytes)

Console.WriteLine(str2)
'Output : a, ??? or ?u ni++$-!??4od;/';:)///1!@#$%^yz:

Here's a working fiddle for the code

Replace the encoding as required; the rest should remain the same.
