C#: How to fix UTF encoding for whitespace?
Original question: http://stackoverflow.com/questions/13992934/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Asked by omega
In my C# code, I am extracting text from a PDF document. When I do that, I get a string that's in UTF-8 or Unicode encoding (I'm not sure which). When I use Encoding.UTF8.GetBytes(src); to convert it into a byte array, I notice that the whitespace is actually two bytes with values 194 and 160.
For example, the string "CLE Action" looks like
[67, 76, 69, 194, 160, 65, 99, 116, 105, 111, 110]
in a byte array, where the whitespace is 194 and 160. Because of this, src.IndexOf("CLE Action"); returns -1 when I need it to return 1.
How can I fix the encoding of the string?
Accepted answer by RichieHindle
194 160 is the UTF-8 encoding of a NO-BREAK SPACE codepoint (the same codepoint that HTML calls &nbsp;).
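The byte values can be checked directly. The following is a minimal sketch, assuming a plain console program; the class and method names (NbspBytesDemo, DecodeQuestionBytes) are made up for illustration and do not come from the question.

```csharp
using System;
using System.Text;

public static class NbspBytesDemo
{
    // Decodes the two bytes from the question (194, 160) as UTF-8.
    public static string DecodeQuestionBytes() =>
        Encoding.UTF8.GetString(new byte[] { 194, 160 });

    public static void Main()
    {
        string decoded = DecodeQuestionBytes();

        // The two bytes decode to a single character, U+00A0.
        Console.WriteLine(decoded.Length);                    // 1
        Console.WriteLine(((int)decoded[0]).ToString("X4"));  // 00A0

        // Encoding that character again gives back the same two bytes.
        byte[] bytes = Encoding.UTF8.GetBytes("\u00A0");
        Console.WriteLine(string.Join(", ", bytes));          // 194, 160
    }
}
```

So the string itself is not mis-encoded: it holds one valid character that simply is not U+0020.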
So it's really not a space, even though it looks like one. (You'll see it won't word-wrap, for instance.) A regular expression match for \s would match it, but a plain comparison with a space won't.
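A small sketch of that difference, assuming the default .NET regex options (with RegexOptions.ECMAScript, \s is limited to ASCII whitespace and would not match). The class and method names are illustrative only.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NbspMatchDemo
{
    // U+00A0 NO-BREAK SPACE, written as an escape so it is visible in source.
    public const char Nbsp = '\u00A0';

    // Plain comparison with an ordinary space (U+0020).
    public static bool IsPlainSpace(char c) => c == ' ';

    // Regular expression whitespace class, default options.
    public static bool MatchesRegexWhitespace(char c) =>
        Regex.IsMatch(c.ToString(), @"\s");

    public static void Main()
    {
        Console.WriteLine(IsPlainSpace(Nbsp));            // False
        Console.WriteLine(MatchesRegexWhitespace(Nbsp));  // True
        Console.WriteLine(char.IsWhiteSpace(Nbsp));       // True
    }
}
```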
To simply replace NO-BREAK spaces you can do the following:
src = src.Replace('\u00A0', ' ');
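To see the effect on the search from the question, here is a hedged sketch. The sample string is an assumption: it is built from the byte values in the question, with a hypothetical leading character so that the expected index is 1. StringComparison.Ordinal is used here so the result does not depend on the current culture; the question's code did not specify a comparison.

```csharp
using System;

public static class NbspReplaceDemo
{
    // Replaces every NO-BREAK SPACE (U+00A0) with an ordinary space (U+0020).
    public static string NormalizeNbsp(string text) => text.Replace('\u00A0', ' ');

    public static void Main()
    {
        // Hypothetical extracted text: one leading character, "CLE", U+00A0, "Action".
        string src = "-CLE\u00A0Action";

        // Before the replacement the search text (with a plain space) is not found.
        Console.WriteLine(src.IndexOf("CLE Action", StringComparison.Ordinal));  // -1

        src = NormalizeNbsp(src);

        // After the replacement it is found at index 1.
        Console.WriteLine(src.IndexOf("CLE Action", StringComparison.Ordinal));  // 1
    }
}
```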
Answer by Jonas Schäfer
Interpreting \xC2\xA0 (= 194, 160) as UTF-8 actually yields U+00A0 (\xA0), which is the Unicode non-breaking space. This is a different character from an ordinary space and thus doesn't match ordinary spaces. You have to match against the non-breaking space or use fuzzy matching against any whitespace.
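One possible way to do the fuzzy matching, sketched under assumptions: the helper below (IndexOfIgnoringWhitespaceKind is a hypothetical name, not a framework method) turns the search phrase into a regular expression in which every run of spaces becomes \s+, so any whitespace character in the text, including U+00A0, satisfies it.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WhitespaceTolerantSearch
{
    // Returns the index of the first match of `phrase` in `text`, treating each
    // run of ordinary spaces in `phrase` as "one or more whitespace characters".
    // Returns -1 when there is no match.
    public static int IndexOfIgnoringWhitespaceKind(string text, string phrase)
    {
        string[] words = phrase.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        for (int i = 0; i < words.Length; i++)
        {
            // Escape each word so regex metacharacters are matched literally.
            words[i] = Regex.Escape(words[i]);
        }
        string pattern = string.Join(@"\s+", words);

        Match match = Regex.Match(text, pattern);
        return match.Success ? match.Index : -1;
    }

    public static void Main()
    {
        // Hypothetical extracted text with a no-break space between the words.
        string src = "-CLE\u00A0Action";
        Console.WriteLine(IndexOfIgnoringWhitespaceKind(src, "CLE Action"));  // 1
    }
}
```

Unlike replacing the character, this leaves the extracted text untouched, which may matter if the no-break spaces need to be preserved elsewhere.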
Answer by Kevin
In UTF-8, the byte sequence c2 a0 (194 160) encodes NO-BREAK SPACE (U+00A0). According to ISO/IEC 8859, this is a space that does not allow a line break to be inserted. Text processing software normally assumes that a line break can be inserted at any whitespace character (this is how word wrap is usually implemented). You should be able to fix the problem by simply replacing these characters in your string with a normal space.