C#: How to fix the UTF encoding for whitespace?

Question

Asked by omega

In my C# code, I am extracting text from a PDF document. When I do that, I get a string that's in UTF-8 or Unicode encoding (I'm not sure which). When I use Encoding.UTF8.GetBytes(src); to convert it into a byte array, I notice that the whitespace is actually two bytes with values 194 and 160.

For example, the string "CLE action" looks like

[67, 76, 69, 194, 160, 65, 99, 116, 105, 111, 110]

in a byte array, where the whitespace is 194 and 160. Because of this, src.IndexOf("CLE action"); returns -1 when I need it to return 1.

How can I fix the encoding of the string?


Answer 1

Accepted answer by RichieHindle

194 160 is the UTF-8 encoding of the NO-BREAK SPACE codepoint, U+00A0 (the same codepoint that HTML calls &nbsp;).
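
To check this on your own machine, here is a minimal sketch that prints the UTF-8 bytes of a few strings. The class name NbspBytes and the helper Utf8Decimal are made up for illustration; only Encoding.UTF8.GetBytes comes from the question.

```csharp
using System;
using System.Text;

static class NbspBytes
{
    // Hypothetical helper: UTF-8 bytes of a string as a decimal list.
    public static string Utf8Decimal(string s) =>
        string.Join(", ", Encoding.UTF8.GetBytes(s));

    static void Main()
    {
        // U+00A0 (NO-BREAK SPACE) takes two bytes in UTF-8.
        Console.WriteLine(Utf8Decimal("\u00A0")); // 194, 160

        // An ordinary space (U+0020) takes one byte.
        Console.WriteLine(Utf8Decimal(" "));      // 32

        // A string with a no-break space between the two words.
        Console.WriteLine(Utf8Decimal("CLE\u00A0Action"));
        // 67, 76, 69, 194, 160, 65, 99, 116, 105, 111, 110
    }
}
```

Note that the string itself holds a single character for the no-break space; the two values 194 and 160 only appear once the string is encoded as UTF-8 bytes.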

So it's really not a space, even though it looks like one. (You'll see it won't word-wrap, for instance.) A regular expression match for \s would match it, but a plain comparison with a space won't.
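
The difference between the two kinds of check can be sketched as follows. The class and method names are illustrative assumptions; Regex.IsMatch and char.IsWhiteSpace are the standard .NET calls.

```csharp
using System;
using System.Text.RegularExpressions;

static class NbspChecks
{
    // Plain comparison against an ordinary space (U+0020).
    public static bool EqualsPlainSpace(char c) => c == ' ';

    // Regex whitespace class \s (default options, not ECMAScript mode).
    public static bool MatchesRegexWhitespace(char c) =>
        Regex.IsMatch(c.ToString(), @"\s");

    static void Main()
    {
        char nbsp = '\u00A0';

        Console.WriteLine(EqualsPlainSpace(nbsp));       // False
        Console.WriteLine(MatchesRegexWhitespace(nbsp)); // True
        Console.WriteLine(char.IsWhiteSpace(nbsp));      // True
    }
}
```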

To simply replace NO-BREAK SPACE characters with regular spaces, you can do the following:

src = src.Replace('\u00A0', ' ');
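
As a rough before-and-after sketch, the replacement makes the search from the question succeed. The sample text and the names NbspReplaceDemo and NormalizeNbsp are invented for illustration, and StringComparison.Ordinal is an assumption added here so the result does not depend on the current culture.

```csharp
using System;

static class NbspReplaceDemo
{
    // Hypothetical helper wrapping the one-line replacement.
    public static string NormalizeNbsp(string s) => s.Replace('\u00A0', ' ');

    static void Main()
    {
        // Sample text with a no-break space between "CLE" and "action".
        string src = "The CLE\u00A0action item";

        // Before: the search phrase contains an ordinary space, so no match.
        Console.WriteLine(src.IndexOf("CLE action", StringComparison.Ordinal)); // -1

        src = NormalizeNbsp(src);

        // After: the phrase is found at its position in the string.
        Console.WriteLine(src.IndexOf("CLE action", StringComparison.Ordinal)); // 4
    }
}
```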

Answer 2

Answer by Jonas Schäfer

Interpreting \xC2\xA0 (= 194, 160) as UTF-8 actually yields \xA0 (U+00A0), which is the Unicode non-breaking space. This is a different character from an ordinary space and thus doesn't match ordinary spaces. You have to match against the non-breaking space or use fuzzy matching against any whitespace.
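
One possible way to do the fuzzy matching, if you would rather leave the extracted text untouched, is to turn the search phrase into a regular expression that accepts any whitespace between words. This is only a sketch: the class FuzzySpaceSearch and the method IndexOfAnyWhitespace are hypothetical names, not part of any library.

```csharp
using System;
using System.Text.RegularExpressions;

static class FuzzySpaceSearch
{
    // Returns the index of the first match of `phrase` in `text`, treating
    // each run of spaces in `phrase` as "one or more whitespace characters"
    // (which includes U+00A0). Returns -1 if there is no match.
    public static int IndexOfAnyWhitespace(string text, string phrase)
    {
        string[] words = phrase.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);

        // Escape each word so regex metacharacters are matched literally.
        string[] escaped = Array.ConvertAll(words, w => Regex.Escape(w));

        string pattern = string.Join(@"\s+", escaped);
        Match m = Regex.Match(text, pattern);
        return m.Success ? m.Index : -1;
    }

    static void Main()
    {
        string src = "The CLE\u00A0action item";
        Console.WriteLine(IndexOfAnyWhitespace(src, "CLE action")); // 4
    }
}
```

The trade-off is that this also matches tabs and line breaks between the words, which may or may not be what you want for text extracted from a PDF.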

Answer 3

Answer by Kevin

In UTF-8, the byte sequence c2 a0 (194 160) encodes NO-BREAK SPACE. According to ISO/IEC 8859, this is a space that does not allow a line break to be inserted. Normally, text-processing software assumes that a line break can be inserted at any whitespace character (this is how word wrap is normally implemented). You should be able to simply replace those characters in your string with a normal space to fix the problem.
