Converting special characters such as Ã¼ and Ãƒ back to their original Latin alphabet counterparts in C#

Question

Asked by Gga

I have been given an export from a MySQL database that seems to have had its encoding muddled somewhat over time. It contains a mix of HTML character codes such as &uuml; and more problematic characters representing the same letters, such as Ã¼ and Ãƒ. It is my task to bring some consistency back to the file and get everything into the correct Latin characters, e.g. ú and ó.

An example of the sort of string I am dealing with is

DesinfektionslÃƒÂ¶sungstÃƒÂ¼cher fÃƒÂ¼r FlÃƒÂ¤chen

Which should equate to

50 Tattoo Desinfektionsl ö    sungst ü    cher f ü    r Fl ä    chen
50 Tattoo Desinfektionsl ÃƒÂ¶ sungst ÃƒÂ¼ cher f ÃƒÂ¼ r Fl ÃƒÂ¤ chen

Is there a method available in C#/.NET 4.5 that would successfully re-encode the likes of Ã¼ and Ãƒ to UTF-8?

If not, what approach would be advisable?

Also, is the paragraph character ¶ in the above example string an actual paragraph character, or part of some other character combination?

I have created a lookup table, shown below, in case I need to do a find and replace; however, I am unsure how complete it is.

Ã‰ -> É
â€œ -> "
â€ -> "
?? -> ?
?? -> ?
??, 'é
Ã  -> à
Ãº -> ú
â€¢ -> -
?? -> ?
Ãµ -> õ
Ã­ -> í
Ã¢ -> â
Ã£ -> ã
Ãª -> ê
Ã¡ -> á
Ã© -> é
Ã³ -> ó
â€“ -> –
Ã§ -> ç
Âª -> ª
Âº -> º
Ã  -> à
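A table like the one above does not have to be typed by hand. As a rough sketch, assuming the damage is UTF-8 bytes misread as Windows-1252, each entry can be derived by encoding the intended character as UTF-8 and decoding those bytes as code page 1252. The names `MojibakeTable`, `Garble` and `Build` are made up for this illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

static class MojibakeTable
{
    // What `original` looks like after its UTF-8 bytes are misread as Windows-1252.
    public static string Garble(string original, Encoding win1252)
    {
        return win1252.GetString(Encoding.UTF8.GetBytes(original));
    }

    // Builds a garbled -> original lookup for every character in `characters`.
    public static Dictionary<string, string> Build(string characters)
    {
        // Needed on .NET Core / .NET 5+; on .NET Framework 4.5 remove this line,
        // because code page 1252 is available there without registration.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding win1252 = Encoding.GetEncoding(1252);

        var table = new Dictionary<string, string>();
        foreach (char c in characters)
        {
            string original = c.ToString();
            table[Garble(original, win1252)] = original;
        }
        return table;
    }

    static void Main()
    {
        // The character list is only an example; extend it to whatever the data may contain.
        foreach (KeyValuePair<string, string> pair in Build("áéíóúàêâãõçöüä"))
            Console.WriteLine(pair.Key + " -> " + pair.Value);
    }
}
```

For text that was garbled twice, the same `Garble` step would have to be applied twice to get the left-hand side.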

Answer 1

Accepted answer by Guffa

Well, first of all, as the data has been decoded using the wrong encoding, it's likely that some of the characters are impossible to recover. It looks like UTF-8 data that was incorrectly decoded using an 8-bit encoding.

There is no built-in method to recover data like this, because it's not something that you normally do. There is no reliable way to decode the data, because it's already broken.

What you can try is to run the wrong conversion in reverse: encode the string with the encoding that was wrongly used to decode it, then decode the resulting bytes as UTF-8:

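// Turn the garbled text back into bytes using the encoding that was wrongly applied...
byte[] data = Encoding.Default.GetBytes(input);
// ...then decode those bytes as the UTF-8 they originally were.
string output = Encoding.UTF8.GetString(data);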

Encoding.Default uses the current ANSI encoding of your system. You can try some different encodings there and see which one gives the best result.
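Trying several encodings can be automated. The following is only a sketch: the candidate code pages are a guess, and `EncodingProbe` and `TryRepair` are hypothetical names. It uses exception fallbacks so that a candidate which cannot represent the text, or which yields invalid UTF-8, is rejected instead of silently producing `?` characters.

```csharp
using System;
using System.Text;

static class EncodingProbe
{
    static EncodingProbe()
    {
        // Needed on .NET Core / .NET 5+; on .NET Framework 4.5 remove this line,
        // because the legacy code pages are available there without registration.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
    }

    // Re-encodes `input` with the given code page and decodes the bytes as strict UTF-8.
    // Returns null when the text cannot be round-tripped this way.
    public static string TryRepair(string input, int codePage)
    {
        try
        {
            Encoding wrong = Encoding.GetEncoding(
                codePage, EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback);
            var strictUtf8 = new UTF8Encoding(false, true); // no BOM, throw on invalid bytes
            return strictUtf8.GetString(wrong.GetBytes(input));
        }
        catch (ArgumentException)
        {
            // EncoderFallbackException and DecoderFallbackException both derive from ArgumentException.
            return null;
        }
    }

    static void Main()
    {
        int[] candidates = { 1252, 1250, 28591, 850, 437 }; // assumed candidates, adjust as needed
        foreach (int codePage in candidates)
        {
            string repaired = TryRepair("fÃ¼r", codePage);
            Console.WriteLine(codePage + ": " + (repaired ?? "(not reversible)"));
        }
    }
}
```

More than one candidate may succeed on a short sample, so it is worth testing with a string that contains as many different damaged characters as possible.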

Answer 2

Answer by el vis

It's probably a UTF-8 encoded string that was read as windows-1252.

As Guffa mentioned, the data has been corrupted.

Let's take a look at the bytes:
ö -> C3 B6 in UTF-8

In windows-1252: C3 -> Ã, B6 -> ¶

so ö -> Ã¶

What about all these "ƒÂ"?

ƒ -> 83, Â -> C2

Honestly, I don't know why they appear, but you can try erasing them and doing the conversion Guffa mentioned. Good luck.
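The byte values above can be checked with a few lines of code. This is a small sketch (`ByteView` and `Hex` are made-up names) that prints the UTF-8 bytes of ö and shows how they read in windows-1252. Repeating the step once more produces bytes 83 and C2, which would be consistent with the text having been misread twice rather than once.

```csharp
using System;
using System.Text;

static class ByteView
{
    // Hex dump of `text` in the given encoding, e.g. "C3-B6".
    public static string Hex(string text, Encoding encoding)
    {
        return BitConverter.ToString(encoding.GetBytes(text));
    }

    static void Main()
    {
        // Needed on .NET Core / .NET 5+; on .NET Framework 4.5 remove this line.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding win1252 = Encoding.GetEncoding(1252);

        // First misreading: the two UTF-8 bytes of ö become two separate characters.
        Console.WriteLine(Hex("ö", Encoding.UTF8));                       // C3-B6
        Console.WriteLine(win1252.GetString(new byte[] { 0xC3, 0xB6 }));  // Ã¶

        // Second misreading: each of those characters is again stored as two UTF-8 bytes.
        Console.WriteLine(Hex("Ã¶", Encoding.UTF8));                      // C3-83-C2-B6
        Console.WriteLine(win1252.GetString(new byte[] { 0xC3, 0x83, 0xC2, 0xB6 })); // ÃƒÂ¶
    }
}
```

How the non-ASCII characters show up on screen depends on the console's output encoding; the hex values do not.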

Answer 3

Answer by Esailija

The data is only partly unrecoverable, because the Windows-1252 encoding has 5 unassigned slots. Some variants of Windows-1252 fill these with control characters, but those don't survive being posted on Stack Overflow. If such a modified Windows-1252 was used, you can recover everything, as long as you don't lose the hidden control characters when copying and pasting.

There is also the non-breaking space character, which is usually dropped or turned into a regular space by copy-pasting, but that's not an issue when you deal with the bytes directly.

The chain of misencodings this string has gone through is:

UTF-8 -> Windows-1252 -> UTF-8 -> Windows-1252

To recover, here is an example:

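// Requires: using System; using System.Text;

// The doubly misencoded input from the question.
string a = "DesinfektionslÃƒÂ¶sungstÃƒÂ¼cher fÃƒÂ¼r FlÃƒÂ¤chen";
Encoding utf8 = Encoding.GetEncoding(65001);
Encoding win1252 = Encoding.GetEncoding(1252);

// Undo both rounds: encode as Windows-1252 and decode as UTF-8, twice.
string result = utf8.GetString(win1252.GetBytes(utf8.GetString(win1252.GetBytes(a))));

Console.WriteLine(result);
// Desinfektionslösungstücher für Flächen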
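When it is not known in advance how many times a string was misencoded, the same step can be repeated until it stops working. The sketch below is one possible way to do that, not a general solution: `MojibakeRepair`, `UndoOnce`, `Repair` and the `maxRounds` limit are names and choices made up for this example, and it assumes every round was UTF-8 misread as Windows-1252.

```csharp
using System;
using System.Text;

static class MojibakeRepair
{
    static readonly Encoding StrictUtf8 = new UTF8Encoding(false, true); // throw on invalid bytes
    static readonly Encoding Win1252;

    static MojibakeRepair()
    {
        // Needed on .NET Core / .NET 5+; on .NET Framework 4.5 remove this line.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Win1252 = Encoding.GetEncoding(
            1252, EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback);
    }

    // Undoes one "UTF-8 read as Windows-1252" step, or returns null if the text
    // cannot be encoded as Windows-1252 or the bytes are not valid UTF-8.
    public static string UndoOnce(string text)
    {
        try
        {
            return StrictUtf8.GetString(Win1252.GetBytes(text));
        }
        catch (ArgumentException) // base class of the encoder/decoder fallback exceptions
        {
            return null;
        }
    }

    // Repeats UndoOnce until it fails, changes nothing, or maxRounds is reached.
    public static string Repair(string text, int maxRounds = 3)
    {
        for (int round = 0; round < maxRounds; round++)
        {
            string next = UndoOnce(text);
            if (next == null || next == text) break;
            text = next;
        }
        return text;
    }

    static void Main()
    {
        Console.WriteLine(Repair("DesinfektionslÃƒÂ¶sungstÃƒÂ¼cher fÃƒÂ¼r FlÃƒÂ¤chen"));
    }
}
```

Because the check is all-or-nothing per string, a value that mixes clean and damaged characters is returned unchanged; in that case it may help to apply `Repair` to smaller pieces such as individual words or fields.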

Answer 4

Answer by Jorden van Foreest

Here you can find a more complete list:

http://bueltge.de/wp-content/download/wk/utf-8_kodierungen.pdf

Answer 5

Answer by Alhan Ozdemir

I've been troubled by this character problem before. My solution:

My .(cs)html file was saved as UTF-8; I converted it to UTF-8Y (UTF-8 with a BOM).
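If the conversion has to be done from code rather than in an editor, something along these lines should work. It is a minimal sketch with hypothetical names (`BomExample`, `SaveWithBom`, `HasUtf8Bom`); the BOM is the byte sequence EF BB BF at the start of the file.

```csharp
using System;
using System.IO;
using System.Text;

static class BomExample
{
    // Writes `text` as UTF-8 and asks the encoder to emit the byte order mark.
    public static void SaveWithBom(string path, string text)
    {
        File.WriteAllText(path, text, new UTF8Encoding(encoderShouldEmitUTF8Identifier: true));
    }

    // True when the file starts with the UTF-8 BOM (EF BB BF).
    public static bool HasUtf8Bom(string path)
    {
        byte[] head = new byte[3];
        using (FileStream stream = File.OpenRead(path))
        {
            int read = stream.Read(head, 0, head.Length);
            return read == 3 && head[0] == 0xEF && head[1] == 0xBB && head[2] == 0xBF;
        }
    }

    static void Main()
    {
        string path = Path.GetTempFileName();
        SaveWithBom(path, "für");
        Console.WriteLine(HasUtf8Bom(path));
        File.Delete(path);
    }
}
```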
