Converting special characters such as Ã¼ and Ãƒ back to their original Latin alphabet counterparts in C#
Original source: http://stackoverflow.com/questions/14980200/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Asked by Gga
I have been given an export from a MySQL database that seems to have had its encoding muddled somewhat over time and contains a mix of HTML character codes such as &uuml; and more problematic characters representing the same letters, such as Ã¼ and Ãƒ. It is my task to bring some consistency back to the file and get everything into the correct Latin characters, e.g. ú and ó.
An example of the sort of string I am dealing with is
DesinfektionslÃƒÂ¶sungstÃƒÂ¼cher fÃƒÂ¼r FlÃƒÂ¤chen
Which should equate to
50 Tattoo Desinfektionslösungstücher für Flächen
Is there a method available in C#/.NET 4.5 that would successfully re-encode the likes of Ã¼ and Ãƒ to UTF-8?
If not, what approach would be advisable?
Also, is the paragraph character ¶ in the above example string an actual paragraph character or part of some other character combination?
I have created a lookup table, shown below, in case I need to do a find and replace; however, I am unsure how complete it is.
Ã‰ -> É
â€œ -> "
â€ -> "
?? -> ?
?? -> ?
??, 'é
? -> à
Ãº -> ú
â€¢ -> -
?? -> ?
Ãµ -> õ
Ã­ -> í
Ã¢ -> â
Ã£ -> ã
Ãª -> ê
Ã¡ -> á
Ã© -> é
Ã³ -> ó
â€“ -> –
Ã§ -> ç
Âª -> ª
Âº -> º
? -> à
Accepted answer by Guffa
Well, first of all, as the data has been decoded using the wrong encoding, it's likely that some of the characters are impossible to recover. It looks like UTF-8 data that was incorrectly decoded using an 8-bit encoding.
There is no built-in method to recover data like this, because it's not something that you normally do. There is no reliable way to decode the data, because it's already broken.
What you can try is to reverse the faulty step: encode the data using the same wrong encoding again to get the bytes back, then decode those bytes as UTF-8:
byte[] data = Encoding.Default.GetBytes(input);
string output = Encoding.UTF8.GetString(data);
Encoding.Default uses the current ANSI encoding for your system. You can try some different encodings there and see which one gives the best result.
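Trying several encodings can be scripted. The following is only a sketch: the class name `EncodingProbe`, the helper `Reinterpret`, the sample string, and the list of candidate code pages are all assumptions made for illustration, not part of the answer. Note that on .NET Core and .NET 5+ `Encoding.Default` is always UTF-8, so naming the code page explicitly is the safer choice there.

```csharp
using System;
using System.Text;

static class EncodingProbe
{
    // Re-encode the damaged text with a guessed 8-bit encoding,
    // then decode the resulting bytes as UTF-8.
    public static string Reinterpret(string input, Encoding wrongEncoding)
    {
        byte[] data = wrongEncoding.GetBytes(input);
        return Encoding.UTF8.GetString(data);
    }

    static void Main()
    {
        // Required on .NET Core / .NET 5+ for code pages such as 1252;
        // remove this line on .NET Framework, where they are built in.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        string input = "fÃ¼r"; // hypothetical sample: "für" after one wrong decode

        // Candidate code pages are a guess; extend the list as needed.
        foreach (int codePage in new[] { 1252, 28591, 1250 })
        {
            Encoding candidate = Encoding.GetEncoding(codePage);
            Console.WriteLine(candidate.WebName + ": " + Reinterpret(input, candidate));
        }
    }
}
```

The output still has to be judged by eye; the candidate that produces readable text is most likely the encoding that caused the damage.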
Answer by el vis
It's probably a UTF-8 encoded string that was read as Windows-1252.
As Guffa mentioned, the data has been corrupted.
Let's take a look at the bytes:
ö -> C3 B6 in UTF-8
In Windows-1252: C3 -> Ã, B6 -> ¶
So ö -> Ã¶
What about all these "ƒÂ"?
ƒ -> 83, Â -> C2
Honestly, I don't know why they appear, but you can try erasing them and doing the conversion Guffa mentioned. Good luck.
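The extra 83 and C2 bytes are consistent with the wrong decode having happened twice rather than once. The sketch below reproduces the damage in the forward direction so the bytes can be inspected; the class and method names (`ByteView`, `Utf8Hex`, `MisreadAs1252`) are made up for this illustration.

```csharp
using System;
using System.Text;

static class ByteView
{
    // Hex dump of a string's UTF-8 bytes, e.g. "C3-B6".
    public static string Utf8Hex(string s)
    {
        return BitConverter.ToString(Encoding.UTF8.GetBytes(s));
    }

    // Simulate the damage: take the UTF-8 bytes and decode them as Windows-1252.
    public static string MisreadAs1252(string s)
    {
        return Encoding.GetEncoding(1252).GetString(Encoding.UTF8.GetBytes(s));
    }

    static void Main()
    {
        // Required on .NET Core / .NET 5+; remove on .NET Framework.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        string once = MisreadAs1252("ö");
        string twice = MisreadAs1252(once);

        Console.WriteLine(Utf8Hex("ö"));   // C3-B6
        Console.WriteLine(once);           // Ã¶
        Console.WriteLine(Utf8Hex(once));  // C3-83-C2-B6
        Console.WriteLine(twice);          // ÃƒÂ¶
    }
}
```

After the second round, Ã has become Ãƒ (C3 83) and ¶ has become Â¶ (C2 B6), which is where the ƒ and Â characters come from.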
Answer by Esailija
The data is only partly unrecoverable, because the Windows-1252 encoding has 5 unassigned slots. Some modifications of Windows-1252 fill these with control characters, but those don't make it into posts on Stack Overflow. If a modified Windows-1252 has been used, you can fully recover the data as long as you don't lose the hidden control characters when copying and pasting.
There is also the non-breaking space character, which is usually ignored or turned into a regular space when copying and pasting, but that's not an issue when you deal with the bytes directly.
The misencoding abuse this string has gone through is:
UTF-8 -> Windows-1252 -> UTF-8 -> Windows-1252
To recover, here is an example:
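using System;
using System.Text;

// Mojibake produced by the chain above from "Desinfektionslösungstücher für Flächen".
String a = "DesinfektionslÃƒÂ¶sungstÃƒÂ¼cher fÃƒÂ¼r FlÃƒÂ¤chen";
Encoding utf8 = Encoding.GetEncoding(65001);    // UTF-8
Encoding win1252 = Encoding.GetEncoding(1252);  // Windows-1252
// Undo both rounds: re-encode as Windows-1252, decode as UTF-8, twice.
string result = utf8.GetString(win1252.GetBytes(utf8.GetString(win1252.GetBytes(a))));
Console.WriteLine(result);
// Desinfektionslösungstücher für Flächen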
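When it is not known how many rounds of damage a string has been through, the same step can be repeated until it stops producing valid UTF-8. This is a sketch under assumptions, not part of the answer: the name `MojibakeRepair.Repair` and the `maxRounds` limit are hypothetical, and text that legitimately contains sequences such as "Ã¼" would be altered by it.

```csharp
using System;
using System.Text;

static class MojibakeRepair
{
    // Undo up to maxRounds of "UTF-8 bytes read as Windows-1252".
    // Stops as soon as a round fails to encode or is not valid UTF-8.
    public static string Repair(string input, int maxRounds)
    {
        // Exception fallbacks make both directions fail loudly instead of emitting '?'.
        Encoding win1252 = Encoding.GetEncoding(1252,
            EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback);
        Encoding strictUtf8 = new UTF8Encoding(false, true); // throws on invalid bytes

        string current = input;
        for (int i = 0; i < maxRounds; i++)
        {
            try
            {
                string next = strictUtf8.GetString(win1252.GetBytes(current));
                if (next == current) break; // plain ASCII, nothing left to undo
                current = next;
            }
            catch (ArgumentException)
            {
                // EncoderFallbackException and DecoderFallbackException both derive
                // from ArgumentException: the text is no longer mojibake, keep it.
                break;
            }
        }
        return current;
    }

    static void Main()
    {
        // Required on .NET Core / .NET 5+; remove on .NET Framework.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Console.WriteLine(Repair("fÃƒÂ¼r FlÃƒÂ¤chen", 5)); // für Flächen
    }
}
```

The strict decoder is what ends the loop: once the text is correct, its Windows-1252 bytes (for example FC for ü) are no longer valid UTF-8, so the next round throws and the last good value is returned.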
Answer by Jorden van Foreest
Here you can find a more complete list:
http://bueltge.de/wp-content/download/wk/utf-8_kodierungen.pdf
Answer by Alhan Ozdemir
I've been troubled by this character problem before. Solution:
My .(cs)html file was UTF-8; I converted to UTF-8Y (UTF-8 with a BOM).
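If the conversion needs to be done in code rather than in an editor, one possible sketch is shown below. The class and method names (`BomExample`, `SaveWithBom`) are hypothetical, and the sketch assumes the file is already valid UTF-8.

```csharp
using System.IO;
using System.Text;

static class BomExample
{
    // Rewrite a UTF-8 text file so that it starts with a byte order mark (EF BB BF).
    public static void SaveWithBom(string path)
    {
        // Reading with Encoding.UTF8 drops an existing BOM, so it is not doubled.
        string text = File.ReadAllText(path, Encoding.UTF8);
        // UTF8Encoding(true) emits the BOM when the file is written.
        File.WriteAllText(path, text, new UTF8Encoding(true));
    }
}
```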