VB.NET: Convert and correct corrupt UTF-8 text into ANSI?
Source: Stack Overflow, http://stackoverflow.com/questions/22068189/
Provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors.
Asked by Gulbahar
I am not a professional developer, and I am having a problem converting Unicode text to ANSI in a legacy application that doesn't support Unicode.
Here's a sample of what a Unicode-encoded text looks like when displayed in that legacy application:
Ã chaque journÃ©e des quatre jours de colloque, entre 250 et 500 personnes sont venues assister en continu aux discussions de cette rencontre. Cette affluence, ainsi que la richesse et la variÃ©tÃ© des discussions engagÃ©es lors de ces confÃ©rences, confirment la nÃ©cessitÃ© d'un espace ouvert pour les pensÃ©es critiques dans le monde francophone, Ã l'universitÃ© et bien au-delÃ .
I notice the following things:
- All diacritic characters are encoded as C3 ("Ã") + a second byte
- The character "à" is wrongly encoded as C320 ("Ã ")
- Windows' CharacterMap application says that "é" is "U+00E9" while the document contains C3A9 instead.
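These observations are consistent with UTF-8 bytes being displayed as Windows-1252: U+00E9 is the code point of é, while C3 A9 is how UTF-8 serializes that code point. A minimal Python sketch of the relationship follows (Python is used only for illustration, and the variable names are hypothetical):

```python
# Code point versus UTF-8 bytes for "é".
e_acute = "\u00e9"
code_point = f"U+{ord(e_acute):04X}"               # "U+00E9", what CharacterMap reports
utf8_hex = e_acute.encode("utf-8").hex().upper()   # "C3A9", what the document contains

# The same two bytes read as Windows-1252 become two characters: Ã and ©.
mojibake = e_acute.encode("utf-8").decode("cp1252")

# "à" is C3 A0 in UTF-8. A0 is a no-break space in Windows-1252,
# so it is easily flattened to an ordinary space (20), giving C3 20.
a_grave_hex = "\u00e0".encode("utf-8").hex().upper()   # "C3A0"
```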
I have a couple of questions:
- Why the difference between the document and CharacterMap: is the document encoded in something other than Unicode? For instance, why is é encoded as C3A9 instead of 00E9?
- I use the following VB.Net code to convert the document from Unicode to ANSI. How can I replace all occurrences of C320 with à?

    Dim Encw1252 As Encoding = Encoding.GetEncoding("windows-1252")
    Dim EncUTF8 As Encoding = Encoding.GetEncoding("utf-8")
    Dim Str As String
    Str = Encw1252.GetString(Encoding.Convert(EncUTF8, Encw1252, Encoding.Default.GetBytes(Clipboard.GetText)))
    Clipboard.SetText(Str)
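One likely reason the à case needs special handling: C3 A9 is a well-formed UTF-8 sequence, but C3 20 is not, because the lead byte C3 must be followed by a continuation byte in the range 80 to BF. The Python sketch below illustrates this at the byte level (it is not the VB.NET code path itself, and the names are hypothetical):

```python
good = bytes([0xC3, 0xA9])   # well-formed UTF-8 for "é"
bad = bytes([0xC3, 0x20])    # lead byte followed by a plain space

decoded_good = good.decode("utf-8")                    # "é"
# With errors="replace", the stray lead byte becomes U+FFFD and the space survives.
decoded_bad = bad.decode("utf-8", errors="replace")

# A strict decode rejects the malformed sequence outright.
try:
    bad.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False
```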
Answered by u4370109
(Answered in a question edit. Converted to a community wiki answer. See What is the appropriate action when the answer to a question is added to the question itself?)
The OP wrote:
For others' benefit, problem solved using the following code:

    ' Needs Imports System.Text (Encoding) and System.Windows.Forms (Clipboard, MessageBox)
    Dim Encw1252 As Encoding = Encoding.GetEncoding("windows-1252")
    Dim EncUTF8 As Encoding = Encoding.GetEncoding("utf-8")
    Dim Str As String
    Str = Clipboard.GetText
    ' Turn "Ã" + space (C3 20) back into "Ã" + no-break space (C3 A0)
    Str = Str.Replace("Ã ", "Ã" & ChrW(160))
    ' Reinterpret the ANSI bytes as UTF-8 and convert them to windows-1252
    Str = Encw1252.GetString(Encoding.Convert(EncUTF8, Encw1252, Encoding.Default.GetBytes(Str)))
    Clipboard.SetText(Str)
    MessageBox.Show(Str)

In the Str.Replace() above, the second byte in the source is a space (20) while the second byte in the target is "No break space" (160).
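The same two-step idea (restore the no-break space, then reinterpret the bytes as UTF-8) can be sketched in Python. This is an illustration under the assumption that the text was mis-decoded as Windows-1252; `repair_mojibake` is a hypothetical helper name, not part of any library:

```python
def repair_mojibake(text: str) -> str:
    """Undo UTF-8 text that was mis-decoded as Windows-1252."""
    # Restore the no-break space (A0) that was flattened to a space (20) after "Ã" (C3).
    patched = text.replace("\u00c3 ", "\u00c3\u00a0")
    # Recover the original bytes, then decode them as the UTF-8 they really are.
    return patched.encode("cp1252").decode("utf-8")


# "Ã" + two spaces stands for "à" followed by a real space in the damaged text.
sample = "\u00c3  l'universit\u00c3\u00a9 et bien au-del\u00c3 ."
fixed = repair_mojibake(sample)   # "à l'université et bien au-delà."
```

The replacement is a heuristic: it assumes every "Ã" followed by a space was originally "à", so text that legitimately contains that pair would be altered.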

