VB.NET: Convert and correct corrupt UTF-8 text into ANSI?

Note: this page reproduces a Stack Overflow question and its answer, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me). Original Stack Overflow URL: http://stackoverflow.com/questions/22068189/

Date: 2020-09-17 16:55:08  Source: igfitidea

vb.net, utf-8, character-encoding, ansi

Asked by Gulbahar

I am not a professional developer, and I am having a problem converting to ANSI the Unicode text found in a legacy application that doesn't support Unicode.

Here's a sample of what a Unicode-encoded text looks like when displayed in that legacy application:

Ã chaque journÃ©e des quatre jours de colloque, entre 250 et 500 personnes sont venues assister en continu aux discussions de cette rencontre. Cette affluence, ainsi que la richesse et la variÃ©tÃ© des discussions engagÃ©es lors de ces confÃ©rences, confirment la nÃ©cessitÃ© d'un espace ouvert pour les pensÃ©es critiques dans le monde francophone, Ã l'universitÃ© et bien au-delÃ .

I notice the following things:

  • All diacritic characters are encoded as C3 ("Ã") + a second byte
  • The character "à" is wrongly encoded as C320 ("Ã ")
  • Windows' CharacterMap application says that "é" is "U+00E9" while the document contains C3A9 instead.
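
These observations are consistent with UTF-8 bytes being displayed as Windows-1252. The following is a minimal sketch, written in Python rather than VB.NET only because it makes the bytes easy to inspect; the variable names are illustrative and not part of the original post. It shows that U+00E9 is the code point of "é" while C3 A9 is the UTF-8 encoding of that code point, and that "à" encodes to C3 A0, whose second byte is a no-break space in Windows-1252.

```python
# Code point vs. UTF-8 bytes for "é": Character Map reports the code point,
# while a UTF-8 document stores the encoded form of that code point.
e_acute = "\u00e9"
print(hex(ord(e_acute)))                # the Unicode code point
print(e_acute.encode("utf-8").hex())    # the bytes stored in a UTF-8 file

# What a Windows-1252 viewer shows when handed those UTF-8 bytes.
print(e_acute.encode("utf-8").decode("cp1252"))

# Same exercise for "à": the second UTF-8 byte is A0, which Windows-1252
# renders as a no-break space, easily mistaken for (or flattened to) a space.
a_grave = "\u00e0"
a_bytes = a_grave.encode("utf-8")
shown = a_bytes.decode("cp1252")
print(a_bytes.hex())
print([hex(ord(c)) for c in shown])
```

If something along the way turns that no-break space into an ordinary space, the byte pair becomes C3 20, which is no longer valid UTF-8 and therefore cannot be decoded back to "à" without a repair step.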

I have a couple of questions:

  1. Why the difference between the document and CharacterMap: is the document encoded in something other than Unicode? For instance, why is é encoded as C3A9 instead of 00E9?

  2. I use the following VB.Net code to convert the document from Unicode to ANSI. How can I replace all occurrences of C320 with à?

    ' Requires Imports System.Text
    Dim Encw1252 As Encoding = Encoding.GetEncoding("windows-1252")
    Dim EncUTF8 As Encoding = Encoding.GetEncoding("utf-8")
    Dim Str As String
    ' Get the clipboard text's bytes via Encoding.Default (the system ANSI code page on .NET Framework),
    ' treat those bytes as UTF-8, and convert the result to windows-1252
    Str = Encw1252.GetString(Encoding.Convert(EncUTF8, Encw1252, Encoding.Default.GetBytes(Clipboard.GetText)))
    Clipboard.SetText(Str)
    

Answered by u4370109

(Answered in a question edit. Converted to a community wiki answer. See What is the appropriate action when the answer to a question is added to the question itself?)

The OP wrote:

For others' benefit, problem solved using the following code:

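' Requires Imports System.Text
Dim Encw1252 As Encoding = Encoding.GetEncoding("windows-1252")
Dim EncUTF8 As Encoding = Encoding.GetEncoding("utf-8")

Dim Str As String
Str = Clipboard.GetText
' Restore the no-break space (code 160) that appears as a plain space after "Ã"
Str = Str.Replace("Ã ", "Ã" & ChrW(160))
' Treat the ANSI bytes as UTF-8 and convert the result to windows-1252
Str = Encw1252.GetString(Encoding.Convert(EncUTF8, Encw1252, Encoding.Default.GetBytes(Str)))
Clipboard.SetText(Str)
MessageBox.Show(Str)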

In the Str.Replace() above, the second byte in the source is a space (hex 20) while the second byte in the target is a no-break space (decimal 160, hex A0).
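
The same repair can be sketched outside VB.NET to check the logic. This is a minimal Python illustration under stated assumptions, not the OP's code: `repair_mojibake` is a hypothetical helper name, and it assumes that every "Ã" followed by a plain space was originally "Ã" followed by a no-break space.

```python
def repair_mojibake(text: str) -> str:
    """Undo UTF-8 text that was displayed as Windows-1252 (hypothetical helper)."""
    # Assumption: a plain space right after "Ã" is a flattened no-break space
    # (U+00A0), so put it back before recovering the bytes.
    fixed = text.replace("\u00c3 ", "\u00c3\u00a0")
    # Encoding as Windows-1252 recovers the original bytes; decoding them as
    # UTF-8 yields the intended characters.
    return fixed.encode("cp1252").decode("utf-8")


# "journÃ©e, au-delÃ ." written with escapes so the bytes are unambiguous.
sample = "journ\u00c3\u00a9e, au-del\u00c3 ."
print(repair_mojibake(sample))
```

The replacement is a heuristic: a genuine "Ã" followed by a real space would be altered too, and text containing characters outside Windows-1252 would raise an encoding error, so it suits input known to be damaged in this specific way.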
