C#: How do you encode and decode broken Chinese/Unicode characters?

Question

Asked by melaos

I've tried googling around but couldn't work out which charset the text below belongs to:

?…·??‰é?é???”￠?”?è￡????1???±??è???…￥è￡???

But after putting <meta http-equiv="Content-Type" Content="text/html; charset=utf-8"> and that string into an HTML file, I was able to view the Chinese characters properly:

具有靜電產生裝置之影像輸入裝置

So my question is:

What tools can I use to detect the character set of this text?
And how do I convert/encode/decode them properly in C#?

Update: For completeness' sake, I've updated this test.

[TestMethod]
public void TestMethod1()
{
    string encodedText = "?…·??‰é?é???”￠?”?è￡????1???±??è???…￥è￡???";
    Encoding utf8 = new UTF8Encoding();
    Encoding window1252 = Encoding.GetEncoding("Windows-1252");

    // Re-encode the garbled text as Windows-1252 to get the raw bytes back.
    byte[] postBytes = window1252.GetBytes(encodedText);

    // Decode those bytes as UTF-8.
    string decodedText = utf8.GetString(postBytes);
    string actualText = "具有靜電產生裝置之影像輸入裝置";
    Assert.AreEqual(actualText, decodedText);
}

Thanks.

Answer 1

Accepted answer by Mark Tolonen

When you save the "bad" string in a text file with a meta tag declaring the correct encoding, your text editor saves the file as Windows-1252, but the browser reads the file and interprets it as UTF-8. Since the "bad" string is UTF-8 bytes that were incorrectly decoded as Windows-1252, encoding the file as Windows-1252 and decoding it as UTF-8 reverses the process.

Here's an example:

using System.Text;
using System.Windows.Forms;

namespace Demo
{
    class Program
    {
        static void Main(string[] args)
        {
            string s = "具有靜電產生裝置之影像輸入裝置"; // Unicode
            Encoding Windows1252 = Encoding.GetEncoding("Windows-1252");
            Encoding Utf8 = Encoding.UTF8;
            byte[] utf8Bytes = Utf8.GetBytes(s); // Unicode -> UTF-8
            string badDecode = Windows1252.GetString(utf8Bytes); // Mis-decode as Windows-1252
            MessageBox.Show(badDecode,"Mis-decoded");  // Shows your garbage string.
            string goodDecode = Utf8.GetString(utf8Bytes); // Correctly decode as UTF-8
            MessageBox.Show(goodDecode, "Correctly decoded");

            // Recovering from bad decode...
            byte[] originalBytes = Windows1252.GetBytes(badDecode);
            goodDecode = Utf8.GetString(originalBytes);
            MessageBox.Show(goodDecode, "Re-decoded");
        }
    }
}

Even with correct decoding, you'll still need a font that supports the characters being displayed. If your default font doesn't support Chinese, you still might not see the correct characters.

The correct thing to do is figure out why the string you have was decoded as Windows-1252 in the first place. Sometimes, though, data in a database is stored incorrectly to begin with and you have to resort to these games to fix the problem.
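
If you do end up repairing stored data this way, it can help to make the round trip strict so that strings which cannot be recovered are reported instead of silently turning into "?" or U+FFFD. The following is only a sketch under some assumptions: the class and method names are made up for illustration, and it assumes .NET Core 3.x or .NET 5+, where code page 1252 is available only after registering CodePagesEncodingProvider.

```csharp
using System;
using System.Text;

// Hypothetical helper, not part of the answer above.
public static class MojibakeRepair
{
    // Tries to undo "UTF-8 bytes decoded as Windows-1252".
    // Returns false when the garbled text cannot be mapped back losslessly.
    public static bool TryRepair(string garbled, out string repaired)
    {
        // Assumption: modern .NET, where code page 1252 needs this provider.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        // Exception fallbacks make both steps throw instead of substituting characters.
        Encoding strict1252 = Encoding.GetEncoding(
            1252, EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback);
        Encoding strictUtf8 = new UTF8Encoding(
            encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);

        try
        {
            byte[] originalBytes = strict1252.GetBytes(garbled); // back to the raw bytes
            repaired = strictUtf8.GetString(originalBytes);      // decode them correctly
            return true;
        }
        catch (EncoderFallbackException)
        {
            // The text contains characters that do not exist in Windows-1252.
            repaired = string.Empty;
            return false;
        }
        catch (DecoderFallbackException)
        {
            // The recovered bytes are not valid UTF-8.
            repaired = string.Empty;
            return false;
        }
    }
}
```

Calling MojibakeRepair.TryRepair(badDecode, out string fixedText) on the mis-decoded string from the example should give back the original text. A sample that already contains literal "?" characters or characters outside Windows-1252, like the one pasted in the question, has lost information and would be rejected.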

Answer 2

Answered by lesderid

I'm not really sure what you mean, but I'm guessing you want to convert between a string and a byte array holding that string in a certain encoding. Let's assume the character encoding is called "FooBar":

This is how you encode and decode:

Encoding myEncoding = Encoding.GetEncoding("FooBar");
string myString = "lala";
byte[] myEncodedBytes = myEncoding.GetBytes(myString);
string myDecodedString = myEncoding.GetString(myEncodedBytes);

You can learn more about the Encoding class over at MSDN.
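
Note that "FooBar" is only a placeholder: Encoding.GetEncoding throws an ArgumentException for names it does not recognise. A small sketch of the same encode/decode round trip with a real encoding name could look like this. The class and method names are hypothetical, chosen only for this illustration.

```csharp
using System;
using System.Text;

// Hypothetical helper names, used only for illustration.
public static class EncodingLookup
{
    // Returns the named encoding, or the given fallback when the name is unknown.
    public static Encoding GetEncodingOrDefault(string name, Encoding fallback)
    {
        try
        {
            return Encoding.GetEncoding(name);
        }
        catch (ArgumentException)
        {
            // Thrown for unsupported names such as the placeholder "FooBar".
            return fallback;
        }
    }

    // Encodes the text to bytes and decodes it again with the same encoding.
    public static string RoundTrip(string text, Encoding encoding)
    {
        byte[] encodedBytes = encoding.GetBytes(text);
        return encoding.GetString(encodedBytes);
    }
}
```

A round trip only preserves the text when the encoding can represent every character. UTF-8 and UTF-16 can represent Chinese text, while an encoding such as ASCII would replace those characters.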

Answer 3

Answered by eyossi

Answering your question at the end of your post:

If you want to determine the text encoding at runtime, take a look at http://code.google.com/p/ude/
For converting between character sets you can use http://msdn.microsoft.com/en-us/library/system.text.encoding.convert(v=vs.100).aspx
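
As a rough illustration of the second link, Encoding.Convert takes a source encoding, a destination encoding and a byte array, and returns the re-encoded bytes. The wrapper below is a hypothetical sketch (the class and method names are not from any library) that transcodes UTF-8 bytes to UTF-16.

```csharp
using System.Text;

// Hypothetical wrapper, shown only to illustrate Encoding.Convert.
public static class TranscodeSketch
{
    // Re-encodes UTF-8 bytes as UTF-16 (little-endian) bytes.
    public static byte[] Utf8ToUtf16(byte[] utf8Bytes)
    {
        return Encoding.Convert(Encoding.UTF8, Encoding.Unicode, utf8Bytes);
    }
}
```

Keep in mind that this transcodes bytes whose encoding you already know. It does not by itself repair a string that was decoded with the wrong encoding earlier on.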

Answer 4

Answered by dda

It's Windows Latin 1 (Windows-1252). I pasted the Chinese text as UTF-8 into BBEdit (a text editor for Mac), re-opened the file as Windows Latin 1 and bang, the exact same diacritics appeared.
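
The same experiment can be approximated in code when you know what the text should say: encode the known-good text as UTF-8, decode those bytes with each candidate encoding, and see which result matches the garbled sample. This is just a sketch. The helper name and the candidate list are assumptions, and it again assumes .NET Core 3.x or .NET 5+ with CodePagesEncodingProvider.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper mirroring the "re-open as another encoding" experiment.
public static class MisreadFinder
{
    // Returns the candidate names that turn the UTF-8 bytes of 'original'
    // into exactly the 'garbled' text.
    public static List<string> FindMisreadEncodings(
        string original, string garbled, IEnumerable<string> candidateNames)
    {
        // Assumption: modern .NET, where Windows code pages need this provider.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        byte[] utf8Bytes = Encoding.UTF8.GetBytes(original);
        var matches = new List<string>();

        foreach (string name in candidateNames)
        {
            Encoding candidate;
            try
            {
                candidate = Encoding.GetEncoding(name);
            }
            catch (ArgumentException)
            {
                continue; // unknown encoding name, skip it
            }

            if (candidate.GetString(utf8Bytes) == garbled)
            {
                matches.Add(name);
            }
        }

        return matches;
    }
}
```

The comparison is exact, so it only works on a sample that has not lost characters along the way, for example one that was not pasted through a form that replaced bytes with "?".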

Answer 5

Answered by mesutpiskin

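string test = "敭畳灴獩楫n"; // Incoming data; it should read "mesutpiskin".

// Get the raw UTF-16 (little-endian) bytes behind the string.
byte[] bytes = Encoding.Unicode.GetBytes(test);

string s = string.Empty;

// Treat every byte as one single-byte character.
for (int i = 0; i < bytes.Length; i++)
{
    s += (char)bytes[i];
}

// Strip the NUL padding left over from the odd-length text.
s = s.Trim((char)0);

MessageBox.Show(s);
// s = "mesutpiskin"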
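
The loop above handles a different kind of breakage: single-byte text that was read as UTF-16, so each pair of bytes became one character. Casting a byte to char is the same mapping as ISO-8859-1, so the loop can probably be replaced by a second decode. A minimal sketch, with a made-up class name:

```csharp
using System.Text;

// Hypothetical helper; behaves like the byte-by-byte loop above.
public static class Utf16Misread
{
    public static string Fix(string misread)
    {
        // Recover the raw bytes that were wrongly interpreted as UTF-16 (little-endian).
        byte[] bytes = Encoding.Unicode.GetBytes(misread);

        // ISO-8859-1 maps each byte to the character with the same code point,
        // which is what the (char) cast in the loop does.
        string text = Encoding.GetEncoding("ISO-8859-1").GetString(bytes);

        // Remove NUL padding left over from odd-length text.
        return text.Trim('\0');
    }
}
```

If the original bytes were really UTF-8 or another encoding rather than plain ASCII, that encoding would have to be used for the second step instead.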
