C#: How to convert a UTF-8 string to Unicode?

Question

Asked by remio

I have a string that displays UTF-8 encoded characters, and I want to convert it back to Unicode.

For now, my implementation is the following:

public static string DecodeFromUtf8(this string utf8String)
{
    // read the string as UTF-8 bytes.
    byte[] encodedBytes = Encoding.UTF8.GetBytes(utf8String);

    // convert them into unicode bytes.
    byte[] unicodeBytes = Encoding.Convert(Encoding.UTF8, Encoding.Unicode, encodedBytes);

    // builds the converted string.
    return Encoding.Unicode.GetString(encodedBytes);
}

I am playing with the word "déjà". I converted it into UTF-8 through this online tool, and so I started to test my method with the string "dÃ©jÃ ".

Unfortunately, with this implementation the string just remains the same.

Where am I wrong?

Answer 1

Accepted answer by bames53

So the issue is that UTF-8 code unit values have been stored as a sequence of 16-bit code units in a C# string. You simply need to verify that each code unit is within the range of a byte, copy those values into bytes, and then convert the new UTF-8 byte sequence into UTF-16.

public static string DecodeFromUtf8(this string utf8String)
{
    // copy the string as UTF-8 bytes.
    byte[] utf8Bytes = new byte[utf8String.Length];
    for (int i=0;i<utf8String.Length;++i) {
        //Debug.Assert( 0 <= utf8String[i] && utf8String[i] <= 255, "the char must be in byte's range");
        utf8Bytes[i] = (byte)utf8String[i];
    }

    return Encoding.UTF8.GetString(utf8Bytes,0,utf8Bytes.Length);
}

DecodeFromUtf8("d\u00C3\u00A9j\u00C3\u00A0"); // déjà
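
The range check is only a commented-out assertion in the method above. As a minimal sketch, assuming you would rather fail loudly than silently truncate a code unit, the same loop can throw when a char does not fit in a byte. The class and method names here (Utf8Repair, DecodeFromUtf8Strict) are hypothetical and not part of the original answer.

```csharp
using System;
using System.Text;

public static class Utf8Repair
{
    // Hypothetical stricter variant of DecodeFromUtf8: rejects any char above 0xFF
    // instead of truncating it with the (byte) cast.
    public static string DecodeFromUtf8Strict(string utf8String)
    {
        byte[] utf8Bytes = new byte[utf8String.Length];
        for (int i = 0; i < utf8String.Length; ++i)
        {
            char c = utf8String[i];
            if (c > 0xFF)
            {
                throw new ArgumentException(
                    "Code unit U+" + ((int)c).ToString("X4") + " at index " + i + " does not fit in a byte.",
                    nameof(utf8String));
            }
            utf8Bytes[i] = (byte)c;
        }

        // Decode the recovered bytes as UTF-8 into a normal (UTF-16) string.
        return Encoding.UTF8.GetString(utf8Bytes);
    }
}
```

Utf8Repair.DecodeFromUtf8Strict("d\u00C3\u00A9j\u00C3\u00A0") returns "déjà", while an input containing a char such as '\u2013' throws an ArgumentException. An exception of that kind suggests the string was mangled by something other than a plain byte-to-char copy.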

This is easy, but it would be best to find the root cause: the location where someone is copying UTF-8 code units into 16-bit code units. The likely culprit is somebody converting bytes into a C# string using the wrong encoding, e.g. Encoding.Default.GetString(utf8Bytes, 0, utf8Bytes.Length).

Alternatively, if you're sure you know the incorrect encoding which was used to produce the string, and that incorrect encoding transformation was lossless (usually the case if the incorrect encoding is a single byte encoding), then you can simply do the inverse encoding step to get the original UTF-8 data, and then you can do the correct conversion from UTF-8 bytes:

public static string UndoEncodingMistake(string mangledString, Encoding mistake, Encoding correction)
{
    // the inverse of `mistake.GetString(originalBytes);`
    byte[] originalBytes = mistake.GetBytes(mangledString);
    return correction.GetString(originalBytes);
}

UndoEncodingMistake("d\u00C3\u00A9j\u00C3\u00A0", Encoding.GetEncoding(1252), Encoding.UTF8);
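
To see the mistake and its inverse side by side, here is a small sketch that first reproduces the bad decoding and then undoes it. It assumes the wrong encoding was ISO-8859-1 (Latin-1), chosen only because it maps every byte value to a character and so round-trips without loss; the class name MojibakeDemo and the helper Mangle are hypothetical.

```csharp
using System;
using System.Text;

public static class MojibakeDemo
{
    // Reproduces the mistake: correct UTF-8 bytes decoded with the wrong encoding.
    public static string Mangle(string text, Encoding mistake)
    {
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(text);
        return mistake.GetString(utf8Bytes);
    }

    // Same logic as UndoEncodingMistake above: re-encode with the wrong encoding
    // to get the original bytes back, then decode them correctly.
    public static string UndoEncodingMistake(string mangledString, Encoding mistake, Encoding correction)
    {
        byte[] originalBytes = mistake.GetBytes(mangledString);
        return correction.GetString(originalBytes);
    }
}
```

With Encoding latin1 = Encoding.GetEncoding("iso-8859-1"), the call MojibakeDemo.Mangle("d\u00E9j\u00E0", latin1) produces a six-character string equal to "d\u00C3\u00A9j\u00C3\u00A0", and MojibakeDemo.UndoEncodingMistake(mangled, latin1, Encoding.UTF8) gives back the four-character "déjà".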

Answer 2

Answer by Hans Passant

I have a string that displays UTF-8 encoded characters

There is no such thing in .NET. The string class can only store strings in UTF-16 encoding. A UTF-8 encoded string can only exist as a byte[]. Trying to store bytes into a string will not come to a good end; UTF-8 uses byte values that don't have a valid Unicode codepoint. The content will be destroyed when the string is normalized. So it is already too late to recover the string by the time your DecodeFromUtf8() starts running.

Only handle UTF-8 encoded text with byte[]. And use UTF8Encoding.GetString() to convert it.
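
As a rough illustration of that split between byte[] and string, the sketch below encodes a string to UTF-8 bytes and decodes the bytes back. The class and method names (Utf8BytesDemo, ToUtf8Hex, FromUtf8) are made up for this example; only Encoding.UTF8 and BitConverter come from the framework.

```csharp
using System;
using System.Text;

public static class Utf8BytesDemo
{
    // Encodes a .NET (UTF-16) string to UTF-8 bytes and formats them as hex pairs.
    public static string ToUtf8Hex(string text)
    {
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(text);
        return BitConverter.ToString(utf8Bytes);
    }

    // Decodes UTF-8 bytes into a .NET (UTF-16) string.
    public static string FromUtf8(byte[] utf8Bytes)
    {
        return Encoding.UTF8.GetString(utf8Bytes);
    }
}
```

Utf8BytesDemo.ToUtf8Hex("d\u00E9j\u00E0") returns "64-C3-A9-6A-C3-A0": six bytes for a string whose Length is 4. Passing those six bytes to Utf8BytesDemo.FromUtf8 returns "déjà" again.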

Answer 3

Answer by Mark Tolonen

What you have seems to be a string incorrectly decoded from another encoding, likely code page 1252, which is the US Windows default. Here's how to reverse it, assuming no other loss. One loss not immediately apparent is the non-breaking space (U+00A0) at the end of your string, which is not displayed. Of course it would be better to read the data source correctly in the first place, but perhaps the data source was stored incorrectly to begin with.

using System;
using System.Text;

class Program
{
    static void Main(string[] args)
    {
        string junk = "dÃ©jÃ\xa0";  // Bad Unicode string

        // Turn string back to bytes using the original, incorrect encoding.
        byte[] bytes = Encoding.GetEncoding(1252).GetBytes(junk);

        // Use the correct encoding this time to convert back to a string.
        string good = Encoding.UTF8.GetString(bytes);
        Console.WriteLine(good);
    }
}

Result:

déjà

Answer 4

Answer by MEN

If you have a UTF-8 string where every byte is correct ('Ö' -> [195, 0], [150, 0]), you can use the following:

public static string Utf8ToUtf16(string utf8String)
{
    /***************************************************************
     * Every .NET string will store text with the UTF-16 encoding, *
     * known as Encoding.Unicode. Other encodings may exist as     *
     * Byte-Array or incorrectly stored with the UTF-16 encoding.  *
     *                                                             *
     * UTF-8 = 1 to 4 bytes per char                               *
     *    ["100" for the ansi 'd']                                 *
     *    ["206" and "186" for the Greek 'κ']                      *
     *                                                             *
     * UTF-16 = 2 or 4 bytes per char                              *
     *    ["100, 0" for the ansi 'd']                              *
     *    ["186, 3" for the Greek 'κ']                             *
     *                                                             *
     * UTF-8 inside UTF-16                                         *
     *    ["100, 0" for the ansi 'd']                              *
     *    ["206, 0" and "186, 0" for the Greek 'κ']                *
     *                                                             *
     * First we need to get the UTF-8 Byte-Array and remove all    *
     * 0 byte (binary 0) while doing so.                           *
     *                                                             *
     * Binary 0 means end of string on UTF-8 encoding while on     *
     * UTF-16 one binary 0 does not end the string. Only if there  *
     * are 2 binary 0, than the UTF-16 encoding will end the       *
     * string. Because of .NET we don't have to handle this.       *
     *                                                             *
     * After removing binary 0 and receiving the Byte-Array, we    *
     * can use the UTF-8 encoding to string method now to get a    *
     * UTF-16 string.                                              *
     *                                                             *
     ***************************************************************/

    // Get UTF-8 bytes and remove binary 0 bytes (filler)
    List<byte> utf8Bytes = new List<byte>(utf8String.Length);
    foreach (byte utf8Byte in utf8String)
    {
        // Remove binary 0 bytes (filler)
        if (utf8Byte > 0) {
            utf8Bytes.Add(utf8Byte);
        }
    }

    // Convert UTF-8 bytes to UTF-16 string
    return Encoding.UTF8.GetString(utf8Bytes.ToArray());
}

In my case the DLL result is a UTF-8 string too, but unfortunately the UTF-8 string is interpreted with UTF-16 encoding ('Ö' -> [195, 0], [19, 32]). So the byte 150, which is '–' in the ANSI code page, was converted to the UTF-16 '–', which is 8211 (U+2013). If you have this case too, you can use the following instead:

public static string Utf8ToUtf16(string utf8String)
{
    // Get UTF-8 bytes by reading each byte with ANSI encoding
    byte[] utf8Bytes = Encoding.Default.GetBytes(utf8String);

    // Convert UTF-8 bytes to UTF-16 bytes
    byte[] utf16Bytes = Encoding.Convert(Encoding.UTF8, Encoding.Unicode, utf8Bytes);

    // Return UTF-16 bytes as UTF-16 string
    return Encoding.Unicode.GetString(utf16Bytes);
}
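
Note that Encoding.Default depends on the runtime: on .NET Core and .NET 5+ it is always UTF-8, so the method above only undoes an ANSI decoding on .NET Framework running with a matching system code page. A hedged sketch that names the code page explicitly follows. It assumes the mistaken encoding was Windows-1252 and that CodePagesEncodingProvider is available (it ships with modern .NET; older targets need the System.Text.Encoding.CodePages package). The names Cp1252Demo and Repair are hypothetical.

```csharp
using System;
using System.Text;

public static class Cp1252Demo
{
    // Undoes a Windows-1252 decoding of UTF-8 data (assumed mistake: code page 1252).
    public static string Repair(string mangled)
    {
        // Code page 1252 is not available by default on .NET Core / .NET 5+;
        // registering the provider makes Encoding.GetEncoding(1252) work there.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding cp1252 = Encoding.GetEncoding(1252);

        // Map each char back to the byte it was decoded from, e.g. U+2013 back to 150.
        byte[] utf8Bytes = cp1252.GetBytes(mangled);

        // Decode the recovered bytes with the correct encoding.
        return Encoding.UTF8.GetString(utf8Bytes);
    }
}
```

Cp1252Demo.Repair("\u00C3\u2013") returns "Ö" (U+00D6): 'Ã' maps back to byte 195 and '–' to byte 150, and the pair 195, 150 is the UTF-8 encoding of 'Ö'. A plain (byte) cast cannot handle this input, because 8211 does not fit in a byte.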

Or the Native-Method:

[DllImport("kernel32.dll")]
private static extern Int32 MultiByteToWideChar(UInt32 CodePage, UInt32 dwFlags, [MarshalAs(UnmanagedType.LPStr)] String lpMultiByteStr, Int32 cbMultiByte, [Out, MarshalAs(UnmanagedType.LPWStr)] StringBuilder lpWideCharStr, Int32 cchWideChar);

public static string Utf8ToUtf16(string utf8String)
{
    Int32 iNewDataLen = MultiByteToWideChar(Convert.ToUInt32(Encoding.UTF8.CodePage), 0, utf8String, -1, null, 0);
    if (iNewDataLen > 1)
    {
        StringBuilder utf16String = new StringBuilder(iNewDataLen);
        MultiByteToWideChar(Convert.ToUInt32(Encoding.UTF8.CodePage), 0, utf8String, -1, utf16String, utf16String.Capacity);

        return utf16String.ToString();
    }
    else
    {
        return String.Empty;
    }
}

If you need it the other way around, see Utf16ToUtf8. Hope I could be of help.
