Windows: Convert UTF-16 to UTF-8

Note: this page is a side-by-side translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/3082620/

Time: 2020-09-15 14:41:29  Source: igfitidea

Convert UTF-16 to UTF-8

c++ windows winapi unicode utf

Asked by Cheok Yan Cheng

I am currently using VC++ 2008 MFC. Because PostgreSQL doesn't support UTF-16 (the encoding Windows uses for Unicode), I need to convert strings from UTF-16 to UTF-8 before storing them.

Here is my code snippet.

// demo.cpp : Defines the entry point for the console application.
//

#include "stdafx.h"
#include "demo.h"
#include "Utils.h"
#include <iostream>

#ifdef _DEBUG
#define new DEBUG_NEW
#endif


// The one and only application object

CWinApp theApp;

using namespace std;

int _tmain(int argc, TCHAR* argv[], TCHAR* envp[])
{
    int nRetCode = 0;

    // initialize MFC and print an error on failure
    if (!AfxWinInit(::GetModuleHandle(NULL), NULL, ::GetCommandLine(), 0))
    {
        // TODO: change error code to suit your needs
        _tprintf(_T("Fatal Error: MFC initialization failed\n"));
        nRetCode = 1;
    }
    else
    {
        // TODO: code your application's behavior here.
    }

    CString utf16 = _T("Hello");
    std::cout << utf16.GetLength() << std::endl;
    CStringA utf8 = UTF8Util::ConvertUTF16ToUTF8(utf16);
    std::cout << utf8.GetLength() << std::endl;
    getchar();
    return nRetCode;
}

And here are the conversion functions.

namespace UTF8Util
{
//----------------------------------------------------------------------------
// FUNCTION: ConvertUTF8ToUTF16
// DESC: Converts Unicode UTF-8 text to Unicode UTF-16 (Windows default).
//----------------------------------------------------------------------------
CStringW ConvertUTF8ToUTF16( __in const CHAR * pszTextUTF8 )
{
    //
    // Special case of NULL or empty input string
    //
    if ( (pszTextUTF8 == NULL) || (*pszTextUTF8 == '\0') )
    {
        // Return empty string
        return L"";
    }

    //
    // Consider CHAR's count corresponding to total input string length,
    // including end-of-string (\0) character
    //
    const size_t cchUTF8Max = INT_MAX - 1;
    size_t cchUTF8;
    HRESULT hr = ::StringCchLengthA( pszTextUTF8, cchUTF8Max, &cchUTF8 );
    if ( FAILED( hr ) )
    {
        AtlThrow( hr );
    }

    // Consider also terminating \0
    ++cchUTF8;

    // Convert to 'int' for use with MultiByteToWideChar API
    int cbUTF8 = static_cast<int>( cchUTF8 );

    //
    // Get size of destination UTF-16 buffer, in WCHAR's
    //
    int cchUTF16 = ::MultiByteToWideChar(
        CP_UTF8,                // convert from UTF-8
        MB_ERR_INVALID_CHARS,   // error on invalid chars
        pszTextUTF8,            // source UTF-8 string
        cbUTF8,                 // total length of source UTF-8 string,
                                // in CHAR's (= bytes), including end-of-string \0
        NULL,                   // unused - no conversion done in this step
        0                       // request size of destination buffer, in WCHAR's
        );
    ATLASSERT( cchUTF16 != 0 );
    if ( cchUTF16 == 0 )
    {
        AtlThrowLastWin32();
    }

    //
    // Allocate destination buffer to store UTF-16 string
    //
    CStringW strUTF16;
    WCHAR * pszUTF16 = strUTF16.GetBuffer( cchUTF16 );

    //
    // Do the conversion from UTF-8 to UTF-16
    //
    int result = ::MultiByteToWideChar(
        CP_UTF8,                // convert from UTF-8
        MB_ERR_INVALID_CHARS,   // error on invalid chars
        pszTextUTF8,            // source UTF-8 string
        cbUTF8,                 // total length of source UTF-8 string,
                                // in CHAR's (= bytes), including end-of-string \0
        pszUTF16,               // destination buffer
        cchUTF16                // size of destination buffer, in WCHAR's
        );
    ATLASSERT( result != 0 );
    if ( result == 0 )
    {
        AtlThrowLastWin32();
    }

    // Release internal CString buffer
    strUTF16.ReleaseBuffer();

    // Return resulting UTF16 string
    return strUTF16;
}

//----------------------------------------------------------------------------
// FUNCTION: ConvertUTF16ToUTF8
// DESC: Converts Unicode UTF-16 (Windows default) text to Unicode UTF-8.
//----------------------------------------------------------------------------
CStringA ConvertUTF16ToUTF8( __in const WCHAR * pszTextUTF16 )
{
    //
    // Special case of NULL or empty input string
    //
    if ( (pszTextUTF16 == NULL) || (*pszTextUTF16 == L'\0') )
    {
        // Return empty string
        return "";
    }

    //
    // Consider WCHAR's count corresponding to total input string length,
    // including end-of-string (L'\0') character.
    //
    const size_t cchUTF16Max = INT_MAX - 1;
    size_t cchUTF16;
    HRESULT hr = ::StringCchLengthW( pszTextUTF16, cchUTF16Max, &cchUTF16 );
    if ( FAILED( hr ) )
    {
        AtlThrow( hr );
    }

    // Consider also terminating \0
    ++cchUTF16;

    //
    // WC_ERR_INVALID_CHARS flag is set to fail if invalid input character
    // is encountered.
    // This flag is supported on Windows Vista and later.
    // Don't use it on Windows XP and previous.
    //
#if (WINVER >= 0x0600)
    DWORD dwConversionFlags = WC_ERR_INVALID_CHARS;
#else
    DWORD dwConversionFlags = 0;
#endif

    //
    // Get size of destination UTF-8 buffer, in CHAR's (= bytes)
    //
    int cbUTF8 = ::WideCharToMultiByte(
        CP_UTF8,                        // convert to UTF-8
        dwConversionFlags,              // specify conversion behavior
        pszTextUTF16,                   // source UTF-16 string
        static_cast<int>( cchUTF16 ),   // total source string length, in WCHAR's,
                                        // including end-of-string \0
        NULL,                           // unused - no conversion required in this step
        0,                              // request buffer size
        NULL, NULL                      // unused
        );
    ATLASSERT( cbUTF8 != 0 );
    if ( cbUTF8 == 0 )
    {
        AtlThrowLastWin32();
    }

    //
    // Allocate destination buffer for UTF-8 string
    //
    CStringA strUTF8;
    int cchUTF8 = cbUTF8; // sizeof(CHAR) = 1 byte
    CHAR * pszUTF8 = strUTF8.GetBuffer( cchUTF8 );

    //
    // Do the conversion from UTF-16 to UTF-8
    //
    int result = ::WideCharToMultiByte(
        CP_UTF8,                        // convert to UTF-8
        dwConversionFlags,              // specify conversion behavior
        pszTextUTF16,                   // source UTF-16 string
        static_cast<int>( cchUTF16 ),   // total source string length, in WCHAR's,
                                        // including end-of-string \0
        pszUTF8,                        // destination buffer
        cbUTF8,                         // destination buffer size, in bytes
        NULL, NULL                      // unused
        );
    ATLASSERT( result != 0 );
    if ( result == 0 )
    {
        AtlThrowLastWin32();
    }

    // Release internal CString buffer
    strUTF8.ReleaseBuffer();

    // Return resulting UTF-8 string
    return strUTF8;
}

} // namespace UTF8Util

However, at runtime, the following assertion fails:

ATLASSERT( cbUTF8 != 0 );

This happens while trying to get the size of the destination UTF-8 buffer.

  1. What have I missed?
  2. If I test with Chinese characters, how can I verify that the resulting UTF-8 string is correct?

Answered by Rob

You can also use the ATL String Conversion Macros. To convert from UTF-16 to UTF-8, use CW2A and pass CP_UTF8 as the code page, e.g.:

CW2A utf8(buffer, CP_UTF8);
const char* data = utf8.m_psz;

Answered by Gunslinger47

The problem is that you specified the WC_ERR_INVALID_CHARS flag:

Windows Vista and later: Fail if an invalid input character is encountered. If this flag is not set, the function silently drops illegal code points. A call to GetLastError returns ERROR_NO_UNICODE_TRANSLATION. Note that this flag only applies when CodePage is specified as CP_UTF8 or 54936 (for Windows Vista and later). It cannot be used with other code page values.
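To make "invalid input character" more concrete: in UTF-16 text the usual example is an unpaired surrogate. The sketch below is a portable illustration of that check, not what WideCharToMultiByte does internally. The helper name IsWellFormedUtf16 is hypothetical, and the code assumes a C++11 compiler (for char16_t and std::u16string), which VC++ 2008 is not.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper (not from the original answer): returns true when every
// surrogate code unit in the UTF-16 text belongs to a valid high+low pair.
bool IsWellFormedUtf16(const std::u16string& text)
{
    for (std::size_t i = 0; i < text.size(); ++i)
    {
        const char16_t unit = text[i];
        if (unit >= 0xD800 && unit <= 0xDBFF)
        {
            // A high surrogate must be followed by a low surrogate.
            if (i + 1 >= text.size())
                return false;
            const char16_t next = text[i + 1];
            if (next < 0xDC00 || next > 0xDFFF)
                return false;
            ++i; // skip the low surrogate that completes the pair
        }
        else if (unit >= 0xDC00 && unit <= 0xDFFF)
        {
            // A low surrogate with no preceding high surrogate.
            return false;
        }
    }
    return true;
}
```

Running a check like this on the input before converting can help tell apart "my string really is malformed" from "the flag itself is the problem".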

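Your conversion function seems quite long. How does this one work for you?

//----------------------------------------------------------------------------
// FUNCTION: ConvertUTF16ToUTF8
// DESC: Converts Unicode UTF-16 (Windows default) text to Unicode UTF-8.
//----------------------------------------------------------------------------
CStringA ConvertUTF16ToUTF8( __in LPCWSTR pszTextUTF16 ) {
    if (pszTextUTF16 == NULL) return "";

    // Length of the input in WCHARs, excluding the terminating \0
    int utf16len = static_cast<int>(wcslen(pszTextUTF16));

    // First call: ask how many bytes the UTF-8 output needs
    int utf8len = WideCharToMultiByte(CP_UTF8, 0, pszTextUTF16, utf16len,
        NULL, 0, NULL, NULL );

    // Buffer with room for the terminating \0
    CArray<CHAR> buffer;
    buffer.SetSize(utf8len+1);
    buffer.SetAt(utf8len, '\0');

    // Second call: do the actual conversion
    WideCharToMultiByte(CP_UTF8, 0, pszTextUTF16, utf16len,
        buffer.GetData(), utf8len, 0, 0 );

    return buffer.GetData();
}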

I see you use a function called StringCchLengthW to get the required length of the output buffer. Most of the places I look recommend using the WideCharToMultiByte function itself to tell you how many CHARs it wants.
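The "ask the conversion function itself for the size" pattern mentioned above can be sketched without any Windows headers. In the example below, EncodeUtf8 is a hypothetical portable stand-in for WideCharToMultiByte(CP_UTF8, ...), not the real API: called with a null destination it only counts bytes, and called with a buffer it writes them. It assumes a C++11 compiler, and it replaces unpaired surrogates with U+FFFD, which is a choice made for this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for WideCharToMultiByte(CP_UTF8, ...).
// Returns the number of UTF-8 bytes needed; writes them only if dst != nullptr.
std::size_t EncodeUtf8(const std::u16string& src, char* dst)
{
    std::size_t n = 0;
    for (std::size_t i = 0; i < src.size(); ++i)
    {
        unsigned long cp = src[i];
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < src.size()
            && src[i + 1] >= 0xDC00 && src[i + 1] <= 0xDFFF)
        {
            // Combine a surrogate pair into one code point.
            cp = 0x10000 + ((cp - 0xD800) << 10) + (src[i + 1] - 0xDC00);
            ++i;
        }
        else if (cp >= 0xD800 && cp <= 0xDFFF)
        {
            cp = 0xFFFD; // unpaired surrogate: sketch-specific replacement
        }

        unsigned char buf[4];
        std::size_t len;
        if (cp < 0x80) {
            buf[0] = static_cast<unsigned char>(cp);
            len = 1;
        } else if (cp < 0x800) {
            buf[0] = static_cast<unsigned char>(0xC0 | (cp >> 6));
            buf[1] = static_cast<unsigned char>(0x80 | (cp & 0x3F));
            len = 2;
        } else if (cp < 0x10000) {
            buf[0] = static_cast<unsigned char>(0xE0 | (cp >> 12));
            buf[1] = static_cast<unsigned char>(0x80 | ((cp >> 6) & 0x3F));
            buf[2] = static_cast<unsigned char>(0x80 | (cp & 0x3F));
            len = 3;
        } else {
            buf[0] = static_cast<unsigned char>(0xF0 | (cp >> 18));
            buf[1] = static_cast<unsigned char>(0x80 | ((cp >> 12) & 0x3F));
            buf[2] = static_cast<unsigned char>(0x80 | ((cp >> 6) & 0x3F));
            buf[3] = static_cast<unsigned char>(0x80 | (cp & 0x3F));
            len = 4;
        }

        if (dst != nullptr)
            for (std::size_t k = 0; k < len; ++k)
                dst[n + k] = static_cast<char>(buf[k]);
        n += len;
    }
    return n;
}

// Two-pass conversion: measure first, then allocate exactly and convert.
std::string ToUtf8(const std::u16string& src)
{
    const std::size_t needed = EncodeUtf8(src, nullptr); // pass 1: size only
    std::string out(needed, '\0');
    if (needed != 0)
        EncodeUtf8(src, &out[0]);                        // pass 2: fill buffer
    return out;
}
```

The point of the two passes is that the caller never has to guess the output size: one UTF-16 code unit can become one, two, or three UTF-8 bytes, and a surrogate pair becomes four.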

Edit:
As Rob pointed out, you can use CW2A with the CP_UTF8 code page:

CStringA str = CW2A(wStr, CP_UTF8);

While I'm editing, I can answer your second question:

How can I verify the resultant UTF-8 string is correct?

Write it to a text file, then open it in Mozilla Firefox or an equivalent program. In the View menu, you can go to Character Encoding and switch manually to UTF-8 (assuming Firefox didn't guess it correctly to begin with). Compare it with a UTF-16 document with the same text and see if there are any differences.
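Another way to check the result, offered here as a sketch rather than as part of the original answer, is to dump the converted bytes in hex and compare them with the known UTF-8 encoding of a test character. For example, the Chinese character 中 (U+4E2D) should come out as the three bytes E4 B8 AD. HexDump below is a hypothetical helper name. It takes a std::string, so a CStringA result would first be copied into one.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: render each byte of a buffer as two uppercase hex
// digits, separated by single spaces.
std::string HexDump(const std::string& bytes)
{
    static const char kDigits[] = "0123456789ABCDEF";
    std::string out;
    for (std::size_t i = 0; i < bytes.size(); ++i)
    {
        const unsigned char b = static_cast<unsigned char>(bytes[i]);
        if (i != 0)
            out += ' ';
        out += kDigits[b >> 4];   // high nibble
        out += kDigits[b & 0x0F]; // low nibble
    }
    return out;
}
```

If the dump of a converted test string matches the expected byte sequence, the conversion is at least correct for that character. This avoids depending on how a viewer guesses the file encoding.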