
Notice: this page reproduces a popular StackOverflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/280347/

Date: 2020-08-27 14:18:42  Source: igfitidea

How to convert Unicode string into a utf-8 or utf-16 string?

Tags: c++, unicode, utf-8, character-encoding, utf-16

Asked by user25749

How do I convert a Unicode string into a UTF-8 or UTF-16 string? My VS2005 project uses the Unicode character set, while the SQLite C/C++ interface provides

int sqlite3_open(
  const char *filename,   /* Database filename (UTF-8) */
  sqlite3 **ppDb          /* OUT: SQLite db handle */
);
int sqlite3_open16(
  const void *filename,   /* Database filename (UTF-16) */
  sqlite3 **ppDb          /* OUT: SQLite db handle */
);

for opening a database. How can I convert a string, CString, or wstring to UTF-8 or UTF-16?

Thanks very much!


Answered by 1800 INFORMATION

Use the WideCharToMultiByte function. Specify CP_UTF8 for the CodePage parameter.

CHAR buf[256]; // or whatever size fits your longest string
WideCharToMultiByte(
  CP_UTF8,         // target code page: UTF-8
  0,               // flags
  StringToConvert, // the string you have
  -1,              // length of the string - set -1 to indicate it is null terminated
  buf,             // output
  _countof(buf),   // size of the buffer in bytes - if you pass zero, the return value is the length required for the output buffer
  NULL,            // default character: must be NULL for CP_UTF8
  NULL             // used-default-char flag: must be NULL for CP_UTF8
);

Also, the default encoding for Unicode apps on Windows is UTF-16LE, so you might not need to perform any translation and can just use the second version, sqlite3_open16.
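A fixed 256-byte buffer can be too small for long strings. As the comment on the buffer size hints, calling WideCharToMultiByte with a zero-sized buffer returns the required length, so a two-call pattern avoids guessing. Here is a rough sketch of that idea. The helper name wide_to_utf8 is made up for illustration, and the non-Windows branch is only an assumed fallback (32-bit wchar_t, one code point per element) so that the snippet builds elsewhere.

```cpp
#include <cstddef>
#include <string>
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper (not from the original answer): wide string -> UTF-8.
std::string wide_to_utf8(const std::wstring& w) {
    if (w.empty()) return std::string();
#ifdef _WIN32
    const int len = static_cast<int>(w.size());
    // First call: a zero-sized output buffer makes the function return
    // the number of bytes required.
    const int needed = WideCharToMultiByte(CP_UTF8, 0, w.data(), len,
                                           NULL, 0, NULL, NULL);
    if (needed <= 0) return std::string();
    std::string out(static_cast<std::size_t>(needed), '\0');
    // Second call: convert into the correctly sized buffer.
    WideCharToMultiByte(CP_UTF8, 0, w.data(), len, &out[0], needed, NULL, NULL);
    return out;
#else
    // Assumption for non-Windows builds: wchar_t is 32 bits wide and each
    // element is a single Unicode code point (no surrogate handling here).
    std::string out;
    for (std::size_t i = 0; i < w.size(); ++i) {
        const unsigned long cp = static_cast<unsigned long>(w[i]);
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
#endif
}
```

The resulting string's c_str() could then be handed to sqlite3_open(), which expects a UTF-8 file name.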

Answered by Serge Wautier

Short answer:


No conversion is required if you use Unicode strings such as CString or wstring. Use sqlite3_open16(). You will have to make sure you pass a WCHAR pointer, cast to void *, to the API. (Seems lame! Even if this lib is cross-platform, I guess they could have defined a wide char type that depends on the platform and is less unfriendly than a void *.) For example, for a CString: (void*)(LPCWSTR)strFilename

The longer answer:


You don't have a Unicode string that you want to convert to UTF8 or UTF16. You have a Unicode string represented in your program using a given encoding: Unicode is not a binary representation per se. Encodings say how the Unicode code points (numerical values) are represented in memory (binary layout of the number). UTF8 and UTF16 are the most widely used encodings. They are very different though.


When a VS project says "Unicode charset", it actually means "characters are encoded as UTF16". Therefore, you can use sqlite3_open16() directly. No conversion required. Characters are stored in the WCHAR type (as opposed to char), which takes 16 bits. (It falls back on the standard C type wchar_t, which takes 16 bits on Win32. It might be different on other platforms. Thanks for the correction, Checkers.)
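Because the width of wchar_t differs between platforms, it can help to check it rather than assume it. The following is a small sketch, assuming a C++11 compiler (so newer than VS2005), that reports the width of wchar_t and uses char16_t where 16-bit code units are needed. The function names are only illustrative.

```cpp
#include <climits>
#include <cstddef>
#include <string>

// Number of bits in wchar_t on the current platform
// (16 on Win32, commonly 32 on Linux and macOS).
constexpr std::size_t wchar_bits() {
    return sizeof(wchar_t) * CHAR_BIT;
}

// char16_t is at least 16 bits wide everywhere, so std::u16string is a
// portable container for UTF-16 code units.
static_assert(sizeof(char16_t) * CHAR_BIT >= 16,
              "char16_t must be able to hold a UTF-16 code unit");

// Size in bytes of a UTF-16 string's payload, excluding the terminator.
std::size_t utf16_byte_size(const std::u16string& s) {
    return s.size() * sizeof(char16_t);
}
```

On Win32, wchar_bits() should report 16, which is why a wstring's buffer can go to sqlite3_open16() unchanged there.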

There's one more detail that you might want to pay attention to: UTF16 exists in 2 flavors: Big Endian and Little Endian. That's the byte ordering of these 16 bits. The function prototype you give for UTF16 doesn't say which ordering is used. But you're pretty safe assuming that sqlite uses the same endian-ness as Windows (Little Endian IIRC. I know the order but have always had problem with the names :-) ).

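To make the two byte orders concrete, here is a short sketch that splits one 16-bit code unit into bytes both ways. The function names are invented for this example. For what it's worth, the SQLite documentation describes the sqlite3_open16() file name as UTF-16 in the native byte order, which on Windows means little endian.

```cpp
#include <array>
#include <cstdint>

// Bytes of one UTF-16 code unit, least significant byte first (UTF-16LE).
std::array<std::uint8_t, 2> utf16le_bytes(char16_t unit) {
    return {{ static_cast<std::uint8_t>(unit & 0xFF),
              static_cast<std::uint8_t>((unit >> 8) & 0xFF) }};
}

// Bytes of one UTF-16 code unit, most significant byte first (UTF-16BE).
std::array<std::uint8_t, 2> utf16be_bytes(char16_t unit) {
    return {{ static_cast<std::uint8_t>((unit >> 8) & 0xFF),
              static_cast<std::uint8_t>(unit & 0xFF) }};
}
```

For the euro sign U+20AC, the little-endian layout is AC 20 and the big-endian layout is 20 AC.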

EDIT: Answer to comment by Checkers:


UTF16 uses 16-bit code units. Under Win32 (and only on Win32), wchar_t is used for such a storage unit. The trick is that some Unicode characters require a sequence of 2 such 16-bit code units. They are called Surrogate Pairs.
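As an illustration of surrogate pairs, this sketch recognizes the two halves and combines them back into the code point they encode. The ranges and the arithmetic come from the UTF-16 definition. The function names are just for this example.

```cpp
// A high (leading) surrogate is in the range D800..DBFF.
bool is_high_surrogate(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }

// A low (trailing) surrogate is in the range DC00..DFFF.
bool is_low_surrogate(char16_t u) { return u >= 0xDC00 && u <= 0xDFFF; }

// Combine a high/low surrogate pair into the code point it encodes.
// The caller is expected to have checked both halves first.
char32_t combine_surrogates(char16_t high, char16_t low) {
    return 0x10000
         + ((static_cast<char32_t>(high) - 0xD800) << 10)
         + (static_cast<char32_t>(low) - 0xDC00);
}
```

For example, the pair D83D DE00 combines to U+1F600, a character outside the 16-bit range.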

In the same way, UTF8 represents 1 character using a sequence of 1 to 4 bytes. Yet UTF8 is used with the char type.
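The 1 to 4 byte rule can be sketched as a small encoder for a single code point. This is a minimal illustration with a made-up function name, not a replacement for a tested library. It rejects surrogates and values above U+10FFFF, which are not valid on their own.

```cpp
#include <string>

// Append the UTF-8 encoding of one code point to `out`.
// Returns the number of bytes appended (1 to 4), or 0 if `cp` is not a
// valid Unicode scalar value.
int append_utf8(char32_t cp, std::string& out) {
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return 0;
    if (cp < 0x80) {                       // 0xxxxxxx
        out += static_cast<char>(cp);
        return 1;
    }
    if (cp < 0x800) {                      // 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                    // 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
        return 3;
    }
    // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    out += static_cast<char>(0xF0 | (cp >> 18));
    out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
    return 4;
}
```

Every byte lands in a plain char, which is why UTF8 text fits in an ordinary std::string.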

Answered by jalf

All the C++ string types are charset neutral. They just settle on a character width, and make no further assumptions. A wstring uses 16-bit characters on Windows, corresponding roughly to utf-16, but it still depends on what you store in the string. The wstring doesn't in any way enforce that the data you put in it must be valid utf16. Windows uses utf16 when UNICODE is defined though, so most likely your strings are already utf16, and you don't need to do anything.
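Since the string class itself checks nothing, validity has to be checked separately if it matters. A rough sketch of such a check, assuming C++11's std::u16string as the container and using an illustrative function name, could look like this. It only verifies that surrogates are correctly paired.

```cpp
#include <cstddef>
#include <string>

// Returns true if every surrogate code unit in `s` is part of a
// well-formed pair (a high surrogate immediately followed by a low one).
bool is_valid_utf16(const std::u16string& s) {
    for (std::size_t i = 0; i < s.size(); ++i) {
        const char16_t u = s[i];
        if (u >= 0xD800 && u <= 0xDBFF) {
            // High surrogate: the next unit must exist and be a low surrogate.
            if (i + 1 >= s.size()) return false;
            const char16_t next = s[i + 1];
            if (next < 0xDC00 || next > 0xDFFF) return false;
            ++i; // skip the low half of the pair
        } else if (u >= 0xDC00 && u <= 0xDFFF) {
            // Low surrogate without a preceding high surrogate.
            return false;
        }
    }
    return true;
}
```

A string holding a lone D83D unit is a perfectly legal u16string (or wstring on Windows), yet it is not valid utf16.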

A few others have suggested using the WideCharToMultiByte function, which is (one of) the way(s) to go to convert utf16 to utf8. But since sqlite can handle utf16, that shouldn't be necessary.


Answered by Johannes Schaub - litb

utf-8 and utf-16 are both "unicode" character encodings. What you are probably talking about is utf-32, which is a fixed-size character encoding. Maybe searching for

"Convert utf-32 into utf-8 or utf-16"

will give you some results or other papers on this.
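For the utf-16 half of that search, the conversion is short enough to sketch by hand. The example below assumes C++11 string types, and the function name is made up. Replacing invalid input with U+FFFD is one possible policy rather than the only one.

```cpp
#include <string>

// Convert a utf-32 string to utf-16. Code points above U+FFFF become a
// surrogate pair. Invalid values are replaced with U+FFFD.
std::u16string utf32_to_utf16(const std::u32string& in) {
    std::u16string out;
    for (char32_t cp : in) {
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            cp = 0xFFFD; // replacement character
        }
        if (cp < 0x10000) {
            out += static_cast<char16_t>(cp);
        } else {
            const char32_t v = cp - 0x10000;
            out += static_cast<char16_t>(0xD800 + (v >> 10));   // high surrogate
            out += static_cast<char16_t>(0xDC00 + (v & 0x3FF)); // low surrogate
        }
    }
    return out;
}
```

On Windows, the data() of the result has the same layout as a WCHAR string.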

Answered by Helstrom

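The simplest way to do this is to use CStringA. The CString class is a typedef for either CStringA (ANSI version) or CStringW (wide char version). Both of these classes have constructors to convert string types. Note that this conversion goes through the ANSI code page rather than UTF-8, so the result matches what sqlite3_open() expects only when the name is plain ASCII. I typically use: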


sqlite3_open(CStringA(L"MyWideCharFileName"), ...);