Original source: http://stackoverflow.com/questions/688760/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
How to create a UTF-8 string literal in Visual C++ 2008
Asked by brofield
In VC++ 2003, I could just save the source file as UTF-8 and all strings were used as is. In other words, the following code would print the strings as is to the console. If the source file was saved as UTF-8 then the output would be UTF-8.
printf("Chinese (Traditional)");
printf("中国語 (繁体)");
printf("중국어 (번체)");
printf("Chinês (Tradicional)");
I have saved the file in UTF-8 format with the UTF-8 BOM. However compiling with VC2008 results in:
warning C4566: character represented by universal-character-name '\uC911'
cannot be represented in the current code page (932)
warning C4566: character represented by universal-character-name '\uAD6D'
cannot be represented in the current code page (932)
etc.
The characters causing these warnings are corrupted. The ones that do fit the locale (in this case 932 = Japanese) are converted to the locale encoding, i.e. Shift-JIS.
I cannot find a way to get VC++ 2008 to compile this for me. Note that it doesn't matter what locale I use in the source file. There doesn't appear to be a locale that says "I know what I'm doing, so don't f$%##ng change my string literals". In particular, the useless UTF-8 pseudo-locale doesn't work.
#pragma setlocale(".65001")
=> error C2175: '.65001' : invalid locale
Neither does "C":
#pragma setlocale("C")
=> see warnings above (in particular locale is still 932)
It appears that VC2008 forces all characters into the specified (or default) locale, and that locale cannot be UTF-8. I do not want to change the file to use escape strings like "\xbf\x11..." because the same source is compiled using gcc which can quite happily deal with UTF-8 files.
Is there any way to specify that compilation of the source file should leave string literals untouched?
To ask it differently: what compile flags can I use to specify backward compatibility with VC2003 when compiling the source file? That is, do not change the string literals; use them byte for byte as they are.
Update
Thanks for the suggestions, but I want to avoid wchar. Since this app deals with strings in UTF-8 exclusively, using wchar would then require me to convert all strings back into UTF-8 which should be unnecessary. All input, output and internal processing is in UTF-8. It is a simple app that works fine as is on Linux and when compiled with VC2003. I want to be able to compile the same app with VC2008 and have it work.
For this to happen, I need VC2008 to not try to convert it to my local machine's locale (Japanese, 932). I want VC2008 to be backward compatible with VC2003. I want a locale or compiler setting that says strings are used as is, essentially as opaque arrays of char, or as UTF-8. It looks like I might be stuck with VC2003 and gcc, though; VC2008 is trying to be too smart in this instance.
Accepted answer by brofield
Update:
I've decided that there is no guaranteed way to do this. The solution that I present below works for English version VC2003, but fails when compiling with Japanese version VC2003 (or perhaps it is Japanese OS). In any case, it cannot be depended on to work. Note that even declaring everything as L"" strings didn't work (and is painful in gcc as described below).
Instead I believe that you just need to bite the bullet and move all text into a data file and load it from there. I am now storing and accessing the text in INI files via SimpleIni (a cross-platform INI-file library). At least there is a guarantee that it works, as all text is out of the program.
Original:
I'm answering this myself since only Evan appeared to understand the problem. The answers regarding what Unicode is and how to use wchar_t are not relevant for this problem, as this is not about internationalization, nor about a misunderstanding of Unicode or character encodings. I appreciate your attempt to help though; apologies if I wasn't clear enough.
The problem is that I have source files that need to be cross-compiled under a variety of platforms and compilers. The program does UTF-8 processing. It doesn't care about any other encodings. I want to have string literals in UTF-8, as currently works with gcc and vc2003. How do I do it with VC2008? (i.e. a backward compatible solution).
This is what I have found:
gcc (v4.3.2 20081105):
- string literals are used as is (raw strings)
- supports UTF-8 encoded source files
- source files must not have a UTF-8 BOM
vc2003:
- string literals are used as is (raw strings)
- supports UTF-8 encoded source files
- source files may or may not have a UTF-8 BOM (it doesn't matter)
vc2005+:
- string literals are massaged by the compiler (no raw strings)
- char string literals are re-encoded to a specified locale
- UTF-8 is not supported as a target locale
- source files must have a UTF-8 BOM
So, the simple answer is that for this particular purpose, VC2005+ is broken and does not supply a backward compatible compile path. The only way to get Unicode strings into the compiled program is via UTF-8 + BOM + wchar which means that I need to convert all strings back to UTF-8 at time of use.
There isn't any simple cross-platform method of converting wchar to UTF-8. For instance, what size and encoding is the wchar in? On Windows, UTF-16. On other platforms? It varies. See the ICU project for some details.
In the end I decided that I will avoid the conversion cost on all compilers other than vc2005+ with source like the following.
#if defined(_MSC_VER) && _MSC_VER > 1310
// Visual C++ 2005 and later require the source files in UTF-8, and all strings
// to be encoded as wchar_t, otherwise the strings will be converted into the
// local multibyte encoding and cause errors. To use a wchar_t as UTF-8, these
// strings then need to be converted back to UTF-8. This function is just a rough
// example of how to do this.
# include <windows.h>  // WideCharToMultiByte, CP_UTF8
# define utf8(str) ConvertToUTF8(L##str)
const char * ConvertToUTF8(const wchar_t * pStr) {
    static char szBuf[1024];
    WideCharToMultiByte(CP_UTF8, 0, pStr, -1, szBuf, sizeof(szBuf), NULL, NULL);
    return szBuf;
}
#else
// Visual C++ 2003 and gcc will use the string literals as is, so the files
// should be saved as UTF-8. gcc requires the files to not have a UTF-8 BOM.
# define utf8(str) str
#endif
Note that this code is just a simplified example. Production use would need to clean it up in a variety of ways (thread-safety, error checking, buffer size checks, etc).
This is used like the following code. It compiles cleanly and works correctly in my tests on gcc, vc2003, and vc2008:
std::string mText;
mText = utf8("Chinese (Traditional)");
mText = utf8("中国語 (繁体)");
mText = utf8("중국어 (번체)");
mText = utf8("Chinês (Tradicional)");
Answered by brofield
Brofield,
I had the exact same problem and just stumbled on a solution that doesn't require converting your source strings to wide chars and back: save your source file as UTF-8 without signature and VC2008 will leave it alone. It worked great once I figured out to drop the signature. To sum up:
Unicode (UTF-8 without signature) - Codepage 65001 doesn't throw the C4566 warning in VC2008 and doesn't cause VC to mess with the encoding, while Unicode (UTF-8 with signature) - Codepage 65001 does throw C4566 (as you have found).
Hope that's not too late to help you, but it might speed up your VC2008 app to remove your workaround.
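Whether a source file still carries the UTF-8 signature can be checked by looking at its first three bytes (EF BB BF). The following is only a minimal sketch of such a check; the helper names `HasUtf8Bom` and `StripUtf8Bom` are hypothetical and are not part of Visual Studio or of this answer.

```cpp
#include <string>

// Hypothetical helper: true if the buffer starts with the UTF-8
// signature (byte-order mark) EF BB BF.
bool HasUtf8Bom(const std::string &bytes)
{
    return bytes.size() >= 3 &&
           static_cast<unsigned char>(bytes[0]) == 0xEF &&
           static_cast<unsigned char>(bytes[1]) == 0xBB &&
           static_cast<unsigned char>(bytes[2]) == 0xBF;
}

// Hypothetical helper: return the buffer without a leading signature;
// buffers that have no signature are returned unchanged.
std::string StripUtf8Bom(const std::string &bytes)
{
    return HasUtf8Bom(bytes) ? bytes.substr(3) : bytes;
}
```

A build step could read each source file in binary mode, pass the contents through `StripUtf8Bom`, and write the result back, which should have the same effect as saving the file "without signature" in the editor.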
Answered by Evan Teran
While it is probably better to use wide strings and then convert to UTF-8 as needed, I think your best bet is, as you have mentioned, to use hex escapes in the strings. Suppose you wanted code point \uC911; you could just do this:
const char *str = "\xEC\xA4\x91";
I believe this will work just fine, just isn't very readable, so if you do this, please comment it to explain.
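The three bytes in that escape follow from the UTF-8 encoding rules in RFC 3629. As a rough sketch of how they are derived (the function name `EncodeUtf8` is hypothetical, and the code does not reject surrogates or values above U+10FFFF):

```cpp
#include <string>

// Hypothetical helper: encode one Unicode code point as UTF-8 bytes.
std::string EncodeUtf8(unsigned long cp)
{
    std::string out;
    if (cp < 0x80) {                 // 1 byte:  0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {         // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {       // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                         // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For U+C911 this yields the bytes EC A4 91, which is the sequence written out by hand in the escape above.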
Answered by Vladius
File/Advanced Save Options/Encoding: "Unicode (UTF-8 without signature) - Codepage 65001"
Answered by Henrik Haftmann
Visual C++ (2005+) COMPILER standard behaviour for source files is:
- CP1252 (for this example, Western-European code page):
  - "Ä" → C4 00
  - 'Ä' → C4
  - L"Ä" → 00C4 0000
  - L'Ä' → 00C4
- UTF-8 without BOM:
  - "Ä" → C3 84 00 (= UTF-8)
  - 'Ä' → warning: multi-character constant
  - "Ω" → E2 84 A6 00 (= UTF-8, as expected)
  - L"Ä" → 00C3 0084 0000 (wrong!)
  - L'Ä' → warning: multi-character constant
  - L"Ω" → 00E2 0084 00A6 0000 (wrong!)
- UTF-8 with BOM:
  - "Ä" → C4 00 (= CP1252, no more UTF-8)
  - 'Ä' → C4
  - "Ω" → error: cannot convert to CP1252!
  - L"Ä" → 00C4 0000 (correct)
  - L'Ä' → 00C4
  - L"Ω" → 2126 0000 (correct)
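Byte sequences like the ones listed above can be checked on a given compiler by dumping the storage of a literal. This is only a small sketch for that purpose; the helper name `HexBytes` is hypothetical.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: format `len` bytes starting at `data` as
// space-separated upper-case hex pairs, e.g. "C3 84 00".
std::string HexBytes(const void *data, std::size_t len)
{
    static const char kHex[] = "0123456789ABCDEF";
    const unsigned char *p = static_cast<const unsigned char *>(data);
    std::string out;
    for (std::size_t i = 0; i < len; ++i) {
        if (i != 0) out += ' ';
        out += kHex[p[i] >> 4];      // high nibble
        out += kHex[p[i] & 0x0F];    // low nibble
    }
    return out;
}
```

Calling `HexBytes("Ä", sizeof("Ä"))` in a source file shows which encoding the compiler actually stored for that literal. For wide literals the bytes appear in memory order, so on a little-endian machine the 16-bit unit 00C4 is printed as `C4 00`.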
You see, the C compiler handles UTF-8 files without BOM the same way as CP1252. As a result, it is impossible for the compiler to intermix UTF-8 and UTF-16 strings into the compiled output! So you have to decide for one source code file:
- either use UTF-8 with BOM and generate UTF-16 strings only (i.e. always use the L prefix),
- or use UTF-8 without BOM and generate UTF-8 strings only (i.e. never use the L prefix).
- 7-bit ASCII characters are not involved and can be used with or without the L prefix.
Independently, the EDITOR can auto-detect UTF-8 files without BOM as UTF-8 files.
Answered by Alexander Jung
From a comment to this very nice blog
"Using UTF-8 as the internal representation for strings in C and C++ with Visual Studio"
=> http://www.nubaria.com/en/blog/?p=289
#pragma execution_character_set("utf-8")
It requires Visual Studio 2008 SP1, and the following hotfix:
Answered by Martin Liversage
How about this? You store the strings in a UTF-8 encoded file and then preprocess them into an ASCII encoded C++ source file. You keep the UTF-8 encoding inside the string by using hexadecimal escapes. The string
"中国語 (繁体)"
is converted to
"\xE4\xB8\xAD\xE5\x9B\xBD\xE8\xAA\x9E (\xE7\xB9\x81\xE4\xBD\x93)"
Of course this is unreadable by any human, and the purpose is just to avoid problems with the compiler.
You could either use the C++ preprocessor to reference the strings in the converted header file, or you could convert your entire UTF-8 source into ASCII before compilation using this trick.
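The preprocessing step described in this answer could look roughly like the sketch below. The function name `EscapeNonAscii` is hypothetical, and the sketch assumes its input is the body of a string literal exactly as written in the UTF-8 source, so existing escapes such as `\"` are passed through untouched.

```cpp
#include <string>

// Hypothetical helper: rewrite the body of a string literal so that every
// non-ASCII byte becomes a \xHH escape and the result is pure ASCII.
std::string EscapeNonAscii(const std::string &utf8)
{
    static const char kHex[] = "0123456789ABCDEF";
    std::string out;
    bool lastWasEscape = false;
    for (std::string::size_type i = 0; i < utf8.size(); ++i) {
        const unsigned char c = static_cast<unsigned char>(utf8[i]);
        if (c >= 0x80) {
            out += "\\x";
            out += kHex[c >> 4];
            out += kHex[c & 0x0F];
            lastWasEscape = true;
        } else {
            // A hex escape swallows any hex digits that follow it, so close
            // and reopen the literal before such a character.
            const bool isHexDigit = (c >= '0' && c <= '9') ||
                                    (c >= 'a' && c <= 'f') ||
                                    (c >= 'A' && c <= 'F');
            if (lastWasEscape && isHexDigit) out += "\" \"";
            out += static_cast<char>(c);
            lastWasEscape = false;
        }
    }
    return out;
}
```

The split into adjacent literals matters because the compiler would otherwise read an escape followed by a letter such as `a` as one longer hex escape; adjacent string literals are concatenated by the compiler, so the resulting bytes stay the same.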
Answered by Michael J
A portable conversion from whatever native encoding you have is straightforward using std::ctype<wchar_t>::widen().
#include <locale>
#include <string>
#include <vector>
/////////////////////////////////////////////////////////
// NativeToUtf16 - Convert a string from the native
// encoding to Unicode UTF-16
// Parameters:
// sNative (in): Input String
// Returns: Converted string
/////////////////////////////////////////////////////////
std::wstring NativeToUtf16(const std::string &sNative)
{
    std::locale locNative;
    // widen() writes exactly one wchar_t per input char, so the
    // UTF-16 will never be longer than the input string
    std::vector<wchar_t> vUtf16(1+sNative.length());
    // convert
    std::use_facet< std::ctype<wchar_t> >(locNative).widen(
        sNative.c_str(),
        sNative.c_str()+sNative.length(),
        &vUtf16[0]);
    // drop the spare trailing element so the result has no embedded Nul
    return std::wstring(vUtf16.begin(), vUtf16.end()-1);
}
In theory, the return journey, from UTF-16 to UTF-8 should be similarly easy, but I found that the UTF-8 locales do not work properly on my system (VC10 Express on Win7).
Thus I wrote a simple converter based on RFC 3629.
/////////////////////////////////////////////////////////
// Utf16ToUtf8 - Convert a character from UTF-16
// encoding to UTF-8.
// NB: Does not handle Surrogate pairs.
// Does not test for badly formed
// UTF-16
// Parameters:
// chUtf16 (in): Input char
// Returns: UTF-8 version as a string
/////////////////////////////////////////////////////////
std::string Utf16ToUtf8(wchar_t chUtf16)
{
    // From RFC 3629
    // 0000 0000-0000 007F   0xxxxxxx
    // 0000 0080-0000 07FF   110xxxxx 10xxxxxx
    // 0000 0800-0000 FFFF   1110xxxx 10xxxxxx 10xxxxxx
    // max output length is 3 bytes (plus one for Nul)
    unsigned char szUtf8[4] = "";
    if (chUtf16 < 0x80)
    {
        szUtf8[0] = static_cast<unsigned char>(chUtf16);
    }
    else if (chUtf16 < 0x800)  // the two-byte range includes 0x7FF
    {
        szUtf8[0] = static_cast<unsigned char>(0xC0 | ((chUtf16>>6)&0x1F));
        szUtf8[1] = static_cast<unsigned char>(0x80 | (chUtf16&0x3F));
    }
    else
    {
        szUtf8[0] = static_cast<unsigned char>(0xE0 | ((chUtf16>>12)&0xF));
        szUtf8[1] = static_cast<unsigned char>(0x80 | ((chUtf16>>6)&0x3F));
        szUtf8[2] = static_cast<unsigned char>(0x80 | (chUtf16&0x3F));
    }
    return reinterpret_cast<char *>(szUtf8);
}
/////////////////////////////////////////////////////////
// Utf16ToUtf8 - Convert a string from UTF-16 encoding
// to UTF-8
// Parameters:
// sUtf16 (in): Input String
// Returns: Converted string
/////////////////////////////////////////////////////////
std::string Utf16ToUtf8(const std::wstring &sUtf16)
{
    std::string sUtf8;
    std::wstring::const_iterator itr;
    for (itr=sUtf16.begin(); itr!=sUtf16.end(); ++itr)
        sUtf8 += Utf16ToUtf8(*itr);
    return sUtf8;
}
I believe this should work on any platform, but I have not been able to test it except on my own system, so it may have bugs.
#include <iostream>
#include <fstream>
int main()
{
    const char szTest[] = "Das tausendschöne Jungfräulein,\n"
                          "Das tausendschöne Herzelein,\n"
                          "Wollte Gott, wollte Gott,\n"
                          "ich wär' heute bei ihr!\n";
    std::wstring sUtf16 = NativeToUtf16(szTest);
    std::string sUtf8 = Utf16ToUtf8(sUtf16);
    std::ofstream ofs("test.txt");
    if (ofs)
        ofs << sUtf8;
    return 0;
}
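The single-character converter in this answer states that it does not handle surrogate pairs. The sketch below shows one way that case could be covered. It is not the author's code: it assumes a C++11 compiler for `std::u16string`, the function name `Utf16ToUtf8WithSurrogates` is hypothetical, and replacing unpaired surrogates with U+FFFD is a choice made for this example only.

```cpp
#include <cstddef>
#include <string>

// Sketch: convert UTF-16 code units to UTF-8, combining surrogate pairs
// into one code point. Unpaired surrogates become U+FFFD (an assumption).
std::string Utf16ToUtf8WithSurrogates(const std::u16string &in)
{
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        unsigned long cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size() &&
            in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            // high surrogate followed by low surrogate
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;
        } else if (cp >= 0xD800 && cp <= 0xDFFF) {
            cp = 0xFFFD;  // unpaired surrogate
        }
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

Using `char16_t` instead of `wchar_t` keeps the input width fixed at 16 bits on every platform, which sidesteps the question of what size `wchar_t` has outside Windows.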
Answered by raymai97
I know I'm late to the party, but I think I need to spread this. For Visual C++ 2005 and above, if the source file doesn't contain a BOM (byte-order mark) and your system locale is not English, VC will assume that your source file is not in Unicode.
To get your UTF-8 source files compiled correctly, you must save them in UTF-8 without BOM encoding, and the system locale (the language for non-Unicode programs) must be English.
Answered by Windows programmer
Maybe try an experiment:
#pragma setlocale(".UTF-8")
or:
#pragma setlocale("english_england.UTF-8")