C++ Visual Studio character encoding issues

Disclaimer: this page is a mirror of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same CC BY-SA license and attribute it to the original authors (not me): Stack Overflow

Original URL: http://stackoverflow.com/questions/1857668/
Asked by MPelletier
Not being able to wrap my head around this one is a real source of shame...
I'm working with a French version of Visual Studio (2008), in a French Windows (XP). French accents put in strings sent to the output window get corrupted. Ditto input from the output window. Typical character encoding issue: I enter ANSI, get UTF-8 in return, or something to that effect. What setting can ensure that the characters remain in ANSI when showing a "hardcoded" string to the output window?
EDIT:
Example:
#include <iostream>
int main()
{
std::cout << "àéêù" << std::endl;
return 0;
}
Will show in the output:
óúÛ¨
(here encoded as HTML for your viewing pleasure)
I would really like it to show:
àéêù
Answered by Bahbar
Before I go any further, I should mention that what you are doing is not C/C++ compliant. The specification states in 2.2 which character sets are valid in source code. It ain't much in there, and all the characters used are in ASCII. So... everything below is about a specific implementation (as it happens, VC2008 on a US-locale machine).
To start with, you have 4 chars on your cout line, and 4 glyphs on the output. So the issue is not one of UTF8 encoding, as it would combine multiple source chars into fewer glyphs.
From your source string to the display on the console, all these things play a part:
- What encoding your source file is in (i.e. how your C++ file will be seen by the compiler)
- What your compiler does with a string literal, and what source encoding it understands
- How your << interprets the encoded string you're passing in
- What encoding the console expects
- How the console translates that output to a font glyph.
Now...
1 and 2 are fairly easy ones. It looks like the compiler guesses what format the source file is in, and decodes it to its internal representation. It generates the data chunk corresponding to the string literal in the current codepage, no matter what the source encoding was. I have failed to find explicit details/control on this.
3 is even easier. Except for control codes, << just passes the data down for char *.
4 is controlled by SetConsoleOutputCP. It should default to your default system codepage. You can also figure out which one you have with GetConsoleOutputCP (the input is controlled differently, through SetConsoleCP).
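For reference, a minimal sketch of point 4 in isolation (assuming a Windows console application and <windows.h>; 1252 is just the western European code page discussed below):

#include <windows.h>
#include <iostream>

int main()
{
    // Ask which code page the console currently uses to decode output.
    UINT cp = GetConsoleOutputCP();
    std::cout << "console output code page: " << cp << std::endl;

    // Request Windows-1252; the call returns 0 (FALSE) on failure,
    // e.g. if the code page is not installed.
    if (SetConsoleOutputCP(1252) == 0)
        std::cout << "SetConsoleOutputCP(1252) failed" << std::endl;
    return 0;
}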
5 is a funny one. I banged my head trying to figure out why I could not get the é to show up properly, using CP1252 (western European, Windows). It turns out that my system font does not have the glyph for that character, and helpfully uses the glyph of my standard codepage (capital Theta, the same I would get if I did not call SetConsoleOutputCP). To fix it, I had to change the font I use on consoles to Lucida Console (a TrueType font).
Some interesting things I learned looking at this:
- the encoding of the source does not matter, as long as the compiler can figure it out (notably, changing it to UTF8 did not change the generated code; my "é" string was still encoded with CP1252 as 233 0)
- VC is picking a codepage for the string literals that I do not seem to control
- controlling what the console shows is more painful than what I was expecting
So... what does this mean to you? Here are bits of advice:
- don't use non-ASCII in string literals. Use resources, where you control the encoding
- make sure you know what encoding is expected by your console, and that your font has the glyphs to represent the chars you send
- if you want to figure out what encoding is being used in your case, I'd advise printing the actual value of the character as an integer. char * a = "é"; std::cout << (unsigned int) (unsigned char) a[0] does show 233 for me, which happens to be the encoding in CP1252.
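Expanding that last tip into a complete program, here is a small diagnostic sketch that dumps every byte of the literal, so you can tell CP1252 (é is the single byte 233, i.e. 0xE9) from UTF-8 (é is the pair 195 169, i.e. 0xC3 0xA9) at a glance. The exact numbers you see depend on your source and execution encodings:

#include <iostream>

int main()
{
    const char* s = "àéêù";
    // Print each byte as an unsigned integer; the double cast avoids
    // sign extension of "negative" char values.
    for (const char* p = s; *p != '\0'; ++p)
        std::cout << (unsigned int)(unsigned char)*p << ' ';
    std::cout << std::endl;
    return 0;
}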
BTW, if what you got was "óú?¨" rather than what you pasted, then it looks like your 4 bytes are interpreted somewhere as CP850.
Answered by ruf
Try this:
#include <iostream>
#include <locale>
int main()
{
std::locale::global(std::locale(""));
std::cout << "àéêù" << std::endl;
return 0;
}
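A note on why this can work: std::locale("") names the user's preferred locale, and std::locale::global() with a named locale also calls setlocale() for the C library, which lets the CRT convert the narrow output to what the console expects. It still depends on the literal's bytes matching the locale's encoding, so it is not a universal fix.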
Answered by Davislor
Because I was requested to, I'll do some necromancy. The other answers were from 2009, but this question still came up in a search I did in 2018. The situation today is very different. Also, the accepted answer was incomplete even back in 2009.
The Source Character Set
Every compiler (including Microsoft's Visual Studio 2008 and later, gcc, clang and icc) will read UTF-8 source files that start with a BOM without a problem, and clang will not read anything but UTF-8, so UTF-8 with a BOM is the lowest common denominator for C and C++ source files.
The language standard doesn't say what source character sets the compiler needs to support. Some real-world source files are even saved in a character set incompatible with ASCII. Microsoft Visual C++ in 2008 supported UTF-8 source files with a byte order mark, as well as both forms of UTF-16. Without a byte order mark, it would assume the file was encoded in the current 8-bit code page, which was always a superset of ASCII.
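Since the byte order mark is what the compiler keys on, here is a small sketch that checks whether a file starts with the UTF-8 BOM, whose three bytes are 0xEF 0xBB 0xBF (the file name is just a placeholder):

#include <fstream>
#include <iostream>

int main()
{
    std::ifstream f("main.cpp", std::ios::binary);  // hypothetical file name
    unsigned char bom[3] = { 0, 0, 0 };
    f.read(reinterpret_cast<char*>(bom), 3);
    const bool hasBom = f && bom[0] == 0xEF && bom[1] == 0xBB && bom[2] == 0xBF;
    std::cout << (hasBom ? "UTF-8 BOM present" : "no UTF-8 BOM") << std::endl;
    return 0;
}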
The Execution Character Sets
In 2012, the compiler added a /utf-8 switch to CL.EXE. Today, it also supports the /source-charset and /execution-charset switches, as well as /validate-charset to detect if your file is not actually UTF-8. This page on MSDN has a link to the documentation on Unicode support for every version of Visual C++.
Current versions of the C++ standard say the compiler must have both an execution character set, which determines the numeric value of character constants like 'a', and an execution wide-character set that determines the value of wide-character constants like L'é'.
To language-lawyer for a bit, there are very few requirements in the standard for how these must be encoded, and yet Visual C and C++ manage to break them. It must contain about 100 characters that cannot have negative values, and the encodings of the digits '0' through '9' must be consecutive. Neither capital nor lowercase letters have to be, because they weren't on some old mainframes. (That is, '0'+9 must be the same as '9', but there is still a compiler in real-world use today whose default behavior is that 'a'+9 is not 'j' but '?', and this is legal.) The wide-character execution set must include the basic execution set and have enough bits to hold all the characters of any supported locale. Every mainstream compiler supports at least one Unicode locale and understands valid Unicode characters specified with \Uxxxxxxxx, but a compiler that didn't could claim to be complying with the standard.
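The digit guarantee from that paragraph can be written down directly as a compile-time check; the letter check is left commented out precisely because the standard does not promise it:

// Guaranteed by the standard: '0'..'9' are consecutive.
static_assert('0' + 9 == '9', "digits must be consecutive");

// NOT guaranteed: holds in ASCII and its supersets, fails on EBCDIC targets.
// static_assert('a' + 9 == 'j', "letters need not be consecutive");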
The way Visual C and C++ violate the language standard is by making their wchar_t UTF-16, which can only represent some characters as surrogate pairs, when the standard says wchar_t must be a fixed-width encoding. This is because Microsoft defined wchar_t as 16 bits wide back in the 1990s, before the Unicode committee figured out that 16 bits were not going to be enough for the entire world, and Microsoft was not going to break the Windows API. It does support the standard char32_t type as well.
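To make the surrogate-pair point concrete, a compile-time sketch using U+1D11E (the musical G clef), which lies outside the Basic Multilingual Plane. The first assertion only holds where wchar_t is 16 bits, as on Windows; on Linux, where wchar_t is 32 bits, it would fail:

// Two UTF-16 code units (a surrogate pair) plus the terminator.
const wchar_t clef[] = L"\U0001D11E";
static_assert(sizeof(clef) / sizeof(wchar_t) == 3, "expected a surrogate pair");

// char32_t always holds one code point per unit.
const char32_t clef32[] = U"\U0001D11E";
static_assert(sizeof(clef32) / sizeof(char32_t) == 2, "one code unit plus NUL");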
UTF-8 String Literals
The third issue this question raises is how to get the compiler to encode a string literal as UTF-8 in memory. You've been able to write something like this since C++11:
constexpr unsigned char hola_utf8[] = u8"¡Hola, mundo!";
This will encode the string as its null-terminated UTF-8 byte representation regardless of whether the source character set is UTF-8, UTF-16, Latin-1, CP1252, or even IBM EBCDIC 1047 (which is a silly theoretical example but still, for backward compatibility, the default on IBM's Z-series mainframe compiler). That is, it's equivalent to initializing the array with { 0xC2, 0xA1, 'H', /* ... , */ '!', 0 }.
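A caveat for readers on newer toolchains: since C++20, u8 string literals have type const char8_t[] rather than const char[], so initializing an array of unsigned char from one no longer compiles as-is; under /std:c++20 you would declare the array as char8_t, or use the hex-escape form shown at the end of this answer.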
If it would be too inconvenient to type a character in, or if you want to distinguish between superficially-identical characters such as space and non-breaking space or precomposed and combining characters, you also have universal character escapes:
constexpr unsigned char hola_utf8[] = u8"\u00a1Hola, mundo!";
You can use these regardless of the source character set and regardless of whether you're storing the literal as UTF-8, UTF-16 or UCS-4. They were originally added in C99, but Microsoft supported them in Visual Studio 2015.
Edit: As reported by Matthew, u8 strings are buggy in some versions of MSVC, including 19.14. It turns out, so are literal non-ASCII characters, even if you specify /utf-8 or /source-charset:utf-8 /execution-charset:utf-8. The sample code above works properly in 19.22.27905.
There is another way to do this that worked in Visual C or C++ 2008, however: octal and hexadecimal escape codes. You would have encoded UTF-8 literals in that version of the compiler with:
const unsigned char hola_utf8[] = "\xC2\xA1Hello, world!";
Answered by Charles Anderson
I tried this code:
#include <iostream>
#include <fstream>
#include <sstream>
int main()
{
std::wstringstream wss;
wss << L"àéêù";
std::wstring s = wss.str();
const wchar_t* p = s.c_str();
std::wcout << wss.str() << std::endl;
std::wofstream file("C:\\a.txt");
file << p << std::endl;
return 0;
}
The debugger showed that wss, s and p all had the expected values (i.e. "àéêù"), as did the output file. However, what appeared in the console was óú?¨.
The problem is therefore in the Visual Studio console, not in C++. Using Bahbar's excellent answer, I added:
SetConsoleOutputCP(1252);
as the first line, and the console output then appeared as it should.
Answered by Marc.2377
Using _setmode() works, and is arguably better than changing the codepage or setting a locale, since it will actually make your program output in Unicode and thus will be consistent, no matter which codepage or locale is currently set.
Example:
#include <iostream>
#include <io.h>
#include <fcntl.h>
int wmain()
{
_setmode( _fileno(stdout), _O_U16TEXT );
std::wcout << L"àéêù" << std::endl;
return 0;
}
Inside Visual Studio, make sure you set up your project for Unicode (right-click the project -> General -> Character Set = Use Unicode Character Set).
MinGW users:

- Define both UNICODE and _UNICODE
- Add -finput-charset=iso-8859-1 to the compiler options to get around this error: "converting to execution character set: Invalid argument"
- Add -municode to the linker options to get around "undefined reference to `WinMain@16'" (read more).
Edit: The equivalent call to set Unicode input is: _setmode( _fileno(stdin), _O_U16TEXT );
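A minimal sketch of both directions together (assuming a Windows console and, per the restriction noted below, only the wide streams):

#include <stdio.h>
#include <io.h>
#include <fcntl.h>
#include <iostream>
#include <string>

int wmain()
{
    _setmode( _fileno(stdin),  _O_U16TEXT );  // Unicode input
    _setmode( _fileno(stdout), _O_U16TEXT );  // Unicode output
    std::wstring line;
    std::getline(std::wcin, line);            // accented input arrives intact
    std::wcout << line << std::endl;
    return 0;
}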
Edit 2: An important piece of information, especially considering the question uses std::cout: this is not supported. The MSDN docs state (emphasis mine):
Unicode mode is for wide print functions (for example, wprintf) and is not supported for narrow print functions. Use of a narrow print function on a Unicode mode stream triggers an assert.
So, don't use std::cout when the console output mode is _O_U16TEXT; similarly, don't use std::cin when the console input is _O_U16TEXT. You must use the wide versions of these facilities (std::wcout, std::wcin).

And do note that mixing cout and wcout in the same output is not allowed (but I find it works if you call flush() and then _setmode() before switching between the narrow and wide operations).
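What that last parenthetical describes, as a sketch; it matches this answer's experience, but the documentation does not guarantee it, so treat it as fragile:

#include <stdio.h>
#include <io.h>
#include <fcntl.h>
#include <iostream>

int main()
{
    _setmode( _fileno(stdout), _O_TEXT );     // narrow mode
    std::cout << "narrow" << std::flush;      // flush before switching

    _setmode( _fileno(stdout), _O_U16TEXT );  // wide mode
    std::wcout << L"àéêù" << std::flush;
    return 0;
}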
Answered by vladasimovic
//Save As Windows 1252
#include<iostream>
#include<windows.h>
int main()
{
SetConsoleOutputCP(1252);
std::cout << "àéêù" << std::endl;
}
Visual Studio does not support UTF-8 for C++, but partially supports it for C:
//Save As UTF8 without signature
#include<stdio.h>
#include<windows.h>
int main()
{
SetConsoleOutputCP(65001);
printf("àéêù\n");
}
Answered by Mikal
Make sure you do not forget to change the console's font to Lucida Console, as mentioned by Bahbar: it was crucial in my case (French Win 7 64-bit with VC 2012).
Then, as mentioned by others, use SetConsoleOutputCP(1252) for C++, but it may fail depending on the available code pages, so you might want to use GetConsoleOutputCP() to check that it worked, or at least check that SetConsoleOutputCP(1252) does not return zero (zero means failure). Changing the global locale also works (for some reason there is no need to do cout.imbue(locale())), but it may break some libraries!
In C, SetConsoleOutputCP(65001); or the locale-based approach worked for me once I had saved the source code as UTF-8 without signature (scroll down, the sans-signature choice is way below in the list of pages).
Input using SetConsoleCP(65001); failed for me, apparently due to a bad implementation of code page 65001 in Windows. The locale approach failed too, in both C and C++. A more involved solution, not relying on native chars but on wchar_t, seems required.
Answered by Gary
I had the same problem with Chinese input. My source code is UTF-8 and I added /utf-8 to the compiler options. It works fine with C++ wide strings and wide chars, but not with narrow strings/chars, which show up as garbled characters in the Visual Studio 2019 debugger and in my SQL database. I have to use the narrow characters because of converting to SQLAPI++'s SAString. Eventually, I found that checking the following option (Control Panel -> Region -> Administrative -> Change system locale) resolved the issue. I know it is not an ideal solution, but it does help me.