Original question: http://stackoverflow.com/questions/2318481/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors on Stack Overflow.
UTF-8, CString and CFile? (C++, MFC)
Asked by SeargX
I'm currently working on an MFC program that specifically has to work with UTF-8. At some point, I have to write UTF-8 data into a file; to do that, I'm using CFile and CString.
When I write UTF-8 data (Russian characters, to be more precise) into a file, the output looks like this:
Dà???÷àòàí?:
?è?ò?ìà
?e?è?a???òa?
and so on. This is certainly not UTF-8. To read this data properly, I have to change my system settings; switching the code page for non-ASCII characters to a Russian encoding table does work, but then all my Latin-based non-ASCII characters fail. Anyway, this is how I write the file:
CFile CSVFile( m_sCible, CFile::modeCreate|CFile::modeWrite);
CString sWorkingLine;
//Add stuff into sWorkingline
CSVFile.Write(sWorkingLine,sWorkingLine.GetLength());
//Clean sWorkingline and start over
Am I missing something? Shall I use something else instead? Is there some kind of catch I've missed? I'll be tuned in for your wisdom and experience, fellow programmers.
EDIT: Of course, right after asking the question, I finally found something that might be interesting, which can be found here. Thought I might share it.
EDIT 2:
Okay, so I added the BOM to my file, which now contains Chinese characters, probably because I didn't convert my line to UTF-8. To add the BOM I did...
unsigned char BOM[3] = {0xEF, 0xBB, 0xBF};
CSVFile.Write(BOM,3);
And after that, I added...
TCHAR TestLine;
//Convert the line to UTF-8 multibyte.
WideCharToMultiByte (CP_UTF8,0,sWorkingLine,sWorkingLine.GetLength(),TestLine,strlen(TestLine)+1,NULL,NULL);
//Add the line to file.
CSVFile.Write(TestLine,strlen(TestLine)+1);
But then I cannot compile, as I don't really know how to get the length of TestLine. strlen doesn't seem to accept TCHAR. Fixed: I used a static length of 1000 instead.
EDIT 3:
So, I added this code...
wchar_t NewLine[1000];
wcscpy( NewLine, CT2CW( (LPCTSTR) sWorkingLine ));
TCHAR* TCHARBuf = new TCHAR[1000];
//Convert the line to UTF-8 multibyte.
WideCharToMultiByte (CP_UTF8,0,NewLine,1000,TCHARBuf,1000,NULL,NULL);
//Find how many characters we have to add
size_t size = 0;
HRESULT hr = StringCchLength(TCHARBuf, MAX_PATH, &size);
//Add the line to the file
CSVFile.Write(TCHARBuf,size);
It compiles fine, but when I look at my new file, it's exactly the same as when I didn't have all this new code (e.g. Dà???÷àòàí?:). It feels like I haven't taken a single step forward, although I guess only a small thing separates me from victory.
EDIT 4:
I removed previously added code, as Nate asked, and I decided to use his code instead, meaning that now, when I get to add my line, I have...
CT2CA outputString(sWorkingLine, CP_UTF8);
//Add line to file.
CSVFile.Write(outputString,::strlen(outputString));
Everything compiles fine, but the Russian characters are shown as ???????. Getting closer, but still not there. By the way, I'd like to thank everyone who has tried to help me; it is MUCH appreciated. I've been stuck on this for a while now, and I can't wait for this problem to be gone.
FINAL EDIT (I hope): The way I first obtained my UTF-8 characters was wrong for my new way of outputting the text (I was re-encoding without really realizing it). After changing that, I got acceptable results. With the UTF-8 BOM added at the beginning of my file, it can be read as Unicode in other programs, like Excel.
Hurray! Thank you everyone!
Answered by Nate
When you output data, you need to do the following (this assumes you are compiling in Unicode mode, which is highly recommended):
CString russianText = L"Привет мир";
CFile yourFile(_T("yourfile.txt"), CFile::modeWrite | CFile::modeCreate);
CT2CA outputString(russianText, CP_UTF8);
yourFile.Write(outputString, ::strlen(outputString));
If _UNICODE is not defined (you are working in multi-byte mode instead), you need to know what code page your input text is in and convert it to something you can use. This example shows working with Russian text that is in UTF-16 format, saving it to UTF-8:
// Example 1: convert from Russian text in UTF-16 (note the "L"
// in front of the string), into UTF-8.
CW2A russianTextAsUtf8(L"Привет мир", CP_UTF8);
yourFile.Write(russianTextAsUtf8, ::strlen(russianTextAsUtf8));
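To make the UTF-16 to UTF-8 step less of a black box, here is a rough, portable sketch of the byte-level transformation that a conversion with CP_UTF8 is expected to perform. It is not the ATL implementation: the helper names appendUtf8 and utf16ToUtf8 are hypothetical, std::u16string stands in for a wide string, and replacing unpaired surrogates with U+FFFD is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Append one Unicode code point to `out` as UTF-8 (1 to 4 bytes).
void appendUtf8(std::string& out, char32_t cp) {
    assert(cp <= 0x10FFFF);
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical helper: convert UTF-16 code units to UTF-8 bytes.
// Surrogate pairs are combined; unpaired surrogates become U+FFFD
// (an assumption of this sketch, not documented library behavior).
std::string utf16ToUtf8(const std::u16string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF) {  // high surrogate
            if (i + 1 < in.size() && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
                ++i;  // consume the low surrogate as well
            } else {
                cp = 0xFFFD;
            }
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {  // stray low surrogate
            cp = 0xFFFD;
        }
        appendUtf8(out, cp);
    }
    return out;
}
```

Each Cyrillic letter becomes two bytes, so the number of bytes to write is generally larger than the number of characters in the original string. That is why the examples here pass the length of the converted buffer to Write rather than the character count of the source string.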
More likely, your Russian text is in some other code page, such as KOI8-R. In that case, you need to convert from the other code page into UTF-16, and then convert the UTF-16 into UTF-8. You cannot convert directly from KOI8-R to UTF-8 using the conversion macros, because they always try to convert narrow text to the system code page. So the easy way is to do this:
// Example 2: convert from Russian text in KOI8-R (code page 20866)
// to UTF-16, and then to UTF-8. Conversions between UTFs are
// lossless.
CA2W russianTextAsUtf16("\xf0\xd2\xc9\xd7\xc5\xd4 \xcd\xc9\xd2", 20866);
CW2A russianTextAsUtf8(russianTextAsUtf16, CP_UTF8);
yourFile.Write(russianTextAsUtf8, ::strlen(russianTextAsUtf8));
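The two-step idea (legacy code page to Unicode code points, then code points to UTF-8) can be illustrated without any Windows API. The sketch below is only an illustration under stated assumptions: koi8rLettersToUtf8 and appendUtf8Short are hypothetical names, and the table covers only ASCII and the KOI8-R letter range 0xC0..0xFF; every other byte is replaced with '?'. Real code would use a complete mapping, as the conversion classes above do.

```cpp
#include <cassert>
#include <string>

// Append a code point below U+0800 as UTF-8 (enough for ASCII and Cyrillic).
void appendUtf8Short(std::string& out, unsigned cp) {
    assert(cp < 0x800);
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical helper: convert KOI8-R text to UTF-8.
// Step 1 maps each byte to a Unicode code point, step 2 encodes it as UTF-8.
// Only ASCII and the letters at 0xC0..0xFF are handled in this sketch.
std::string koi8rLettersToUtf8(const std::string& in) {
    // Code points for KOI8-R bytes 0xC0..0xDF (lowercase letters).
    static const unsigned short kLower[32] = {
        0x044E, 0x0430, 0x0431, 0x0446, 0x0434, 0x0435, 0x0444, 0x0433,
        0x0445, 0x0438, 0x0439, 0x043A, 0x043B, 0x043C, 0x043D, 0x043E,
        0x043F, 0x044F, 0x0440, 0x0441, 0x0442, 0x0443, 0x0436, 0x0432,
        0x044C, 0x044B, 0x0437, 0x0448, 0x044D, 0x0449, 0x0447, 0x044A};
    std::string out;
    for (unsigned char b : in) {
        if (b < 0x80) {
            appendUtf8Short(out, b);                        // ASCII is unchanged
        } else if (b >= 0xC0 && b <= 0xDF) {
            appendUtf8Short(out, kLower[b - 0xC0]);         // lowercase letter
        } else if (b >= 0xE0) {
            // Uppercase letters use the same order, 0x20 below lowercase in Unicode.
            appendUtf8Short(out, kLower[b - 0xE0] - 0x20u);
        } else {
            appendUtf8Short(out, '?');                      // 0x80..0xBF not covered here
        }
    }
    return out;
}
```

Feeding it the KOI8-R bytes from Example 2 yields the UTF-8 encoding of the same greeting, which is what the CA2W and CW2A pair is meant to produce with a full code page table.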
You don't need a BOM (it's optional; I wouldn't use it unless there was a specific reason to do so).
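If there is a specific reason to emit a BOM (the asker mentions that Excel then reads the file as Unicode), writing it is just three bytes before the text. Here is a minimal, portable sketch using std::ofstream instead of CFile; the helper names writeUtf8File and readFileBytes are hypothetical, and the text passed in is assumed to be UTF-8 already.

```cpp
#include <cassert>
#include <fstream>
#include <iterator>
#include <string>

// Hypothetical helper: write already-encoded UTF-8 text to `path`,
// optionally prefixed with the UTF-8 BOM (EF BB BF).
bool writeUtf8File(const std::string& path, const std::string& utf8Text, bool withBom) {
    // Binary mode keeps the bytes exactly as given (no newline translation).
    std::ofstream out(path, std::ios::binary | std::ios::trunc);
    if (!out) {
        return false;
    }
    if (withBom) {
        static const char kBom[] = "\xEF\xBB\xBF";
        out.write(kBom, 3);
    }
    out.write(utf8Text.data(), static_cast<std::streamsize>(utf8Text.size()));
    return static_cast<bool>(out);
}

// Hypothetical helper: read a whole file back as raw bytes, for checking.
std::string readFileBytes(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}
```

The BOM only marks the file as UTF-8; it does not convert anything. If the text is not UTF-8 to begin with, the file will still be garbled, which matches what the asker saw in EDIT 2.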
Make sure you read this: http://msdn.microsoft.com/en-us/library/87zae4a3(VS.80).aspx. If you incorrectly use CT2CA (for example, using the assignment operator) you will run into trouble. The linked documentation page shows examples of how to use it and how not to use it.
Further information:
- The C in CT2CA indicates const. I use it when possible, but some conversions only support the non-const version (e.g. CW2A).
- The T in CT2CA indicates that you are converting from an LPCTSTR. Thus it will work whether your code is compiled with the _UNICODE flag or not. You could also use CW2A (where W indicates wide characters).
- The A in CT2CA indicates that you are converting to an "ANSI" (8-bit char) string.
- Finally, the second parameter to CT2CA indicates the code page you are converting to.
To do the reverse conversion (from UTF-8 to LPCTSTR), you could do:
CString myString(CA2CT(russianText, CP_UTF8));
In this case, we are converting from an "ANSI" string in UTF-8 format to an LPCTSTR. The LPCTSTR is always assumed to be UTF-16 (if _UNICODE is defined) or the current system code page (if _UNICODE is not defined).
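For the reverse direction, here is a rough, portable sketch of what decoding UTF-8 into UTF-16 involves. It is not the ATL implementation: utf8ToUtf16 is a hypothetical name, std::u16string stands in for a wide string, malformed input is replaced with U+FFFD, and overlong encodings are not rejected. All of these are simplifying assumptions of this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: decode UTF-8 bytes into UTF-16 code units.
// Malformed sequences yield U+FFFD; overlong forms are not detected.
std::u16string utf8ToUtf16(const std::string& in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char b0 = static_cast<unsigned char>(in[i]);
        char32_t cp = 0xFFFD;  // default for an invalid lead byte
        std::size_t len = 1;
        if (b0 < 0x80) {
            cp = b0;
        } else if ((b0 & 0xE0) == 0xC0) {
            len = 2; cp = b0 & 0x1F;
        } else if ((b0 & 0xF0) == 0xE0) {
            len = 3; cp = b0 & 0x0F;
        } else if ((b0 & 0xF8) == 0xF0) {
            len = 4; cp = b0 & 0x07;
        }
        // Fold in the continuation bytes (each must look like 10xxxxxx).
        bool ok = (i + len <= in.size());
        for (std::size_t k = 1; ok && k < len; ++k) {
            unsigned char b = static_cast<unsigned char>(in[i + k]);
            if ((b & 0xC0) != 0x80) {
                ok = false;
            } else {
                cp = (cp << 6) | (b & 0x3F);
            }
        }
        if (!ok || cp > 0x10FFFF) {
            cp = 0xFFFD;
            len = 1;  // resynchronize on the next byte
        }
        if (cp >= 0x10000) {
            // Outside the BMP: emit a surrogate pair.
            cp -= 0x10000;
            out += static_cast<char16_t>(0xD800 + (cp >> 10));
            out += static_cast<char16_t>(0xDC00 + (cp & 0x3FF));
        } else {
            out += static_cast<char16_t>(cp);
        }
        i += len;
    }
    return out;
}
```

The resulting string has one 16-bit unit per Cyrillic letter, so the character count shrinks relative to the UTF-8 byte count.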
Answered by Nick Dandoulakis
You'll have to convert sWorkingLine to UTF-8 and then write it to the file.
WideCharToMultiByte can convert Unicode strings to UTF-8 if you select the CP_UTF8 code page.
MultiByteToWideChar can convert ASCII chars to Unicode.
Answered by user261840
Make sure you're using Unicode (TCHAR is wchar_t). Then before you write the data, convert it using the WideCharToMultiByte Win32 API function.