Unicode Processing in C++
Notice: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original address and authors, and attribute it to the original authors (not me): Stack Overflow
Original question: http://stackoverflow.com/questions/55641/
Asked by Fortepianissimo
What is the best practice for Unicode processing in C++?
Accepted answer by hazzen
- Use ICU for dealing with your data (or a similar library)
- In your own data store, make sure everything is stored in the same encoding
- Make sure you are always using your Unicode library for mundane tasks like string length, capitalization status, etc. Never use standard library builtins like isalpha unless that is the definition you want.
- I can't say it enough: never iterate over the indices of a string if you care about correctness; always use your Unicode library for this.
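To illustrate why indexing a string byte by byte goes wrong, here is a minimal sketch in plain standard C++ (no ICU). It assumes the string holds valid UTF-8; the names count_code_points and sample_word are made up for this example.

```cpp
#include <cstddef>
#include <string>

// Counts Unicode code points in a string assumed to hold valid UTF-8.
// Continuation bytes have the bit pattern 10xxxxxx, so every byte that
// does not match that pattern starts a new code point.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t count = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}

// The word "naive" with U+00EF (i with diaeresis) written as its two
// UTF-8 bytes, 0xC3 0xAF, so the example does not depend on the
// source file encoding.
std::string sample_word() {
    return "na\xC3\xAFve";
}
```

Here sample_word().size() is 6 (bytes) while count_code_points(sample_word()) is 5. Even the code point count is not the same as user-perceived characters once combining marks are involved, which is why a Unicode library is the safer choice for real text.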
Answer by eestrada
If you don't care about backwards compatibility with previous C++ standards, the current C++11 standard has built-in Unicode support: http://www.open-std.org/JTC1/SC22/WG21/docs/papers/2011/n3242.pdf
So the truly best practice for Unicode processing in C++ would be to use the built-in facilities for it. That isn't always a possibility with older code bases though, with the standard being so new at present.
EDIT: To clarify, C++11 is Unicode-aware in that it now has support for Unicode literals and Unicode strings. However, the standard library has only limited support for Unicode processing and conversion. For your current needs this may be enough. However, if you need to do a large amount of heavy lifting right now, then you may still need to use something like ICU for more in-depth processing. There are some proposals currently in the works to include more robust support for text conversion between different encodings. My guess (and hope) is that this will be part of the next technical report.
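As a small sketch of the C++11 Unicode literals and string types mentioned above, the following uses std::u16string and std::u32string with the u"" and U"" prefixes. The function names utf16_units and utf32_units are invented for this example.

```cpp
#include <cstddef>
#include <string>

// U+1F600 lies outside the Basic Multilingual Plane, so in UTF-16 it is
// stored as a surrogate pair, i.e. two char16_t code units.
std::size_t utf16_units() {
    std::u16string s = u"\U0001F600";
    return s.size();
}

// In UTF-32 every code point occupies exactly one char32_t.
std::size_t utf32_units() {
    std::u32string s = U"\U0001F600";
    return s.size();
}
```

utf16_units() returns 2 and utf32_units() returns 1 for the same character. The language gives you the storage types, but size() still counts code units, so anything beyond storage (case mapping, collation, segmentation) is left to a library.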
Answer by jschroedl
Our company (and others) use the open source International Components for Unicode (ICU) library originally developed by Taligent.
It handles strings, locales, conversions, date/times, collation, transformations, et al.
Start with the ICU User Guide
Answer by Adam Pierce
Here is a checklist for Windows programming:
- All strings enclosed in _T("my string")
- strlen() etc. functions replaced with _tcslen() etc.
- Use LPTSTR and LPCTSTR instead of char * and const char *
- When starting new projects in Dev Studio, religiously make sure the Unicode option is selected in your project properties.
- For C++ strings, use std::wstring instead of std::string
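For readers not on Windows, here is a rough portable sketch of what the checklist amounts to once the Unicode option is on. In <tchar.h>, _T("...") then expands to a wide literal L"..." and _tcslen maps to wcslen; the code below writes that expanded form directly. The function names wide_length and wide_greeting are made up for illustration.

```cpp
#include <cstddef>
#include <cwchar>
#include <string>

// Expanded form of _tcslen(_T("h\u00e9llo")) in a Unicode build:
// a wide string literal measured with wcslen.
std::size_t wide_length() {
    const wchar_t* text = L"h\u00e9llo";
    return std::wcslen(text);
}

// The std::wstring counterpart of the same text.
std::wstring wide_greeting() {
    return std::wstring(L"h\u00e9llo");
}
```

Both report a length of 5 here because every character fits in one wchar_t. Keep in mind that wchar_t is 16 bits on Windows, so characters outside the Basic Multilingual Plane take two units there, and wcslen counts units rather than characters.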
Answer by ine
Look at the question "Case insensitive string comparison in C++".
That question has a link to the Microsoft documentation on Unicode: http://msdn.microsoft.com/en-us/library/cc194799.aspx
If you look at the left-hand navigation pane on MSDN next to that article, you should find a lot of information pertaining to Unicode functions. It is part of a chapter on "Encoding Characters" (http://msdn.microsoft.com/en-us/library/cc194786.aspx)
It has the following subsections:
- The Code-Page Model
- Double-Byte Character Sets in Windows
- Unicode
- Compatibility Issues in Mixed Environments
- Unicode Data Conversion
- Migrating Windows-Based Programs to Unicode
- Summary
Answer by Willow Schlanger
Although this may not be best practice for everyone, you can write your own C++ Unicode routines if you want!
I just finished doing it over a weekend. I learned a lot, and though I don't guarantee it's 100% bug free, I did a lot of testing and it seems to work correctly.
My code is under the New BSD license and can be found here:
http://code.google.com/p/netwidecc/downloads/list
It is called WSUCONV and comes with a sample main() program that converts between UTF-8, UTF-16, and Standard ASCII. If you throw away the main code, you've got a nice library for reading / writing UNICODE.
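To give a feel for what a hand-rolled conversion routine involves, here is a minimal sketch of encoding one code point as UTF-8. It is not taken from WSUCONV; the function name encode_utf8 and the choice to return an empty string on invalid input are assumptions made for this example.

```cpp
#include <string>

// Encodes one Unicode code point as UTF-8.
// Returns an empty string for surrogates (U+D800..U+DFFF) and for
// values above U+10FFFF, which are not valid scalar values.
std::string encode_utf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {
        // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF) return out;  // surrogate: invalid
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For instance, encode_utf8(0x20AC) yields the three bytes E2 82 AC for the euro sign. A full converter would also need the reverse direction and UTF-16 surrogate handling, which is where most of the edge cases live.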
Answer by Paul Hutchinson
As has been said above, a library is the best bet when working on a large system. However, sometimes you do want to handle things yourself (maybe because the library would use too many resources, as on a microcontroller). In this case you want a simple library that you can copy parts out of for the things you actually need.
Willow Schlanger's example code seems like a good one (see his answer for more details).
I also found another one that has less code. It lacks full error checking and only handles UTF-8, but it was simpler to take parts out of.
Here's a list of the embedded libraries that seem decent.
Embedded libraries
- http://code.google.com/p/netwidecc/downloads/list (UTF8, UTF16LE, UTF16BE, UTF32)
- http://www.cprogramming.com/tutorial/unicode.html (UTF8)
- http://utfcpp.sourceforge.net/ (Simple UTF8 library)
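As a rough idea of the kind of small routine you might lift out for an embedded target, here is a sketch of a UTF-8 decoder that uses no heap and no exceptions. It is not copied from any of the libraries listed; the Decoded struct and decode_utf8 function are hypothetical names, and the error checking is deliberately partial.

```cpp
#include <cstddef>
#include <cstdint>

// Result of decoding one code point from a UTF-8 byte sequence.
struct Decoded {
    std::uint32_t code_point;  // decoded value, or 0xFFFD on error
    std::size_t length;        // bytes consumed (at least 1 when len > 0)
};

// Decodes the code point starting at data[0], reading at most len bytes.
// Only the lead byte and continuation bytes are checked; overlong forms
// and encoded surrogates are NOT rejected in this sketch.
Decoded decode_utf8(const unsigned char* data, std::size_t len) {
    if (len == 0) return {0xFFFD, 0};
    unsigned char lead = data[0];
    std::size_t need;
    std::uint32_t cp;
    if (lead < 0x80)                { need = 1; cp = lead; }
    else if ((lead & 0xE0) == 0xC0) { need = 2; cp = lead & 0x1F; }
    else if ((lead & 0xF0) == 0xE0) { need = 3; cp = lead & 0x0F; }
    else if ((lead & 0xF8) == 0xF0) { need = 4; cp = lead & 0x07; }
    else return {0xFFFD, 1};  // stray continuation or invalid lead byte
    if (need > len) return {0xFFFD, 1};  // truncated sequence
    for (std::size_t i = 1; i < need; ++i) {
        if ((data[i] & 0xC0) != 0x80) return {0xFFFD, 1};  // bad continuation
        cp = (cp << 6) | (data[i] & 0x3F);
    }
    return {cp, need};
}
```

Calling it in a loop and advancing by length walks a buffer one code point at a time. On error it returns U+FFFD and consumes a single byte, so the caller can resynchronize instead of stopping.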
Answer by Jan Rüegg
Have a look at the recommendations of UTF-8 Everywhere
Answer by Joe Schneider
Use IBM's International Components for Unicode