
Notice: this page is a translated copy of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/3672767/

Time: 2020-09-15 15:13:31. Source: igfitidea

Converting UTF-8 Characters to Upper/Lower case C++

Tags: c++, linux, windows, unicode, cross-platform

Asked by NSA

I have a string containing UTF-8 characters, and a method that is supposed to convert every character to upper or lower case. This is easy for characters in the ASCII range, and some characters obviously have no case at all, e.g. any Chinese character. But is there a good way to detect and convert the other characters that do have upper and lower case forms, e.g. the Greek letters? Note that I need this to work on both Windows and Linux.

Thank you,

Answered by Alexandre C.

Have a look at ICU.

Note that case-conversion functions are locale-dependent. Consider Turkish: the ASCII letter I lowercases to the dotless ı (U+0131), and the ASCII letter i uppercases to the dotted İ (U+0130).

Answered by tidwall

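Assuming you have access to wctype.h, convert your text to a wide-character (wchar_t) string, call towupper() on each character, and then convert the result back to UTF-8. Note that wchar_t is 2 bytes on Windows but typically 4 bytes on Linux.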
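
The round trip described above can be sketched as follows. This is only a minimal illustration, not a definitive implementation: the helper names (decode_utf8, encode_utf8, utf8_to_upper) are made up for this example, the decoder assumes well-formed UTF-8, and the code assumes that wchar_t values are Unicode code points (true for glibc and Windows, but not guaranteed by the standard). What towupper() converts beyond ASCII depends on the current C locale.

```cpp
#include <cassert>
#include <cwchar>
#include <cwctype>
#include <string>

// Hypothetical helper: decode one UTF-8 sequence starting at s[i] and advance i.
// Assumes the input is well-formed UTF-8; no validation is done.
static char32_t decode_utf8(const std::string& s, std::size_t& i) {
    assert(i < s.size());
    unsigned char b0 = static_cast<unsigned char>(s[i++]);
    if (b0 < 0x80) return b0;                          // single-byte (ASCII)
    int extra = (b0 >= 0xF0) ? 3 : (b0 >= 0xE0) ? 2 : 1;
    char32_t cp = b0 & (0x3F >> extra);                // payload bits of the lead byte
    while (extra-- > 0 && i < s.size())
        cp = (cp << 6) | (static_cast<unsigned char>(s[i++]) & 0x3F);
    return cp;
}

// Hypothetical helper: append the UTF-8 encoding of cp to out.
static void encode_utf8(char32_t cp, std::string& out) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Decode, call towupper() on each code point, re-encode.
std::string utf8_to_upper(const std::string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size();) {
        char32_t cp = decode_utf8(in, i);
        // Where wchar_t is 16 bits, code points above WCHAR_MAX cannot be
        // passed to towupper(), so they are copied through unchanged.
        if (cp <= static_cast<char32_t>(WCHAR_MAX))
            cp = static_cast<char32_t>(std::towupper(static_cast<std::wint_t>(cp)));
        encode_utf8(cp, out);
    }
    return out;
}
```

A caller would typically run std::setlocale(LC_ALL, "") once at startup so that towupper() uses the user's locale rather than the default "C" locale. This one-to-one mapping cannot handle conversions that change the number of characters.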

Answered by Davislor

On Linux, or with a standard library that supports it, you would obtain a std::locale object for the appropriate locale, as uppercase conversion is locale-specific. Convert each UTF-8 character to a wchar_t, call std::toupper() on it, then convert back to UTF-8. Note that the resulting string might be longer or shorter, and some ligatures might not work properly: ß becoming SS in German is the example everyone keeps bringing up.
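
As a rough sketch of the locale-aware step only (the UTF-8 conversion is left out), the two-argument std::toupper() from <locale> can be applied to each wide character. The function name to_upper_in_locale is invented for this example, and the sketch assumes each wchar_t holds one complete character, which does not hold for UTF-16 surrogate pairs.

```cpp
#include <cassert>
#include <locale>
#include <string>

// Hypothetical helper: uppercase each wide character using the ctype
// facet of the given locale. This is a one-to-one mapping, so cases
// like German ß -> SS are not handled.
std::wstring to_upper_in_locale(std::wstring text, const std::locale& loc) {
    assert(std::has_facet<std::ctype<wchar_t>>(loc));
    for (wchar_t& ch : text)
        ch = std::toupper(ch, loc);   // two-argument overload from <locale>
    return text;
}
```

A call might look like to_upper_in_locale(text, std::locale("en_US.UTF-8")). Locale names are platform-specific, and the std::locale constructor throws std::runtime_error if the named locale is not available, so that name is only an assumption about the target system.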

On Windows, this approach works even less often, because wide characters there are UTF-16, which is not a fixed-width encoding (this violates the C++ language standard, but then maybe the standards committee shouldn't have tried to bluff Microsoft into breaking the Windows API). There is a ToUpper method in the CLR.
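
To illustrate why UTF-16 is not fixed-width, here is a small sketch that encodes a single code point as UTF-16 code units. The function name to_utf16 is made up for this example. A cased letter outside the Basic Multilingual Plane, such as U+10428 (a Deseret small letter), needs two code units, so a conversion that treats each 16-bit unit as a character cannot uppercase it.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: encode one Unicode code point as UTF-16.
std::u16string to_utf16(char32_t cp) {
    // Precondition: a valid code point that is not itself a surrogate.
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    std::u16string out;
    if (cp < 0x10000) {
        out += static_cast<char16_t>(cp);                    // one code unit
    } else {
        cp -= 0x10000;
        out += static_cast<char16_t>(0xD800 + (cp >> 10));   // high surrogate
        out += static_cast<char16_t>(0xDC00 + (cp & 0x3FF)); // low surrogate
    }
    return out;
}
```

For instance, to_utf16(0x03B1) (Greek small alpha) yields one code unit, while to_utf16(0x10428) yields the surrogate pair 0xD801, 0xDC28.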

It is probably easier to use a portable library such as ICU.

Also decide whether what you want is uppercase (capitalizing every letter) or titlecase (capitalizing the first letter of a string, or the first part of a ligature).
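
The difference between the two can be shown with a deliberately simplified, ASCII-only sketch. The function names ascii_upper and ascii_title are invented for this example; bytes outside ASCII are left untouched in the default "C" locale, and real titlecasing of ligatures or digraphs needs Unicode data such as ICU provides.

```cpp
#include <cctype>
#include <string>

// Uppercase: capitalize every (ASCII) letter.
std::string ascii_upper(std::string s) {
    for (char& c : s)
        c = static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
    return s;
}

// Titlecase, in the sense used above: capitalize only the first letter
// of the string and leave the rest as it is.
std::string ascii_title(std::string s) {
    for (char& c : s) {
        unsigned char u = static_cast<unsigned char>(c);
        if (std::isalpha(u)) {
            c = static_cast<char>(std::toupper(u));
            break;  // only the first letter is changed
        }
    }
    return s;
}
```

With the input "hello world", ascii_upper returns "HELLO WORLD" and ascii_title returns "Hello world".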
