How to convert UTF-8 to ASCII in C++?
Original source: http://stackoverflow.com/questions/2980253/
Note: this content comes from Stack Overflow and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not this page).
Asked by Suri
I am getting a response from a server in UTF-8 but am not able to read it. How do I convert UTF-8 to ASCII in C++?
Answered by Artelius
First note that ASCII is a 7-bit format. There are 8-bit encodings; if you are after one of these (such as ISO 8859-1), you'll need to be more specific.
To convert an ASCII string to UTF-8, do nothing: they are the same. So if your UTF-8 string is composed only of ASCII characters, then it is already an ASCII string, and no conversion is necessary.
If the UTF-8 string contains non-ASCII characters (anything with accents, or non-Latin characters), there is no way to convert it to ASCII. (You may be able to convert it to one of the ISO encodings.)
There are ways to strip the accents from Latin characters to get at least some resemblance in ASCII. Alternatively, if you just want to delete the non-ASCII characters, simply delete all bytes with values >= 128 from the UTF-8 string.
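The byte-deletion approach described above can be sketched in portable C++ as follows. This is a minimal illustration, not code from the original answer; the function name strip_non_ascii is made up for the example, and the result simply drops every non-ASCII character rather than transliterating it.

```cpp
#include <string>

// Hypothetical helper: returns a copy of `utf8` with every byte >= 0x80
// removed. In UTF-8, all bytes of a non-ASCII character have the high bit
// set, so this deletes those characters completely and keeps ASCII intact.
std::string strip_non_ascii(const std::string& utf8)
{
    std::string ascii;
    ascii.reserve(utf8.size());
    for (unsigned char byte : utf8) {
        if (byte < 0x80) {
            ascii.push_back(static_cast<char>(byte));
        }
    }
    return ascii;
}
```

For example, passing the UTF-8 bytes of "Jörg" yields "Jrg", which shows why deletion is lossy.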
Answered by Aoi Karasu
This example works under Windows (you did not mention your target operating system):
// Requires <windows.h> (MultiByteToWideChar) and <stdlib.h> (wcstombs_s)
// The sample buffer contains "©ha®a©te®s" in UTF-8
unsigned char buffer[15] = { 0xc2, 0xa9, 0x68, 0x61, 0xc2, 0xae, 0x61, 0xc2, 0xa9, 0x74, 0x65, 0xc2, 0xae, 0x73, 0x00 };
// utf8 is the pointer to your UTF-8 string
char* utf8 = (char*)buffer;
// convert multibyte UTF-8 to wide string UTF-16
// (first call only asks for the required length, including the terminator)
int length = MultiByteToWideChar(CP_UTF8, 0, (LPCSTR)utf8, -1, NULL, 0);
if (length > 0)
{
    wchar_t* wide = new wchar_t[length];
    MultiByteToWideChar(CP_UTF8, 0, (LPCSTR)utf8, -1, wide, length);
    // convert it to ANSI, use setlocale() to set your locale, if not set
    size_t convertedChars = 0;
    char* ansi = new char[length];
    wcstombs_s(&convertedChars, ansi, length, wide, _TRUNCATE);
}
Remember to delete[] wide and/or ansi when they are no longer needed. Since this is Unicode, I'd recommend sticking to wchar_t* instead of char* unless you are certain that the input buffer contains only characters that belong to the same ANSI subset.
Answered by Jörg W Mittag
If the string contains characters which do not exist in ASCII, then there is nothing you can do, because, well, those characters do not exist in ASCII.
If the string contains only characters which do exist in ASCII, then there is nothing you need to do, because the string is already in the ASCII encoding. UTF-8 was specifically designed to be backwards-compatible with ASCII in such a way that any character which is in ASCII has the exact same encoding in UTF-8 as it has in ASCII, and that any character which is not in ASCII can never have an encoding which is valid ASCII, i.e. it will always have an encoding which is illegal in ASCII. (Specifically, any non-ASCII character will be encoded as a sequence of 2 to 4 octets, all of which have their most significant bit set, i.e. have an integer value > 127.)
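Because every octet of a non-ASCII character has its most significant bit set, checking whether a UTF-8 string is already ASCII reduces to checking that no byte exceeds 127. A minimal sketch of that check (the name is_ascii is chosen here for illustration and does not come from the answer):

```cpp
#include <algorithm>
#include <string>

// Hypothetical helper: true when every byte has its most significant bit
// clear, i.e. the UTF-8 data is also valid ASCII and needs no conversion.
bool is_ascii(const std::string& utf8)
{
    return std::all_of(utf8.begin(), utf8.end(),
                       [](unsigned char byte) { return byte < 0x80; });
}
```

If this returns true for the server response, the bytes can be used as ASCII unchanged; if it returns false, one of the lossy or encoding-based options has to be chosen.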
Instead of simply trying to convert the string, you could try to transliterate the string. Most languages on this planet have some form of ASCII transliteration scheme that at least keeps the text somewhat comprehensible. For example, my first name is "Jörg" and its ASCII transliteration would be "Joerg". The name of the creator of the Ruby Programming Language is "まつもとゆきひろ" and its ASCII transliteration would be "Matsumoto Yukihiro". However, please note that you will lose information. For example, the German sz-ligature gets transliterated to "ss", so the word "Maße" (measurements) gets transliterated to "Masse". However, "Masse" (mass, in the physicist's sense, not the Christian's) is also a word. As another example, Turkish has 4 "i"s (small and capital, with and without dot) and ASCII only has 2 (small with dot and capital without dot), therefore you will lose information either about the dot or about whether or not it was a capital letter.
So, the only way which will not lose information (in other words: corrupt data) is to somehow encode the non-ASCII characters into sequences of ASCII characters. There are many popular encoding schemes: SGML entity references, MIME, Unicode escape sequences, TeX or LaTeX. So, you would encode the data as it enters your system and decode it when it leaves the system.
Of course, the easiest way would be to simply fix your system.
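As one illustration of the "encode into ASCII sequences" option mentioned above, the sketch below rewrites each non-ASCII character as a hexadecimal numeric character reference of the form &#xHHHH;. The function name encode_as_char_refs is hypothetical, the choice of this particular scheme is an assumption made for the example, and the decoder does only light validation: malformed bytes are emitted as U+FFFD rather than rejected, and overlong or surrogate encodings are not detected. A literal '&' is also encoded so that the output stays unambiguous.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical helper: returns a pure-ASCII string in which every
// non-ASCII character (and every literal '&') of `utf8` is replaced by
// a numeric character reference such as "&#xF6;".
std::string encode_as_char_refs(const std::string& utf8)
{
    std::string out;
    std::size_t i = 0;
    while (i < utf8.size()) {
        const unsigned char lead = static_cast<unsigned char>(utf8[i]);
        if (lead < 0x80 && lead != '&') {   // plain ASCII: copy unchanged
            out.push_back(static_cast<char>(lead));
            ++i;
            continue;
        }

        // Sequence length and payload bits taken from the lead byte.
        std::size_t len = 1;
        std::uint32_t cp = lead;            // covers the literal '&' case
        if ((lead & 0xE0) == 0xC0)      { len = 2; cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; }
        else if (lead >= 0x80)          { cp = 0xFFFD; }  // stray byte

        // Fold in the continuation bytes (10xxxxxx), six bits each.
        bool ok = (i + len <= utf8.size());
        for (std::size_t k = 1; ok && k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(utf8[i + k]);
            if ((c & 0xC0) != 0x80) ok = false;
            else cp = (cp << 6) | (c & 0x3F);
        }
        if (!ok) { cp = 0xFFFD; len = 1; }  // truncated or malformed

        char ref[16];
        std::snprintf(ref, sizeof ref, "&#x%lX;",
                      static_cast<unsigned long>(cp));
        out += ref;
        i += len;
    }
    return out;
}
```

With this scheme the UTF-8 bytes of "Jörg" become "J&#xF6;rg", and a matching decoder at the system boundary could restore the original text.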
Answered by CB Bailey
UTF-8 is an encoding that can map every Unicode character. ASCII only supports a very small subset of Unicode.
For the subset of Unicode that is ASCII, the mapping from UTF-8 to ASCII is a direct one-to-one byte mapping, so if the server sends you a document that only contains ASCII characters in UTF-8 encoding, then you can directly read that as ASCII.
If the response contains non-ASCII characters then, whatever you do, you won't be able to express them in ASCII. To filter these out of a UTF-8 stream, you can just filter out any byte >= 128 (0x80 hex).
Answered by Kronen
Check this utf-8 String Library and forget about converting it to ASCII.
Answered by Learner
ASCII is a code page representing 128 characters and control codes, whereas UTF-8 is able to represent any character in the Unicode standard, which goes far beyond ASCII's capabilities. So the answer to your question is: not possible, unless you have a more precise specification of the data source.
Answered by Mike Weller
UTF-8 is backwards compatible with ASCII, meaning all ASCII characters are encoded as single, unchanged byte values in UTF-8. If the text should be ASCII but you are unable to read it, then there must be another issue.
Answered by herohuyongtao
Note that there are two UTF8 types: UTF8_with_BOM and UTF8_without_BOM, and you need to handle them differently when converting to ANSI. The following functions (Windows only) will work.
UTF8_with_BOM to ANSI
#include <windows.h>
#include <cstring>
#include <fstream>
#include <string>
using namespace std;

// change a string's encoding from UTF8 to ANSI;
// the caller must delete[] the returned buffer
char* change_encoding_from_UTF8_to_ANSI(const char* szU8)
{
    int u8Len = (int)strlen(szU8);
    int wcsLen = ::MultiByteToWideChar(CP_UTF8, 0, szU8, u8Len, NULL, 0);
    wchar_t* wszString = new wchar_t[wcsLen + 1];
    ::MultiByteToWideChar(CP_UTF8, 0, szU8, u8Len, wszString, wcsLen);
    wszString[wcsLen] = L'\0';

    int ansiLen = ::WideCharToMultiByte(CP_ACP, 0, wszString, wcsLen, NULL, 0, NULL, NULL);
    char* szAnsi = new char[ansiLen + 1];
    ::WideCharToMultiByte(CP_ACP, 0, wszString, wcsLen, szAnsi, ansiLen, NULL, NULL);
    szAnsi[ansiLen] = '\0';

    delete[] wszString;
    return szAnsi;
}

void change_encoding_from_UTF8_with_BOM_to_ANSI(const char* filename)
{
    ifstream infile;
    string strLine = "";
    string strResult = "";
    infile.open(filename);
    if (infile)
    {
        // the first 3 bytes (ef bb bf) are the UTF-8 byte order mark;
        // they should be deleted from the output
        if (getline(infile, strLine) && strLine.size() >= 3)
        {
            strResult += strLine.substr(3) + "\n";
        }
        while (getline(infile, strLine))
        {
            strResult += strLine + "\n";
        }
    }
    infile.close();

    char* changeResult = change_encoding_from_UTF8_to_ANSI(strResult.c_str());
    strResult = changeResult;
    delete[] changeResult;

    ofstream outfile;
    outfile.open(filename);
    outfile.write(strResult.c_str(), strResult.length());
    outfile.flush();
    outfile.close();
}

UTF8_without_BOM to ANSI
void change_encoding_from_UTF8_without_BOM_to_ANSI(const char* filename)
{
    ifstream infile;
    string strLine = "";
    string strResult = "";
    infile.open(filename);
    if (infile)
    {
        while (getline(infile, strLine))
        {
            strResult += strLine + "\n";
        }
    }
    infile.close();

    char* changeResult = change_encoding_from_UTF8_to_ANSI(strResult.c_str());
    strResult = changeResult;
    delete[] changeResult;

    ofstream outfile;
    outfile.open(filename);
    outfile.write(strResult.c_str(), strResult.length());
    outfile.flush();
    outfile.close();
}
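If it is not known in advance whether the data carries a byte order mark, the three BOM bytes (ef bb bf) can be tested for explicitly instead of being dropped unconditionally. The following is a small portable sketch under that assumption; the name strip_utf8_bom is invented for this example and is not part of the answer's code.

```cpp
#include <string>

// Hypothetical helper: removes a leading UTF-8 byte order mark
// (EF BB BF) from `text` if one is present.
// Returns true when a BOM was found and removed, false otherwise.
bool strip_utf8_bom(std::string& text)
{
    static const char bom[] = "\xEF\xBB\xBF";
    if (text.compare(0, 3, bom) == 0) {
        text.erase(0, 3);
        return true;
    }
    return false;
}
```

Calling such a check on the file contents before the conversion step would let a single function handle both kinds of input.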
Answered by Anatoly
As to the phrase
"If the string contains characters which do not exist in ASCII, then there is nothing you can do, because, well, those characters do not exist in ASCII."
it's wrong.
UTF-8 is a multibyte code set and may cover more than 2 sets of symbols (languages). In practice you have either a single language (usually English) or 2 languages, one of which is English.
- The first case is plain ASCII chars (in any encoding).
- The second case can be represented as single-byte chars in the corresponding encoding, provided the language is not Chinese or Arabic.
Under the conditions above you can convert UTF-8 to ASCII chars. There is no corresponding function in C++, so you have to do it manually. It is easy to tell two-byte symbols from one-byte ones: the high bit of the first byte is set for two-byte symbols and unset otherwise.
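A manual conversion along these lines might look like the sketch below. It assumes ISO 8859-1 as the target single-byte encoding, which is a choice made for this example only; the function name utf8_to_latin1 and the fallback parameter are hypothetical. Because three- and four-byte sequences also start with a byte whose high bit is set, the sketch checks the exact lead byte rather than the high bit alone, and replaces anything outside U+0000..U+00FF with the fallback character.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: converts UTF-8 to ISO 8859-1. Only code points
// U+0000..U+00FF exist in that encoding; any other character, and any
// malformed input, is replaced by a single `fallback` character.
std::string utf8_to_latin1(const std::string& utf8, char fallback = '?')
{
    std::string out;
    std::size_t i = 0;
    while (i < utf8.size()) {
        const unsigned char b0 = static_cast<unsigned char>(utf8[i]);
        if (b0 < 0x80) {
            // High bit clear: one-byte symbol, identical in both encodings.
            out.push_back(static_cast<char>(b0));
            ++i;
        } else if ((b0 == 0xC2 || b0 == 0xC3) && i + 1 < utf8.size() &&
                   (static_cast<unsigned char>(utf8[i + 1]) & 0xC0) == 0x80) {
            // Two-byte symbol in U+0080..U+00FF: 110xxxxx 10yyyyyy.
            const unsigned char b1 = static_cast<unsigned char>(utf8[i + 1]);
            out.push_back(static_cast<char>(((b0 & 0x1F) << 6) | (b1 & 0x3F)));
            i += 2;
        } else {
            // Not representable: emit one fallback, skip the whole sequence.
            out.push_back(fallback);
            ++i;
            while (i < utf8.size() &&
                   (static_cast<unsigned char>(utf8[i]) & 0xC0) == 0x80) {
                ++i;
            }
        }
    }
    return out;
}
```

For text in other languages, the same structure would apply with a lookup table from code points to the bytes of the relevant code page instead of the direct mapping used here.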