Convert ISO-8859-1 strings to UTF-8 in C/C++
Notice: this page reproduces a popular Stack Overflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and authors, and attribute it to the original authors (not me): Stack Overflow
Original URL: http://stackoverflow.com/questions/4059775/
Asked by gordonwd
You would think this would be readily available, but I'm having a hard time finding a simple library function that will convert a C or C++ string from ISO-8859-1 coding to UTF-8. I'm reading data that is in 8-bit ISO-8859-1 encoding, but need to convert it to a UTF-8 string for use in an SQLite database and eventually an Android app.
I found one commercial product, but it's beyond my budget at this time.
Answered by R.. GitHub STOP HELPING ICE
If your source encoding will always be ISO-8859-1, this is trivial. Here's a loop:
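unsigned char *in, *out;  /* in: NUL-terminated ISO-8859-1 input; out: output buffer (both must be set by the caller) */
while (*in)
    if (*in<128) *out++=*in++;  /* ASCII byte: copy unchanged */
    else *out++=0xc2+(*in>0xbf), *out++=(*in++&0x3f)+0x80;  /* 0x80-0xFF: lead byte 0xC2 or 0xC3, then one continuation byte */
*out = 0;  /* terminate the output string (takes one extra byte) */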
For safety you need to ensure that the output buffer is twice as large as the input buffer, or else include a size limit and check it in the loop condition.
Answered by Lord Raiden
For C++ I use this:
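#include <cstdint>
#include <string>

// Converts an ISO-8859-1 (Latin-1) string to UTF-8.
std::string iso_8859_1_to_utf8(const std::string &str)
{
    std::string strOut;
    for (std::string::const_iterator it = str.begin(); it != str.end(); ++it)
    {
        uint8_t ch = *it;
        if (ch < 0x80) {
            // ASCII: copy unchanged
            strOut.push_back(static_cast<char>(ch));
        }
        else {
            // 0x80-0xFF: two bytes of the form 110000xx 10xxxxxx
            strOut.push_back(static_cast<char>(0xc0 | ch >> 6));
            strOut.push_back(static_cast<char>(0x80 | (ch & 0x3f)));
        }
    }
    return strOut;
}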
Answered by Spacemoose
You can use the boost::locale library:
http://www.boost.org/doc/libs/1_49_0/libs/locale/doc/html/charset_handling.html
The code would look like this:
#include <boost/locale.hpp>
#include <string>
// latin1_string is a std::string holding ISO-8859-1 bytes
std::string utf8_string = boost::locale::conv::to_utf<char>(latin1_string, "Latin1");
Answered by cytrinox
The C++03 standard does not provide functions to directly convert between specific charsets.
Depending on your OS, you can use iconv() on Linux or MultiByteToWideChar() and related functions on Windows. The open-source ICU library provides extensive support for string conversion.
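For the iconv() route, a minimal sketch might look like the following. The wrapper name latin1_to_utf8_iconv and the error handling via exceptions are assumptions made for this example. It also assumes a POSIX system where iconv() takes a char ** input pointer, as glibc does, and where the "ISO-8859-1" and "UTF-8" encoding names are available.

```cpp
#include <iconv.h>

#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical wrapper (name and error handling are assumptions):
// converts ISO-8859-1 text to UTF-8 using POSIX iconv.
std::string latin1_to_utf8_iconv(const std::string &input)
{
    // iconv_open takes the target encoding first, then the source encoding.
    iconv_t cd = iconv_open("UTF-8", "ISO-8859-1");
    if (cd == (iconv_t)-1)
        throw std::runtime_error("iconv_open failed");

    // iconv wants mutable buffers, so work on a copy of the input.
    // Each Latin-1 byte expands to at most two UTF-8 bytes.
    std::vector<char> in(input.begin(), input.end());
    std::vector<char> out(input.size() * 2 + 1);

    char *in_ptr = in.data();
    char *out_ptr = out.data();
    std::size_t in_left = in.size();
    std::size_t out_left = out.size();

    const std::size_t rc = iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left);
    iconv_close(cd);
    if (rc == (std::size_t)-1)
        throw std::runtime_error("iconv conversion failed");

    // The number of bytes written is the buffer size minus what is left.
    return std::string(out.data(), out.size() - out_left);
}
```

Opening a conversion descriptor per call keeps the sketch simple; code that converts many strings would probably reuse one descriptor instead.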
Answered by RBerteig
The Unicode folks have some tables that might help if faced with Windows 1252 instead of true ISO-8859-1. The definitive one seems to be this one, which maps every code point in CP1252 to a code point in Unicode. Encoding the Unicode as UTF-8 is a straightforward exercise.
It would not be difficult to parse that table directly and form a lookup table from it at compile time.
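To give an idea of what such a lookup table could look like, here is a small sketch. It is written out by hand rather than generated from the Unicode mapping file, and only covers bytes 0x80-0x9F, the range where CP1252 differs from ISO-8859-1. The names kCp1252High and cp1252_to_code_point are hypothetical, and mapping the five bytes that CP1252 leaves undefined to U+FFFD is an assumption of this example.

```cpp
#include <cstdint>

// Unicode code points for CP1252 bytes 0x80-0x9F. Bytes that CP1252 leaves
// undefined (0x81, 0x8D, 0x8F, 0x90, 0x9D) are mapped to U+FFFD here; that
// choice is an assumption of this sketch.
static const std::uint16_t kCp1252High[32] = {
    0x20AC, 0xFFFD, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0xFFFD, 0x017D, 0xFFFD,
    0xFFFD, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0xFFFD, 0x017E, 0x0178
};

// Hypothetical helper: maps one CP1252 byte to its Unicode code point.
std::uint32_t cp1252_to_code_point(unsigned char byte)
{
    if (byte >= 0x80 && byte <= 0x9F)
        return kCp1252High[byte - 0x80];
    // Everything outside 0x80-0x9F has the same value as in ISO-8859-1,
    // which is also its Unicode code point.
    return byte;
}
```

The resulting code points can reach U+2122, so they need up to three UTF-8 bytes; the two-byte shortcut that works for true ISO-8859-1 is not enough here.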
Answered by Cheers and hth. - Alf
ISO-8859-1 to UTF-8 involves nothing more than the encoding algorithm because ISO-8859-1 is a subset of Unicode. So you already have the Unicode code points. Check Wikipedia for the algorithm.
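For reference, the general UTF-8 encoding step can be sketched as below. The function name append_utf8 is hypothetical, and the sketch assumes the caller passes a valid code point (at most U+10FFFF and not a surrogate); it does no validation. For ISO-8859-1 input only the first two branches are ever taken, since every byte value is below 0x100.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: appends the UTF-8 encoding of one code point to out.
// Assumes cp is a valid Unicode scalar value; no validation is performed.
void append_utf8(std::string &out, std::uint32_t cp)
{
    if (cp < 0x80) {
        // 1 byte: 0xxxxxxx
        out.push_back(static_cast<char>(cp));
    } else if (cp < 0x800) {
        // 2 bytes: 110xxxxx 10xxxxxx
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else {
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
}
```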
The C++ aspects -- integrating that with iostreams -- are much harder.
I suggest you walk around that mountain instead of trying to drill through it or climb it, that is, implement a simple string to string converter.
Cheers & hth.,