Note: this page reproduces a popular StackOverflow question and its answers. Original: http://stackoverflow.com/questions/4059775/
The content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors on StackOverflow.

Time: 2020-08-28 14:24:49  Source: igfitidea

Convert ISO-8859-1 strings to UTF-8 in C/C++

Tags: c++, c

Asked by gordonwd

You would think this would be readily available, but I'm having a hard time finding a simple library function that will convert a C or C++ string from ISO-8859-1 coding to UTF-8. I'm reading data that is in 8-bit ISO-8859-1 encoding, but need to convert it to a UTF-8 string for use in an SQLite database and eventually an Android app.

I found one commercial product, but it's beyond my budget at this time.

Answer by R.. GitHub STOP HELPING ICE

If your source encoding will always be ISO-8859-1, this is trivial. Here's a loop:

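/* in: NUL-terminated ISO-8859-1 input; out: buffer of at least 2*strlen(in)+1 bytes */
unsigned char *in, *out;
while (*in)
    if (*in<128) *out++=*in++;                               /* ASCII: copy unchanged */
    else *out++=0xc2+(*in>0xbf), *out++=(*in++&0x3f)+0x80;  /* 0x80..0xFF: two bytes */
*out = 0;                                                    /* terminate the output */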

For safety you need to ensure that the output buffer is twice as large as the input buffer, or else include a size limit and check it in the loop condition.

Answer by Lord Raiden

For C++ I use this:

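#include <cstdint>
#include <string>

// Convert an ISO-8859-1 string to UTF-8.
std::string iso_8859_1_to_utf8(const std::string &str)
{
    std::string strOut;
    for (std::string::const_iterator it = str.begin(); it != str.end(); ++it)
    {
        std::uint8_t ch = *it;
        if (ch < 0x80) {
            // ASCII: one byte, unchanged
            strOut.push_back(static_cast<char>(ch));
        }
        else {
            // 0x80..0xFF: two bytes, 110000xx 10xxxxxx
            strOut.push_back(static_cast<char>(0xc0 | ch >> 6));
            strOut.push_back(static_cast<char>(0x80 | (ch & 0x3f)));
        }
    }
    return strOut;
}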

Answer by Spacemoose

You can use the boost::locale library:

http://www.boost.org/doc/libs/1_49_0/libs/locale/doc/html/charset_handling.html

The code would look like this:

#include <boost/locale.hpp>
// to_utf lives in the boost::locale::conv namespace
std::string utf8_string = boost::locale::conv::to_utf<char>(latin1_string, "Latin1");

Answer by cytrinox

The C++03 standard does not provide functions to directly convert between specific charsets.

Depending on your OS, you can use iconv() on Linux, or MultiByteToWideChar() and related functions on Windows. ICU, an open-source library, provides extensive support for string conversion.
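
As a rough illustration of the iconv() route, here is a minimal sketch. The helper name is hypothetical, and the code assumes the POSIX/glibc signature in which the input pointer is passed as `char **`; some platforms declare it as `const char **` and may require linking with `-liconv`.

```cpp
#include <iconv.h>     // POSIX iconv_open / iconv / iconv_close
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: convert ISO-8859-1 bytes to UTF-8 using iconv.
std::string latin1_to_utf8_iconv(const std::string& in)
{
    iconv_t cd = iconv_open("UTF-8", "ISO-8859-1");   // arguments: (to, from)
    if (cd == (iconv_t)-1)
        throw std::runtime_error("iconv_open failed");

    // Worst case: every input byte becomes two output bytes.
    std::string out(in.size() * 2, '\0');

    char* src = const_cast<char*>(in.data());          // iconv does not modify the input
    std::size_t src_left = in.size();
    char* dst = &out[0];
    std::size_t dst_left = out.size();

    std::size_t rc = iconv(cd, &src, &src_left, &dst, &dst_left);
    iconv_close(cd);
    if (rc == (std::size_t)-1)
        throw std::runtime_error("iconv failed");

    out.resize(out.size() - dst_left);                 // trim the unused tail
    return out;
}
```

The same pattern works for other source encodings by changing the second argument of iconv_open(), provided the output buffer is sized for that encoding's worst case.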

Answer by RBerteig

The Unicode folks have some tables that might help if you are faced with Windows 1252 instead of true ISO-8859-1. The definitive one seems to be the table that maps every code point in CP1252 to a code point in Unicode. Encoding the Unicode as UTF-8 is a straightforward exercise.

It would not be difficult to parse that table directly and form a lookup table from it at compile time.
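
A hand-written version of such a lookup table could look like the sketch below. CP1252 differs from ISO-8859-1 only in the range 0x80..0x9F, so only those 32 entries are needed. The names are hypothetical, and passing the five bytes that CP1252 leaves undefined (0x81, 0x8D, 0x8F, 0x90, 0x9D) through as C1 control code points is an assumption of this sketch.

```cpp
#include <cstdint>
#include <string>

// Unicode code points for CP1252 bytes 0x80..0x9F.
// Assumption: the undefined bytes 0x81, 0x8D, 0x8F, 0x90, 0x9D map to the
// C1 control code points with the same value.
static const std::uint16_t kCp1252High[32] = {
    0x20AC, 0x0081, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008D, 0x017D, 0x008F,
    0x0090, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x009D, 0x017E, 0x0178,
};

// Hypothetical helper: convert a CP1252 string to UTF-8.
std::string cp1252_to_utf8(const std::string& in)
{
    std::string out;
    for (unsigned char b : in) {
        // Outside 0x80..0x9F the byte value is already the code point.
        std::uint32_t cp = (b >= 0x80 && b <= 0x9F) ? kCp1252High[b - 0x80] : b;
        if (cp < 0x80) {
            out.push_back(static_cast<char>(cp));
        } else if (cp < 0x800) {
            out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else {
            // All table entries fit in 16 bits, so three bytes suffice.
            out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
    }
    return out;
}
```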

Answer by Cheers and hth. - Alf

ISO-8859-1 to UTF-8 involves nothing more than the encoding algorithm because ISO-8859-1 is a subset of Unicode. So you already have the Unicode code points. Check Wikipedia for the algorithm.
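
To make the encoding step concrete, here is a minimal sketch of the general UTF-8 encoding rule applied to one code point at a time. The function names are hypothetical, and the encoder assumes its input is a valid Unicode scalar value (at most 0x10FFFF and not a surrogate).

```cpp
#include <string>

// Hypothetical helper: append the UTF-8 encoding of code point `cp` to `out`.
// Assumes `cp` is a valid Unicode scalar value.
void append_utf8(std::string& out, char32_t cp)
{
    if (cp < 0x80) {                       // 0xxxxxxx
        out.push_back(static_cast<char>(cp));
    } else if (cp < 0x800) {               // 110xxxxx 10xxxxxx
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {             // 1110xxxx 10xxxxxx 10xxxxxx
        out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else {                               // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
}

// In ISO-8859-1 each byte value is its Unicode code point, so the
// conversion is just the encoder applied to every byte.
std::string latin1_string_to_utf8(const std::string& in)
{
    std::string out;
    for (unsigned char b : in)
        append_utf8(out, b);
    return out;
}
```

Only the first two branches are ever taken for ISO-8859-1 input; the longer forms are included so the same encoder can serve other source encodings.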

The C++ aspects -- integrating that with iostreams -- are much harder.

I suggest you walk around that mountain instead of trying to drill through it or climb it, that is, implement a simple string to string converter.

Cheers & hth.,