Converting an ISO-8859-1 string to UTF-8 in C/C++

Question

Asked by gordonwd

You would think this would be readily available, but I'm having a hard time finding a simple library function that will convert a C or C++ string from ISO-8859-1 coding to UTF-8. I'm reading data that is in 8-bit ISO-8859-1 encoding, but need to convert it to a UTF-8 string for use in an SQLite database and eventually an Android app.

I found one commercial product, but it's beyond my budget at this time.

Answer 1

Answered by R.. GitHub STOP HELPING ICE

If your source encoding will always be ISO-8859-1, this is trivial. Here's a loop:

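unsigned char *in, *out;   /* in: NUL-terminated ISO-8859-1 input; out: output buffer (the caller must set both) */
while (*in)
    if (*in<128) *out++=*in++;   /* ASCII bytes are copied unchanged */
    else *out++=0xc2+(*in>0xbf), *out++=(*in++&0x3f)+0x80;   /* 0x80-0xFF become two bytes */
*out = 0;   /* terminate the output string */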

For safety, ensure that the output buffer is at least twice as large as the input buffer (each input byte produces at most two output bytes, plus one byte for the terminator), or else include a size limit and check it in the loop condition.

Answer 2

Answered by Lord Raiden

For C++ I use this:

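#include <cstdint>
#include <string>

std::string iso_8859_1_to_utf8(const std::string &str)
{
    std::string strOut;
    strOut.reserve(str.size() * 2);   // worst case: two bytes per input byte
    for (std::string::const_iterator it = str.begin(); it != str.end(); ++it)
    {
        uint8_t ch = *it;
        if (ch < 0x80) {
            // ASCII: copy unchanged
            strOut.push_back(static_cast<char>(ch));
        }
        else {
            // 0x80-0xFF: encode as 110000xx 10xxxxxx
            strOut.push_back(static_cast<char>(0xc0 | (ch >> 6)));
            strOut.push_back(static_cast<char>(0x80 | (ch & 0x3f)));
        }
    }
    return strOut;
}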

Answer 3

Answered by Spacemoose

You can use the boost::locale library:

http://www.boost.org/doc/libs/1_49_0/libs/locale/doc/html/charset_handling.html

The code would look like this:

#include <boost/locale.hpp>
#include <string>

// to_utf lives in the boost::locale::conv namespace
std::string utf8_string = boost::locale::conv::to_utf<char>(latin1_string, "Latin1");

Answer 4

Answered by cytrinox

The C++03 standard does not provide functions to directly convert between specific charsets.

Depending on your OS, you can use iconv() on Linux, or MultiByteToWideChar() and its related functions on Windows. The open-source ICU library offers broad support for string conversion.
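As a rough illustration of the iconv() route, a wrapper might look like the sketch below. The function name and the error handling are assumptions for this example. It also assumes the glibc-style prototype in which the input pointer is `char **`; some older platforms declare it as `const char **`.

```cpp
#include <iconv.h>
#include <stdexcept>
#include <string>

// Hypothetical wrapper (the name is illustrative).
// Converts ISO-8859-1 to UTF-8 using POSIX iconv.
std::string latin1_to_utf8_iconv(const std::string &in)
{
    iconv_t cd = iconv_open("UTF-8", "ISO-8859-1");   // (tocode, fromcode)
    if (cd == (iconv_t)-1)
        throw std::runtime_error("iconv_open failed");

    std::string out(in.size() * 2, '\0');   // worst case: two bytes per input byte
    char *src = const_cast<char *>(in.data());   // iconv does not modify the input
    char *dst = &out[0];
    size_t src_left = in.size();
    size_t dst_left = out.size();

    size_t rc = iconv(cd, &src, &src_left, &dst, &dst_left);
    iconv_close(cd);
    if (rc == (size_t)-1)
        throw std::runtime_error("iconv failed");

    out.resize(out.size() - dst_left);      // trim the unused tail
    return out;
}
```

On some systems iconv is a separate library and the program must be linked with -liconv.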

Answer 5

Answered by RBerteig

The Unicode folks have some tables that might help if you are faced with Windows-1252 instead of true ISO-8859-1. The definitive one seems to be this one, which maps every code point in CP1252 to a code point in Unicode. Encoding the Unicode as UTF-8 is a straightforward exercise.

It would not be difficult to parse that table directly and form a lookup table from it at compile time.
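A minimal sketch of such a lookup table follows. It is written out by hand rather than generated from the mapping file, so it should be checked against that file before use. Only bytes 0x80 to 0x9F differ from ISO-8859-1, so the table covers just that range. Mapping the five bytes that CP1252 leaves undefined to the C1 control codes of the same value is an assumption of this sketch, and the function name is hypothetical.

```cpp
#include <string>

// Unicode code points for CP1252 bytes 0x80..0x9F. The five bytes that CP1252
// leaves undefined (0x81, 0x8D, 0x8F, 0x90, 0x9D) are mapped to the C1 control
// with the same value; that choice is an assumption of this sketch.
static const unsigned short cp1252_high[32] = {
    0x20AC, 0x0081, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008D, 0x017D, 0x008F,
    0x0090, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x009D, 0x017E, 0x0178
};

// Hypothetical helper (the name is illustrative).
std::string cp1252_to_utf8(const std::string &in)
{
    std::string out;
    out.reserve(in.size() * 3);           // worst case: three bytes per input byte
    for (std::string::size_type i = 0; i < in.size(); ++i) {
        unsigned char b = static_cast<unsigned char>(in[i]);
        // Outside 0x80..0x9F the byte value is already the code point.
        unsigned cp = (b >= 0x80 && b <= 0x9F) ? cp1252_high[b - 0x80] : b;
        if (cp < 0x80) {
            out.push_back(static_cast<char>(cp));
        } else if (cp < 0x800) {
            out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else {
            out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
    }
    return out;
}
```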

Answer 6

Answered by Cheers and hth. - Alf

ISO-8859-1 to UTF-8 involves nothing more than the encoding algorithm because ISO-8859-1 is a subset of Unicode. So you already have the Unicode code points. Check Wikipedia for the algorithm.
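For reference, the general UTF-8 encoding step might be sketched as below. Both function names are hypothetical, and the encoder assumes its argument is a valid code point (at most 0x10FFFF and not a surrogate). ISO-8859-1 input only ever reaches the first two branches.

```cpp
#include <string>

// Hypothetical helper (the name is illustrative): appends the UTF-8 encoding
// of one Unicode code point to out.
// Assumes cp <= 0x10FFFF and that cp is not a surrogate.
void append_utf8(std::string &out, unsigned long cp)
{
    if (cp < 0x80) {                      // 1 byte:  0xxxxxxx
        out.push_back(static_cast<char>(cp));
    } else if (cp < 0x800) {              // 2 bytes: 110xxxxx 10xxxxxx
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {            // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else {                              // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
}

// Every ISO-8859-1 byte value is already its Unicode code point,
// so the conversion is just the encoder applied byte by byte.
std::string latin1_to_utf8(const std::string &in)
{
    std::string out;
    for (std::string::size_type i = 0; i < in.size(); ++i)
        append_utf8(out, static_cast<unsigned char>(in[i]));
    return out;
}
```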

The C++ aspects -- integrating that with iostreams -- are much harder.

I suggest you walk around that mountain instead of trying to drill through it or climb it, that is, implement a simple string to string converter.

Cheers & hth.,
