How to use UTF-8 in C++ and convert from other encodings to UTF-8

Question

Asked by Christoph

I don't know how to solve that:

Imagine we have four websites:

A: UTF-8
B: ISO-8859-1
C: ASCII
D: UTF-16

My program, written in C++, does the following: it downloads a website and parses it. But it has to understand the content. My problem is not the parsing, which is done with ASCII characters like ">" or "<".

The problem is that the program should extract all words from the website's text. A word is any combination of alphanumeric characters. I then send these words to a server. The database and the web frontend use UTF-8. So my questions are:

How can I convert "any" (or the most used) character encoding to UTF-8?
How can I work with UTF-8 strings in C++? I think wchar_t does not work because it is 2 bytes long. Code points in UTF-8 are up to 4 bytes long...
Are there functions like isspace(), isalnum(), strlen(), tolower() for such UTF-8 strings?

Please note: I do not do any output (like std::cout) in C++. I just filter out the words and send them to the server.

I know about UTF8-CPP, but it has no is*() functions. And from what I have read, it does not convert from other character encodings to UTF-8, only from UTF-* to UTF-8.

Edit: I forgot to say, that the program has to be portable: Windows, Linux, ...

Answer 1

Answered by DevSolar

How can I convert "any" (or the most used) character encoding to UTF-8?

ICU (International Components for Unicode) is the solution here. It is generally considered to be the last word in Unicode support. Even Boost.Locale and Boost.Regex use it when it comes to Unicode. See my comment on Dory Zidon's answer as to why I recommend using ICU directly, instead of wrappers (like Boost).

You create a converter for a given encoding...

#include <unicode/ucnv.h>

UErrorCode err = U_ZERO_ERROR;
// Open a converter for the source encoding (here: ISO-8859-1).
UConverter * converter = ucnv_open( "ISO-8859-1", &err );
if ( U_SUCCESS( err ) )
{
    // ... use the converter ...
    ucnv_close( converter );
}

...and then use the UnicodeString class as appropriate.

I think wchar_t does not work because it is 2 bytes long.

The size of wchar_t is implementation-defined. AFAICR, on Windows it is 2 bytes (UCS-2 / UTF-16, depending on the Windows version), on Linux it is 4 bytes (UTF-32). In any case, since the standard doesn't define Unicode semantics for wchar_t, using it is non-portable guesswork. Don't guess, use ICU.

Are there functions like isspace(), isalnum(), strlen(), tolower() for such UTF-8-strings?

Not for strings in their UTF-8 encoding, but you don't use that internally anyway. UTF-8 is good for external representation, but internally UTF-16 or UTF-32 are the better choice. The above-mentioned functions do exist for Unicode code points (i.e., UChar32), for example u_isspace(), u_isalnum(), and u_tolower(); see uchar.h.
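
To illustrate what "UTF-8 externally, UTF-32 internally" can look like without any library, here is a minimal sketch of a decoder using only the standard library. The function name utf8_to_utf32 is made up for this example and is not part of ICU or any other library; ICU's own conversion and UnicodeString facilities do much more (normalization, other encodings, error callbacks).

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: decode a UTF-8 byte string into UTF-32 code points.
// Throws std::runtime_error on malformed input.
std::u32string utf8_to_utf32(const std::string& in) {
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::runtime_error("invalid UTF-8 lead byte");

        if (in.size() - i - 1 < extra)
            throw std::runtime_error("truncated UTF-8 sequence");

        for (std::size_t k = 1; k <= extra; ++k) {
            unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::runtime_error("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }

        // Reject overlong forms, surrogates, and values beyond U+10FFFF.
        static const char32_t min_for_len[] = {0x0, 0x80, 0x800, 0x10000};
        if (cp < min_for_len[extra] || cp > 0x10FFFF ||
            (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::runtime_error("invalid code point");

        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}
```

Each element of the result is one code point, so it can be passed to per-code-point classification functions such as the ones in uchar.h.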

Please note: I do not do any output(like std::cout) in C++. Just filtering out the words and send them to the server.

Check out ICU's BreakIterator, which can find word boundaries in text.

Edit: I forgot to say, that the program has to be portable: Windows, Linux, ...

In case I haven't said it already: do use ICU, and save yourself tons of trouble. Even if it might seem a bit heavyweight at first glance, it is the best implementation out there, it is extremely portable (I use it on Windows, Linux, and AIX myself), and you will use it again and again in projects to come, so time invested in learning its API is not wasted.

Answer 2

Answered by Dory Zidon

Not sure if this will give you everything you're looking for, but it might help a little. Have you tried looking at:

1) The Boost.Locale library? Boost.Locale was released in Boost 1.48 (November 15th, 2011), making it easier to convert to and from UTF-8/UTF-16.

Here are some convenient examples from the docs:

string utf8_string = to_utf<char>(latin1_string,"Latin1");
wstring wide_string = to_utf<wchar_t>(latin1_string,"Latin1");
string latin1_string = from_utf(wide_string,"Latin1");
string utf8_string2 = utf_to_utf<char>(wide_string);

2) Or at the conversions that are part of C++11?

#include <codecvt>
#include <locale>
#include <string>
#include <cassert>

int main() {
  std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> convert;
  std::string utf8 = convert.to_bytes(0x5e9);
  assert(utf8.length() == 2);
  assert(utf8[0] == '\xD7');
  assert(utf8[1] == '\xA9');
}
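
Note that std::wstring_convert and the <codecvt> facets were deprecated in C++17. If that is a concern, encoding a single code point by hand is short. The following is only a sketch using the standard library; encode_utf8 is a name invented for this example.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical helper: encode one Unicode code point as UTF-8.
std::string encode_utf8(char32_t cp) {
    // Surrogates and values above U+10FFFF are not valid scalar values.
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        throw std::invalid_argument("not a Unicode scalar value");

    std::string out;
    if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For U+05E9, the code point used in the example above, this yields the same two bytes, 0xD7 0xA9.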

Answer 3

Answered by Jakob Riedle

How can I work with UTF-8-strings in C++? I think wchar_t does not work because it is 2 bytes long. Code-Points in UTF-8 are up to 4 bytes long...

This is easy: there is a project named tinyutf8, which is a drop-in replacement for std::string / std::wstring.

Then the user can elegantly operate on codepoints, while their representation is always encoded in chars.

How can I convert "any" (or the most used) character encoding to UTF-8?

You might want to have a look at std::codecvt_utf8 and similar templates from <codecvt> (C++11).

Answer 4

Answered by Joop Eggen

UTF-8 is an encoding that uses multiple bytes for each non-ASCII character, all of them with the 8th (high) bit set; ASCII itself is a 7-bit code. As such you won't find ASCII characters like '\' or '/' inside a multi-byte sequence. And isdigit works (though not for Arabic and other non-ASCII digits).

It is a superset of ASCII and can hold all Unicode characters, so it can definitely be used with char and std::string.
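
Because continuation bytes always have the bit pattern 10xxxxxx, a code-point-aware analogue of strlen() is easy to sketch on top of std::string. The name utf8_length is made up for this example, and the function assumes the input is already well-formed UTF-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: number of code points in a well-formed UTF-8 string.
// Every byte that is NOT a continuation byte (10xxxxxx) starts a code point.
std::size_t utf8_length(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80)
            ++count;
    }
    return count;
}
```

The same property is what makes byte-wise searching for ASCII delimiters such as '<' or '>' with std::string::find safe on UTF-8 data.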

Inspect the HTTP headers (header names are case-insensitive); they are in ISO-8859-1 and are followed by an empty line and then the HTML content.

Content-Type: text/html; charset=UTF-8

If it is not present, there might also be

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<meta charset="UTF-8">      <!-- HTML5 -->
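
Pulling the charset name out of either the header line or a meta tag could be sketched as follows. extract_charset is a hypothetical helper written for this example, not a library function, and it is deliberately simplistic: it looks for the first "charset=" and does not handle whitespace around the equals sign.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical helper: return the lower-cased charset named in a Content-Type
// header line or a <meta> tag, or an empty string if none is found.
std::string extract_charset(std::string text) {
    // Lower-case everything so the search is case-insensitive.
    std::transform(text.begin(), text.end(), text.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });

    const std::string key = "charset=";
    std::string::size_type pos = text.find(key);
    if (pos == std::string::npos)
        return "";
    pos += key.size();

    // Skip an optional opening quote, as in <meta charset="UTF-8">.
    if (pos < text.size() && (text[pos] == '"' || text[pos] == '\''))
        ++pos;

    // The charset name runs until the first character that cannot be part of it.
    std::string::size_type end = pos;
    while (end < text.size()) {
        unsigned char c = static_cast<unsigned char>(text[end]);
        if (!(std::isalnum(c) || c == '-' || c == '_' || c == '.' || c == ':'))
            break;
        ++end;
    }
    return text.substr(pos, end - pos);
}
```

Calling it on the header line first and falling back to the start of the document when it returns an empty string mirrors the order described above.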

ISO-8859-1 is Latin-1, and you might do better to convert from Windows-1252 instead, the Windows extension of Latin-1 that uses 0x80 - 0x9F for some special characters such as curly (comma-like) quotes. Even browsers on macOS will interpret these bytes that way when ISO-8859-1 was specified.

Conversion libraries: already mentioned by @syam.

Conversion

Let's not consider UTF-16. One can read the headers and the start of the document, up to a meta statement for the charset, as single-byte chars.

The conversion from a single-byte encoding to UTF-8 can happen via a table, for instance one generated with Java: a const char* table[] indexed by the char (as an unsigned byte value).

table[157] = "\xEF\xBF\xBD";


import java.io.UnsupportedEncodingException;

public class TableGenerator {
    public static void main(String[] args) {
        final String SOURCE_ENCODING = "windows-1252";
        byte[] sourceBytes = new byte[1];
        System.out.println("    const char* table[] = {");
        for (int c = 0; c < 256; ++c) {
            String comment = "";
            System.out.printf("       /* %3d */ \"", c);
            if (32 <= c && c < 127) {
                // Pure ASCII: escape quote and backslash for the C string literal.
                if (c == '\"' || c == '\\')
                    System.out.print("\\");
                System.out.print((char) c);
            } else {
                if (c == 0) {
                    comment = " // Unusable";
                }
                sourceBytes[0] = (byte) c;
                try {
                    // Decode the single byte, then re-encode it as UTF-8.
                    byte[] targetBytes = new String(sourceBytes, SOURCE_ENCODING).getBytes("UTF-8");
                    for (int j = 0; j < targetBytes.length; ++j) {
                        int b = targetBytes[j] & 0xFF;
                        System.out.printf("\\x%02X", b);
                    }
                } catch (UnsupportedEncodingException ex) {
                    comment = " // " + ex.getMessage().replaceAll("\\s+", " "); // No newlines.
                }
            }
            System.out.print("\"");
            if (c < 255) {
                System.out.print(",");
            }
            // Emit the trailing comment (if any) after the table entry.
            System.out.println(comment);
        }
        System.out.println("    };");
    }
}
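
For plain ISO-8859-1 (not Windows-1252) the table can be computed instead of generated, because each byte value equals the Unicode code point of the character. The following C++ sketch relies on that; latin1_to_utf8 is a name made up for this example. For Windows-1252 the bytes 0x80 - 0x9F map to other code points, so a table is still needed there.

```cpp
#include <string>

// Hypothetical helper: convert an ISO-8859-1 byte string to UTF-8.
// Assumes the input really is ISO-8859-1, where byte value == code point.
std::string latin1_to_utf8(const std::string& in) {
    std::string out;
    out.reserve(in.size() * 2);  // worst case: every byte becomes two bytes
    for (unsigned char b : in) {
        if (b < 0x80) {
            // ASCII range is identical in UTF-8.
            out += static_cast<char>(b);
        } else {
            // U+0080..U+00FF encode as two bytes: 110000xx 10xxxxxx.
            out += static_cast<char>(0xC0 | (b >> 6));
            out += static_cast<char>(0x80 | (b & 0x3F));
        }
    }
    return out;
}
```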
