C++: correctly reading a UTF-16 text file into a string without external libraries?

Question

Asked by neminem

I've been using StackOverflow since the beginning, and have on occasion been tempted to post questions, but I've always either figured them out myself or found answers posted eventually... until now. This feels like it should be fairly simple, but I've been wandering around the internet for hours with no success, so I turn here:

I have a pretty standard UTF-16 text file, with a mixture of English and Chinese characters. I would like those characters to end up in a string (technically, a wstring). I've seen a lot of related questions answered (here and elsewhere), but they're either looking to solve the much harder problem of reading arbitrary files without knowing the encoding, or converting between encodings, or are just generally confused about "Unicode" being a range of encodings. I know the source of the text file I'm trying to read, it will always be UTF-16, it has a BOM and everything, and it can stay that way.

I had been using the solution described here, which worked for text files that were all English, but after encountering certain characters, it stopped reading the file. The only other suggestion I found was to use ICU, which would probably work, but I'd really rather not include a whole large library in an application for distribution, just to read one text file in one place. I don't care about system independence, though - I only need it to compile and work in Windows. A solution that didn't rely on that fact would be prettier, of course, but I would be just as happy with a solution that used the STL while relying on assumptions about Windows architecture, or even solutions that involved Win32 functions, or ATL; I just don't want to have to include another large 3rd-party library like ICU. Am I still totally out of luck unless I want to reimplement it all myself?

edit: I'm stuck using VS2008 for this particular project, so C++11 code sadly won't help.

edit 2: I realized that the code I had been borrowing before didn't fail on non-English characters like I thought it was doing. Rather, it fails on specific characters in my test document, among them '：' (FULLWIDTH COLON, U+FF1A) and '）' (FULLWIDTH RIGHT PARENTHESIS, U+FF09). bames53's posted solution also mostly works, but is stumped by those same characters?

edit 3 (and the answer!): the original code I had been using -did- mostly work - as bames53 helped me discover, the ifstream just needed to be opened in binary mode for it to work.

Answer 1

Answered by Cubbi

The C++11 solution (supported, on your platform, by Visual Studio since 2010, as far as I know), would be:

#include <fstream>
#include <iostream>
#include <locale>
#include <codecvt>

int main()
{
    // open as a byte stream
    std::wifstream fin("text.txt", std::ios::binary);
    // apply BOM-sensitive UTF-16 facet
    fin.imbue(std::locale(fin.getloc(),
       new std::codecvt_utf16<wchar_t, 0x10ffff, std::consume_header>));
    // read one wide character at a time and print its numeric value in hex
    for(wchar_t c; fin.get(c); )
        std::cout << std::showbase << std::hex << static_cast<unsigned long>(c) << '\n';
}

Answer 2

Answered by Mark Ransom

When you open a file for UTF-16, you must open it in binary mode. This is because in text mode, certain characters are interpreted specially - specifically, 0x0d is filtered out completely and 0x1a marks the end of the file. There are some UTF-16 characters that will have one of those bytes as half of the character code and will mess up the reading of the file. This is not a bug, it is intentional behavior and is the sole reason for having separate text and binary modes.
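
To make this concrete, here is a small sketch (the helper name is made up for illustration, it is not part of any library) that checks whether a UTF-16 code unit contains the 0x1a byte. U+FF1A, the fullwidth colon from the question, is stored as the bytes 1A FF in little-endian UTF-16, so a text-mode read stops right there.

```cpp
#include <cstdint>

// Hypothetical helper: true if either byte of a UTF-16 code unit equals 0x1A,
// the byte that Windows text mode treats as end-of-file (Ctrl-Z).
bool has_ctrl_z_byte(std::uint16_t unit)
{
    const std::uint8_t low  = static_cast<std::uint8_t>(unit & 0xFF);
    const std::uint8_t high = static_cast<std::uint8_t>(unit >> 8);
    return low == 0x1A || high == 0x1A;
}
```

Calling has_ctrl_z_byte(0xFF1A) returns true, while a plain ASCII letter such as 0x0041 returns false, which matches the observation that all-English files read fine.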

For the reason why 0x1a is considered the end of a file, see this blog post from Raymond Chen tracing the history of Ctrl-Z. It's basically backwards compatibility run amok.

Answer 3

Answered by bames53

Edit:

So it appears that the issue was that Windows treats certain magic byte sequences as the end of the file in text mode. This is solved by using binary mode to read the file, std::ifstream fin("filename", std::ios::binary);, and then copying the data into a wstring as you already do.

The simplest, non-portable solution would be to just copy the file data into a wchar_t array. This relies on the fact that wchar_t on Windows is 2 bytes and uses UTF-16 as its encoding.

You'll have a bit of difficulty converting UTF-16 to the locale specific wchar_t encoding in a completely portable fashion.

Here's the Unicode conversion functionality available in the standard C++ library (though VS 10 and 11 implement only items 3, 4, and 5):

1. codecvt<char32_t,char,mbstate_t>
2. codecvt<char16_t,char,mbstate_t>
3. codecvt_utf8
4. codecvt_utf16
5. codecvt_utf8_utf16
6. c32rtomb/mbrtoc32
7. c16rtomb/mbrtoc16

And what each one does

1. A codecvt facet that always converts between UTF-8 and UTF-32
2. converts between UTF-8 and UTF-16
3. converts between UTF-8 and UCS-2 or UCS-4 depending on the size of target element (characters outside BMP are probably truncated)
4. converts between a sequence of chars using a UTF-16 encoding scheme and UCS-2 or UCS-4
5. converts between UTF-8 and UTF-16
6. If the macro __STDC_UTF_32__ is defined these functions convert between the current locale's char encoding and UTF-32
7. If the macro __STDC_UTF_16__ is defined these functions convert between the current locale's char encoding and UTF-16

If __STDC_ISO_10646__ is defined then converting directly using codecvt_utf16<wchar_t> should be fine since that macro indicates that wchar_t values in all locales correspond to the short names of Unicode characters (and so implies that wchar_t is large enough to hold any such value).

Unfortunately there's nothing defined that goes directly from UTF-16 to wchar_t. It's possible to go UTF-16 -> UCS-4 -> mb (if __STDC_UTF_32__) -> wc, but you'll lose anything that's not representable in the locale's multi-byte encoding. And of course no matter what, converting from UTF-16 to wchar_t will lose anything not representable in the locale's wchar_t encoding.
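
For illustration only, the first hop of that chain (UTF-16 -> UCS-4) can also be written by hand. The sketch below is a hypothetical helper, not a standard library facility; it combines surrogate pairs into code points and, as an assumption of this sketch, substitutes U+FFFD for any unpaired surrogate.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: decode UTF-16 code units into Unicode code points.
// Unpaired surrogates become U+FFFD (a choice made for this sketch).
std::vector<std::uint32_t> utf16_to_code_points(const std::vector<std::uint16_t>& units)
{
    std::vector<std::uint32_t> out;
    for (std::size_t i = 0; i < units.size(); ++i)
    {
        const std::uint32_t u = units[i];
        const bool is_high = (u >= 0xD800 && u <= 0xDBFF);
        if (is_high && i + 1 < units.size()
            && units[i + 1] >= 0xDC00 && units[i + 1] <= 0xDFFF)
        {
            // Valid surrogate pair: combine into one code point above the BMP.
            const std::uint32_t low = units[i + 1];
            out.push_back(0x10000 + ((u - 0xD800) << 10) + (low - 0xDC00));
            ++i; // the low surrogate has been consumed
        }
        else if (u >= 0xD800 && u <= 0xDFFF)
        {
            out.push_back(0xFFFD); // unpaired surrogate
        }
        else
        {
            out.push_back(u); // BMP character, value is the code point itself
        }
    }
    return out;
}
```

The remaining hops (code points to the locale's multi-byte encoding and then to wchar_t) are where the losses described above come in, so this sketch deliberately stops at code points.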

So it's probably not worth being portable, and instead you can just read the data into a wchar_t array, or use some other Windows-specific facility, such as the _O_U16TEXT mode on files.

This should build and run anywhere, but makes a bunch of assumptions to actually work:

#include <cstring>
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>

int main ()
{
    std::stringstream ss;
    std::ifstream fin("filename", std::ios::binary); // binary mode, so 0x1a and newlines are not touched
    ss << fin.rdbuf(); // dump file contents into a stringstream
    std::string const &s = ss.str();
    if (s.size()%sizeof(wchar_t) != 0)
    {
        std::cerr << "file not the right size\n"; // must be a whole number of wchar_t units (two bytes each on Windows)
        return 1;
    }
    std::wstring ws;
    ws.resize(s.size()/sizeof(wchar_t));
    if (!ws.empty())
        std::memcpy(&ws[0],s.data(),s.size()); // copy data into wstring
}

You should probably at least add code to handle endianness and the BOM. Also, Windows newlines don't get converted automatically, so you need to do that manually.
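
A minimal sketch of those extra steps might look like the following. The function name is hypothetical, and the sketch assumes one UTF-16 code unit per wchar_t (true on Windows; where wchar_t is 32 bits, surrogate pairs are left uncombined). It also assumes little-endian data when no BOM is present, which is a guess rather than something the file format guarantees.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: turn raw UTF-16 bytes (read in binary mode) into a wstring.
// - Skips a leading BOM and uses it to pick the byte order.
// - Assumes little-endian when there is no BOM (an assumption of this sketch).
// - Collapses CR LF into LF, since binary mode does no newline translation.
// - Silently ignores a trailing odd byte.
std::wstring utf16_bytes_to_wstring(const std::string& bytes)
{
    std::size_t pos = 0;
    bool big_endian = false;
    if (bytes.size() >= 2)
    {
        const unsigned char b0 = static_cast<unsigned char>(bytes[0]);
        const unsigned char b1 = static_cast<unsigned char>(bytes[1]);
        if (b0 == 0xFF && b1 == 0xFE)      { pos = 2; }                    // UTF-16LE BOM
        else if (b0 == 0xFE && b1 == 0xFF) { pos = 2; big_endian = true; } // UTF-16BE BOM
    }

    std::wstring out;
    for (; pos + 1 < bytes.size(); pos += 2)
    {
        const unsigned int first  = static_cast<unsigned char>(bytes[pos]);
        const unsigned int second = static_cast<unsigned char>(bytes[pos + 1]);
        const unsigned int unit = big_endian ? ((first << 8) | second)
                                             : ((second << 8) | first);
        if (unit == 0x000A && !out.empty() && out[out.size() - 1] == L'\r')
            out[out.size() - 1] = L'\n';   // replace the pending CR with LF
        else
            out.push_back(static_cast<wchar_t>(unit));
    }
    return out;
}
```

The input would be the std::string filled from the binary-mode ifstream as in the program above; the result replaces the memcpy step.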
