Note: this page reproduces a popular Stack Overflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it under the same license, but you must attribute it to the original authors and cite the original address: http://stackoverflow.com/questions/6089231/

Getting std::ifstream to handle LF, CR, and CRLF?

Tags: c++, ifstream, newline

Asked by Aaron McDaid

Specifically, I'm interested in this overload: istream& getline ( istream& is, string& str ). Is there an option to the ifstream constructor that tells it to convert all newline encodings to '\n' under the hood? I want to be able to call getline and have it gracefully handle all line endings.

Update: To clarify, I want to be able to write code that compiles almost anywhere and takes input from almost anywhere, including the rare files that have '\r' without '\n', while minimizing inconvenience for any users of the software.

It's easy to work around the issue, but I'm still curious about the right way, in the standard, to flexibly handle all text file formats.

getline reads a full line, up to a '\n', into a string. The '\n' is consumed from the stream, but getline doesn't include it in the string. That's fine so far, but there might be a '\r' just before the '\n', and that '\r' does get included in the string.

There are three types of line endings seen in text files: '\n' is the conventional ending on Unix machines, '\r' was (I think) used on old Mac operating systems, and Windows uses a pair, '\r' followed by '\n'.

The problem is that getline leaves the '\r' on the end of the string.

ifstream f("a_text_file_of_unknown_origin");
string line;
getline(f, line);
if(!f.fail()) { // a line was read (it may be empty)
   // BUT, there might be an '\r' at the end now.
}

Edit: Thanks to Neil for pointing out that f.good() isn't what I wanted. !f.fail() is what I want.

I can remove it manually myself (see the edit to this question), which is easy for Windows text files. But I'm worried that somebody will feed in a file whose only line ending is '\r'. In that case, I presume getline will consume the whole file, thinking that it is a single line!
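
The worry about '\r'-only input can be checked with a small sketch. countGetlineCalls is a hypothetical helper name, not something from the question; it simply counts how many times std::getline succeeds on an in-memory string, on the assumption that an istringstream behaves like a file opened without newline translation.

```cpp
#include <cstddef>
#include <sstream>
#include <string>

// Hypothetical helper: counts how many lines std::getline reports
// for the given text when '\n' is the only delimiter it knows about.
std::size_t countGetlineCalls(const std::string& text)
{
    std::istringstream in(text);
    std::string line;
    std::size_t n = 0;
    while (std::getline(in, line))
        ++n;
    return n;
}
```

Called on "a\nb\nc\n" or "a\r\nb\r\nc\r\n" it should report three lines, while "a\rb\rc\r" comes back as a single line, which matches the presumption above.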

.. and that's not even considering Unicode :-)

.. maybe Boost has a nice way to consume one line at a time from any text-file type?

Edit: I'm using this to handle the Windows files, but I still feel I shouldn't have to! And this won't work for the '\r'-only files.

if(!line.empty() && *line.rbegin() == '\r') {
    line.erase( line.length()-1, 1);
}

Answered by Johan Råde

As Neil pointed out, "the C++ runtime should deal correctly with whatever the line ending convention is for your particular platform."

However, people do move text files between different platforms, so that is not good enough. Here is a function that handles all three line endings ("\r", "\n" and "\r\n"):

#include <istream>
#include <streambuf>
#include <string>

std::istream& safeGetline(std::istream& is, std::string& t)
{
    t.clear();

    // The characters in the stream are read one-by-one using a std::streambuf.
    // That is faster than reading them one-by-one using the std::istream.
    // Code that uses streambuf this way must be guarded by a sentry object.
    // The sentry object performs various tasks,
    // such as thread synchronization and updating the stream state.

    std::istream::sentry se(is, true);
    std::streambuf* sb = is.rdbuf();

    for(;;) {
        int c = sb->sbumpc();
        switch (c) {
        case '\n':
            return is;
        case '\r':
            if(sb->sgetc() == '\n')
                sb->sbumpc();
            return is;
        case std::streambuf::traits_type::eof():
            // Also handle the case when the last line has no line ending
            if(t.empty())
                is.setstate(std::ios::eofbit);
            return is;
        default:
            t += (char)c;
        }
    }
}

And here is a test program:

#include <cstdlib>
#include <fstream>
#include <iostream>
#include <string>

int main()
{
    std::string path = ...  // insert path to test file here

    std::ifstream ifs(path.c_str());
    if(!ifs) {
        std::cout << "Failed to open the file." << std::endl;
        return EXIT_FAILURE;
    }

    int n = 0;
    std::string t;
    while(!safeGetline(ifs, t).eof())
        ++n;
    std::cout << "The file contains " << n << " lines." << std::endl;
    return EXIT_SUCCESS;
}
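
For comparison, here is a sketch of the same idea written only against the std::istream interface (get and peek) instead of the streambuf. The names simpleSafeGetline and readAllLines are hypothetical, and this variant is probably slower than safeGetline above; it is meant only to show the line-ending logic in a form that can be used in a plain while loop, the way std::getline is.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical variant: reads one line ended by "\n", "\r\n", or "\r".
// The returned stream tests false only when no further line could be read.
std::istream& simpleSafeGetline(std::istream& is, std::string& t)
{
    t.clear();
    char c;
    while (is.get(c)) {
        if (c == '\n')
            return is;
        if (c == '\r') {
            // Treat "\r\n" as a single line ending by swallowing the '\n'.
            if (is.peek() == '\n')
                is.get(c);
            return is;
        }
        t += c;
    }
    // get() hit end of input. If characters were collected, report them as a
    // final unterminated line: keep eofbit but drop failbit.
    if (!t.empty())
        is.clear(std::ios::eofbit);
    return is;
}

// Hypothetical helper: collects every line of a stream into a vector.
std::vector<std::string> readAllLines(std::istream& is)
{
    std::vector<std::string> lines;
    std::string line;
    while (simpleSafeGetline(is, line))
        lines.push_back(line);
    return lines;
}
```

With this shape the caller does not need the .eof() test used in the test program; the loop condition is the stream itself.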

Answered by Aaron McDaid

The C++ runtime should deal correctly with whatever the endline convention is for your particular platform. Specifically, this code should work on all platforms:

#include <string>
#include <iostream>
using namespace std;

int main() {
    string line;
    while( getline( cin, line ) ) {
        cout << line << endl;
    }
}

Of course, if you are dealing with files from another platform, all bets are off.

As the two most common platforms (Linux and Windows) both terminate lines with a newline character, with Windows preceding it with a carriage return, you can examine the last character of the line string in the above code to see if it is \r and, if so, remove it before doing your application-specific processing.

For example, you could provide yourself with a getline-style function that looks something like this (untested; indexes, substr, etc. are used for pedagogical purposes only):

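// Reads one line with getline and strips a trailing '\r', if present.
// Takes an istream: getline cannot read from an ostream.
istream & safegetline( istream & is, string & line ) {
    string myline;
    if ( getline( is, myline ) ) {
       if ( myline.size() && myline[myline.size()-1] == '\r' ) {
           line = myline.substr( 0, myline.size() - 1 );
       }
       else {
           line = myline;
       }
    }
    return is;
}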

Answered by Danilo J. Bonsignore

Are you reading the file in BINARY or in TEXT mode? In TEXT mode the carriage return/line feed pair, CRLF, is interpreted as the TEXT end of line, or end-of-line character, but in BINARY mode you fetch only ONE byte at a time, which means that each of the two characters is left untranslated in the buffer, to be fetched as a separate byte!

On a typewriter, carriage return means that the carriage, where the printing arm sits, has reached the right edge of the paper and is returned to the left edge. This is a very mechanical model, that of the mechanical typewriter. The line feed then means that the paper roll is rotated up a little so that the paper is in position to begin another line of typing. As far as I remember, one of the low codes in ASCII means move one character to the right without typing (the dead char), and of course \b means backspace: move the carriage one character back. That way you can add special effects, like underlining (type underscore), strikethrough (type minus), approximating different accents, or cancelling out (type X), without needing an extended keyboard, just by adjusting the position of the carriage along the line before entering the line feed. So you can use byte-sized ASCII voltages to control a typewriter automatically without a computer in between.

When the automatic typewriter was introduced, AUTOMATIC meant that once you reached the farthest edge of the paper, the carriage was returned to the left AND the line feed applied; that is, the carriage is assumed to return automatically as the roll moves up! So you do not need both control characters, only one: \n, the new line or line feed.

This has nothing to do with programming, but ASCII is older and, HEY, it looks like some people were not thinking when they began doing text things! The UNIX platform assumes an electrical automatic typewriter; the Windows model is more complete and allows for control of mechanical machines, though some control characters become less and less useful in computers, like the bell character, 0x07 if I remember well... Some forgotten texts must originally have been captured with control characters for electrically controlled typewriters, and that perpetuated the model...

Actually, the correct variation would be to read up to the \r (carriage return) and then skip a following \n (line feed) if there is one, hence:

ifstream is;
is.open("",ios::binary);
...
is.getline(buffer, bufsize, '\r');

// skip the '\n' that follows the '\r', if there is one;
// peek() leaves any other character (or end of file) untouched
if (is.peek() == '\n') is.ignore();
...

would be the most correct way to handle all types of files. Note, however, that on Windows \n in TEXT mode is actually the byte pair 0x0d 0x0a, but 0x0d IS just \r: \n includes \r in TEXT mode but not in BINARY, so \n and \r\n are equivalent... or should be. This is actually a very basic industry confusion, typical industry inertia: the convention is to speak of CRLF on ALL platforms and then fall into different binary interpretations. Strictly speaking, files that use ONLY 0x0d (carriage return) as \n (CRLF or line feed) are malformed in TEXT mode (on a typewriter you would just return the carriage and strike through everything...) and are a non-line-oriented binary format (either \r or \r\n meaning line oriented), so you are not supposed to read them as text! The code ought to fail, maybe with some user message. This depends not only on the OS but also on the C library implementation, which adds to the confusion and the possible variations (particularly with transparent UNICODE translation layers, which add another point of articulation for confusing variations).

The problem with the previous code snippet (mechanical typewriter) is that it is very inefficient if there are no \n characters after \r (automatic typewriter text). It also assumes BINARY mode, where the C library is forced to ignore text interpretations (locale) and hand over the raw bytes. There should be no difference in the actual text characters between the two modes, only in the control characters, so generally speaking reading in BINARY mode is better than in TEXT mode. This solution is efficient for typical Windows text files read in BINARY mode, independently of C library variations, and inefficient for other platforms' text formats (including web translations into text). If you care about efficiency, the way to go is to use a function pointer: test for \r versus \r\n line controls however you like, then store the best getline user code in the pointer and invoke it through the pointer.
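
The "test for the line controls" step could look something like the sketch below. Everything here is an assumption made for illustration: the enum, the name detectLineEnding, and the choice to classify a file by the first line break found in a sample of its bytes read in binary mode.

```cpp
#include <string>

// Hypothetical classification of line-ending conventions.
enum LineEnding { EndingNone, EndingLF, EndingCR, EndingCRLF };

// Hypothetical helper: classifies a sample by its first line break.
// Caveat: if the sample is cut off right after a '\r', a "\r\n" pair
// split at the boundary is reported as EndingCR.
LineEnding detectLineEnding(const std::string& sample)
{
    const std::string::size_type pos = sample.find_first_of("\r\n");
    if (pos == std::string::npos)
        return EndingNone;
    if (sample[pos] == '\n')
        return EndingLF;
    if (pos + 1 < sample.size() && sample[pos + 1] == '\n')
        return EndingCRLF;
    return EndingCR;
}
```

The result could then be used once, up front, to pick which line-reading function the rest of the program calls. A file with mixed conventions would defeat this kind of one-off detection.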

Incidentally I remember I found some \r\r\n text files too... which translates into double line text just as is still required by some printed text consumers.

Answered by user2061057

One solution would be to first search for all line endings and replace them with '\n', much as Git can do when it normalizes line endings.
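
A minimal sketch of that search-and-replace step, assuming the text has already been read into a std::string without newline translation. normalizeNewlines is a hypothetical name.

```cpp
#include <string>

// Hypothetical helper: returns a copy of `in` where every "\r\n" pair
// and every lone '\r' has been replaced by a single '\n'.
std::string normalizeNewlines(const std::string& in)
{
    std::string out;
    out.reserve(in.size());
    for (std::string::size_type i = 0; i < in.size(); ++i) {
        if (in[i] == '\r') {
            // Skip the '\n' of a "\r\n" pair so it is not counted twice.
            if (i + 1 < in.size() && in[i + 1] == '\n')
                ++i;
            out += '\n';
        } else {
            out += in[i];
        }
    }
    return out;
}
```

After normalization, the result can be wrapped in a std::istringstream and read with plain std::getline.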

Answered by user2061057

Other than writing your own custom handler or using an external library, you are out of luck. The easiest thing to do is to check that line[line.length() - 1] is not '\r' (for a non-empty line). On Linux, this is superfluous, as most lines will end with '\n', meaning you'll lose a fair bit of time if this is in a loop. On Windows, this is also superfluous. However, what about classic Mac files, whose lines end in '\r'? std::getline would not work for those files on Linux or Windows, because it only splits on the '\n' that ends both '\n' and '\r\n' lines and never treats a lone '\r' as a line break. Obviously, a task that works with those files would not work well. And then there are the numerous EBCDIC systems, something that most libraries won't dare tackle.

Checking for '\r' is probably the best solution to your problem. Reading in binary mode would allow you to check for all three common line endings ('\r', '\r\n' and '\n'). If you only care about Linux and Windows, as old-style Mac line endings shouldn't be around for much longer, split on '\n' only and remove the trailing '\r' character.
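
As a rough illustration of the binary-mode approach, the sketch below reads a whole file with std::ios::binary and splits it on any of the three endings. readLinesBinary is a hypothetical name, and loading the entire file into memory is an assumption that only suits files of modest size.

```cpp
#include <fstream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: reads `path` in binary mode and splits its contents
// on "\r\n", "\n", or "\r". Returns false if the file cannot be opened.
bool readLinesBinary(const std::string& path, std::vector<std::string>& lines)
{
    std::ifstream in(path.c_str(), std::ios::in | std::ios::binary);
    if (!in)
        return false;

    // Copy the raw bytes into a string; no newline translation happens here.
    std::ostringstream buf;
    buf << in.rdbuf();
    const std::string data = buf.str();

    lines.clear();
    std::string current;
    for (std::string::size_type i = 0; i < data.size(); ++i) {
        const char c = data[i];
        if (c == '\n' || c == '\r') {
            // A '\r' directly followed by '\n' counts as one line ending.
            if (c == '\r' && i + 1 < data.size() && data[i + 1] == '\n')
                ++i;
            lines.push_back(current);
            current.clear();
        } else {
            current += c;
        }
    }
    // Keep a final line that has no line ending.
    if (!current.empty())
        lines.push_back(current);
    return true;
}
```

A trailing line ending at the end of the file does not produce an extra empty line with this scheme, while blank lines in the middle of the file are preserved.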

Answered by Martin Thümmel

If it is known how many items/numbers each line has, one could read a line with, e.g., 4 numbers as:

string num;
is >> num >> num >> num >> num;

This also works with other line endings, because operator>> skips leading whitespace, which includes '\r' as well as '\n'.
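
A small sketch of this approach, assuming each record holds exactly four integers (the answer reads strings; int is used here only to make the result easy to inspect). readRecords is a hypothetical name.

```cpp
#include <istream>
#include <sstream>
#include <vector>

// Hypothetical helper: reads whitespace-separated integers, four per record.
// Formatted extraction skips '\r' and '\n' alike, so the line-ending
// convention of the input does not matter.
std::vector<std::vector<int> > readRecords(std::istream& is)
{
    std::vector<std::vector<int> > records;
    int a, b, c, d;
    while (is >> a >> b >> c >> d) {
        std::vector<int> rec;
        rec.push_back(a);
        rec.push_back(b);
        rec.push_back(c);
        rec.push_back(d);
        records.push_back(rec);
    }
    return records;
}
```

The trade-off is that line boundaries are not checked at all: a record split across two lines, or two records on one line, would be read without complaint.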
