Original question: http://stackoverflow.com/questions/843154/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use or share it, but you must attribute it to the original authors (not me): StackOverflow.
Fastest way to find the number of lines in a text (C++)
Asked by systemsfault
I need to read the number of lines in a file before doing some operations on that file. When I tried reading the file and incrementing a line_count variable at each iteration until I reached EOF, it was not that fast in my case. I used both ifstream and fgets; they were both slow. Is there a hacky way to do this that is also used by, for instance, BSD, the Linux kernel, or Berkeley DB (maybe by using bitwise operations)?
As I said before, there are millions of lines in that file and it keeps getting larger; each line has about 40 or 50 characters. I'm using Linux.
Note: I'm sure there will be people who might say "use a DB, idiot". But briefly, in my case I can't use a DB.
Accepted answer by systemsfault
The only way to find the line count is to read the whole file and count the number of line-end characters. The fastest way to do this is probably to read the whole file into a large buffer with one read operation and then go through the buffer counting the '\n' characters.
As your current file size appears to be about 60MB, this is not an attractive option. You can get some of the speed by not reading the whole file, but reading it in chunks, say of size 1MB. You also say that a database is out of the question, but it really does look to be the best long-term solution.
Edit: I just ran a small benchmark on this, and using the buffered approach (buffer size 1024K) seems to be a bit more than twice as fast as reading a line at a time with getline(). Here's the code - my tests were done with g++ using the -O2 optimisation level:
#include <iostream>
#include <fstream>
#include <string>
#include <vector>
#include <ctime>
using namespace std;

// Read as many bytes as fit in the buffer; return how many were actually read.
unsigned int FileRead( istream & is, vector <char> & buff ) {
    is.read( &buff[0], buff.size() );
    return is.gcount();
}

// Count the '\n' characters in the first sz bytes of the buffer.
unsigned int CountLines( const vector <char> & buff, int sz ) {
    int newlines = 0;
    const char * p = &buff[0];
    for ( int i = 0; i < sz; i++ ) {
        if ( p[i] == '\n' ) {
            newlines++;
        }
    }
    return newlines;
}

int main( int argc, char * argv[] ) {
    time_t now = time(0);
    if ( argc == 1 ) {
        cout << "lines\n";
        ifstream ifs( "lines.dat" );
        int n = 0;
        string s;
        while( getline( ifs, s ) ) {
            n++;
        }
        cout << n << endl;
    }
    else {
        cout << "buffer\n";
        const int SZ = 1024 * 1024;
        std::vector <char> buff( SZ );
        ifstream ifs( "lines.dat" );
        int n = 0;
        while( int cc = FileRead( ifs, buff ) ) {
            n += CountLines( buff, cc );
        }
        cout << n << endl;
    }
    cout << time(0) - now << endl;
}
Answered by Pete Kirkham
Don't use C++ STL strings and getline (or C's fgets), just C-style raw pointers, and either block-read in page-size chunks or mmap the file.
Then scan the block at the native word size of your system (i.e., either uint32_t or uint64_t) using one of the magic 'SIMD Within A Register' (SWAR) algorithms for testing the bytes within the word. An example is here; the loop with the 0x0a0a0a0a0a0a0a0aLL in it scans for line breaks. (That code gets to around 5 cycles per input byte while matching a regex on each line of a file.)
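As a rough illustration of the SWAR idea (a sketch, not the code the answer links to), the loop below XORs each 64-bit word with 0x0a repeated in every byte, so bytes equal to '\n' become zero, then computes an exact zero-byte mask and counts its set bits. The __builtin_popcountll intrinsic assumes GCC or Clang, which matches the g++ setup used in the accepted answer:

#include <cstdint>
#include <cstring>

// Returns a mask with bit 7 set in exactly the bytes of x that are zero.
static inline uint64_t zero_byte_mask( uint64_t x ) {
    const uint64_t k = 0x7f7f7f7f7f7f7f7fULL;
    uint64_t y = ( x & k ) + k;   // bit 7 of each byte set iff its low 7 bits are nonzero
    return ~( y | x | k );        // a byte is zero iff its low 7 bits and high bit are all clear
}

std::size_t count_newlines( const char * buf, std::size_t len ) {
    const uint64_t pattern = 0x0a0a0a0a0a0a0a0aULL;   // '\n' repeated in every byte
    std::size_t count = 0, i = 0;
    for ( ; i + 8 <= len; i += 8 ) {
        uint64_t word;
        std::memcpy( &word, buf + i, 8 );             // safe unaligned load
        count += __builtin_popcountll( zero_byte_mask( word ^ pattern ) );
    }
    for ( ; i < len; ++i )                            // handle the leftover tail bytes
        if ( buf[i] == '\n' ) ++count;
    return count;
}

This examines eight bytes per iteration instead of one, and the compiler keeps the whole inner loop in registers.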
If the file is only a few tens or a hundred or so megabytes, and it keeps growing (i.e., something keeps writing to it), then there's a good likelihood that Linux has it cached in memory, so it won't be disk-IO limited, but memory-bandwidth limited.
If the file is only ever being appended to, you could also remember the number of lines and previous length, and start from there.
It has been pointed out that you could use mmap with C++ STL algorithms, and create a functor to pass to std::for_each. I suggested that you shouldn't do it, not because you can't do it that way, but because there is no gain in writing the extra code to do so. Or you can use boost's mmapped iterator, which handles it all for you; but for the problem that the code I linked to was written for, this was much, much slower, and the question was about speed, not style.
Answered by Ludwig Weinzierl
You wrote that it keeps getting larger. This sounds like it is a log file or something similar where new lines are appended but existing lines are not changed. If this is the case, you could try an incremental approach:
1. Parse to the end of the file. Remember the line count and the offset of EOF.
2. When the file grows, fseek to the offset, parse to EOF, and update the line count and the offset.
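A minimal sketch of that incremental approach, assuming the file is append-only; the LineIndex struct and its persistence between runs are illustrative, not part of the answer:

#include <cstdio>

struct LineIndex {
    long offset = 0;   // how far the file has already been scanned
    long lines  = 0;   // newlines counted up to 'offset'
};

// Re-scan only the bytes appended since the last call.
void update( LineIndex & state, const char * path ) {
    FILE * f = std::fopen( path, "rb" );
    if ( !f ) return;
    std::fseek( f, state.offset, SEEK_SET );   // skip what was already counted
    char buf[ 1 << 16 ];
    std::size_t n;
    while ( ( n = std::fread( buf, 1, sizeof buf, f ) ) > 0 ) {
        for ( std::size_t i = 0; i < n; ++i )
            if ( buf[i] == '\n' ) ++state.lines;
        state.offset += static_cast<long>( n );
    }
    std::fclose( f );
}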
Answered by Adrian McCarthy
There's a difference between counting lines and counting line separators. Some common gotchas to watch out for if getting an exact line count is important:
- What's the file encoding? The byte-by-byte solutions will work for ASCII and UTF-8, but watch out if you have UTF-16 or some multibyte encoding that doesn't guarantee that a byte with the value of a line feed necessarily encodes a line feed.

- Many text files don't have a line separator at the end of the last line. So if your file says "Hello, World!", you could end up with a count of 0 instead of 1. Rather than just counting the line separators, you'll need a simple state machine to keep track.

- Some very obscure files use Unicode U+2028 LINE SEPARATOR (or even U+2029 PARAGRAPH SEPARATOR) as line separators instead of the more common carriage return and/or line feed. You might also want to watch out for U+0085 NEXT LINE (NEL).

- You'll have to consider whether you want to count some other control characters as line breakers. For example, should a U+000C FORM FEED or U+000B LINE TABULATION (a.k.a. vertical tab) be considered going to a new line?

- Text files from older versions of Mac OS (before OS X) use carriage returns (U+000D) rather than line feeds (U+000A) to separate lines. If you're reading the raw bytes into a buffer (e.g., with your stream in binary mode) and scanning them, you'll come up with a count of 0 on these files. You can't count both carriage returns and line feeds, because PC files generally end a line with both. Again, you'll need a simple state machine. (Alternatively, you can read the file in text mode rather than binary mode. The text interfaces will normalize line separators to '\n' for files that conform to the convention used on your platform. If you're reading files from other platforms, you'll be back to binary mode with a state machine.)

- If you ever have a super long line in the file, the getline() approach can throw an exception causing your simple line counter to fail on a small number of files. (This is particularly true if you're reading an old Mac file on a non-Mac platform, causing getline() to see the entire file as one gigantic line.) By reading chunks into a fixed-size buffer and using a state machine, you can make it bulletproof.
The code in the accepted answer suffers from most of these traps. Make it right before you make it fast.
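As a concrete illustration (a sketch, not code from any of the answers), here is a chunked counter with a small state machine that treats LF, CR, and CRLF as line endings and also counts a final line with no terminator. It still assumes a single-byte encoding such as ASCII or UTF-8, and the file name is a placeholder:

#include <fstream>
#include <iostream>
#include <vector>

int main() {
    std::ifstream ifs( "lines.dat", std::ios::binary );
    std::vector<char> buf( 1 << 20 );
    long lines = 0;
    bool in_line = false;   // seen any byte since the last line ending?
    bool prev_cr = false;   // was the previous byte a CR? (merges CRLF)
    while ( ifs.read( &buf[0], buf.size() ), ifs.gcount() > 0 ) {
        for ( std::streamsize i = 0; i < ifs.gcount(); ++i ) {
            char c = buf[i];
            if ( c == '\n' ) {
                if ( !prev_cr ) ++lines;   // LF alone; a CRLF was already counted at the CR
                prev_cr = false;
                in_line = false;
            } else if ( c == '\r' ) {
                ++lines;                   // CR ends a line (old Mac, or first half of CRLF)
                prev_cr = true;
                in_line = false;
            } else {
                prev_cr = false;
                in_line = true;
            }
        }
    }
    if ( in_line ) ++lines;                // the final line had no terminator
    std::cout << lines << std::endl;
}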
Answered by Martin York
Remember that all fstreams are buffered, so in effect they already do their reads in chunks and you do not have to recreate this functionality. So all you need to do is scan the buffer. Don't use getline() though, as this will force you to size a string. So I would just use the STL std::count and stream iterators.
#include <iostream>
#include <fstream>
#include <iterator>
#include <algorithm>
#include <functional>

struct TestEOL
{
    bool operator()(char c)
    {
        last = c;
        return last == '\n';
    }
    char last = '\n';   // start as '\n' so an empty file counts zero lines
};

int main()
{
    std::ifstream file("Plop.txt");
    TestEOL       test;

    // Pass the functor by reference: std::count_if takes its predicate by
    // value, so without std::ref the updated 'last' member would be lost
    // in the algorithm's private copy.
    std::size_t count = std::count_if(std::istreambuf_iterator<char>(file),
                                      std::istreambuf_iterator<char>(),
                                      std::ref(test));

    if (test.last != '\n') // If the last character checked is not '\n'
    {                      // then the last line in the file has not been
        ++count;           // counted. So increment the count so we count
    }                      // the last line even if it is not '\n' terminated.

    std::cout << count << std::endl;
}
Answered by user88637
It isn't slow because of your algorithm; it is slow because IO operations are slow. I suppose you are using a simple O(n) algorithm that is simply going over the file sequentially. In that case, there is no faster algorithm that can optimize your program.
However, while there is no faster algorithm, there is a faster mechanism called a "memory mapped file". There are some drawbacks to mapped files and it might not be appropriate for your case, so you'll have to read about it and figure that out by yourself.
Memory mapped files won't let you implement an algorithm better than O(n), but they may reduce IO access time.
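A minimal sketch of the memory-mapped approach on POSIX systems (the OP mentioned Linux); error handling is abbreviated, the empty-file case is simply rejected, and the file name is a placeholder:

#include <algorithm>
#include <iostream>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main()
{
    int fd = open( "lines.dat", O_RDONLY );
    if ( fd < 0 ) { perror( "open" ); return 1; }

    struct stat st;
    if ( fstat( fd, &st ) < 0 || st.st_size == 0 ) { close( fd ); return 1; }

    // Map the whole file read-only; the kernel pages it in on demand.
    void * p = mmap( nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0 );
    if ( p == MAP_FAILED ) { perror( "mmap" ); return 1; }

    const char * data = static_cast<const char *>( p );
    std::cout << std::count( data, data + st.st_size, '\n' ) << std::endl;

    munmap( p, st.st_size );
    close( fd );
}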
Answered by paxdiablo
You can only get a definitive answer by scanning the entire file looking for newline characters. There's no way around that.
However, there are a couple of possibilities which you may want to consider.
1/ If you're using a simplistic loop, reading one character at a time checking for newlines, don't. Even though the I/O may be buffered, function calls themselves are expensive, time-wise.
A better option is to read large chunks of the file (say 5M) into memory with a single I/O operation, then process that. You probably don't need to worry too much about special assembly instructions since the C runtime library will be optimized anyway - a simple strchr() should do it.
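A sketch of that chunked approach - using memchr rather than strchr, since strchr would stop at any stray NUL byte in the buffer (the 5MB chunk size follows the suggestion above; the function name is illustrative):

#include <cstdio>
#include <cstring>

long count_lines( const char * path ) {
    FILE * f = std::fopen( path, "rb" );
    if ( !f ) return -1;
    static char buf[ 5 * 1024 * 1024 ];   // ~5M chunk, static to keep it off the stack
    long lines = 0;
    std::size_t n;
    while ( ( n = std::fread( buf, 1, sizeof buf, f ) ) > 0 ) {
        const char * p   = buf;
        const char * end = buf + n;
        // memchr is typically vectorized by the C runtime, as noted above.
        while ( ( p = static_cast<const char *>( std::memchr( p, '\n', end - p ) ) ) ) {
            ++lines;
            ++p;   // step past this newline and keep scanning
        }
    }
    std::fclose( f );
    return lines;
}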
2/ If you're saying that the general line length is about 40-50 characters and you don't need an exact line count, just grab the file size and divide by 45 (or whatever average you deem appropriate).
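That estimate takes a couple of lines (a sketch; the file name and the 45-byte average are assumptions):

#include <fstream>
#include <iostream>

int main() {
    // Open at the end so tellg() reports the file size in bytes.
    std::ifstream f( "lines.dat", std::ios::binary | std::ios::ate );
    long long approx = static_cast<long long>( f.tellg() ) / 45;
    std::cout << "approx. " << approx << " lines" << std::endl;
}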
3/ If this is something like a log file and you don't have to keep it in one file (which may require rework on other parts of the system), consider splitting the file periodically.
For example, when it gets to 5M, move it (e.g., x.log) to a dated file name (e.g., x_20090101_1022.log) and work out how many lines there are at that point (storing it in x_20090101_1022.count), then start a new x.log log file. Characteristics of log files mean that this dated section that was created will never change, so you will never have to recalculate the number of lines.
To process the log "file", you'd just cat x_*.log through some process pipe rather than cat x.log. To get the line count of the "file", do a wc -l on the current x.log (relatively fast) and add it to the sum of all the values in the x_*.count files.
Answered by jalf
The thing that takes time is loading 40+ MB into memory. The fastest way to do that is to either memory-map it, or load it in one go into a big buffer. Once you have it in memory, one way or another, a loop traversing the data looking for '\n' characters is almost instantaneous, no matter how it is implemented.
So really, the most important trick is to load the file into memory as fast as possible. And the fastest way to do that is to do it as a single operation.
Otherwise, plenty of tricks may exist to speed up the algorithm. If lines are only added, never modified or removed, and if you're reading the file repeatedly, you can cache the lines read previously, and the next time you have to read the file, only read the newly added lines.
Or perhaps you can maintain a separate index file showing the location of known '\n' characters, so those parts of the file can be skipped over.
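A sketch of such an index (the format and names are illustrative): store the byte offset of every known '\n', so a later run can report the count - and seek straight to any line - without rescanning the indexed prefix:

#include <cstdio>
#include <vector>

std::vector<long> build_newline_index( const char * path ) {
    std::vector<long> offsets;
    FILE * f = std::fopen( path, "rb" );
    if ( !f ) return offsets;
    char buf[ 1 << 16 ];
    long pos = 0;
    std::size_t n;
    while ( ( n = std::fread( buf, 1, sizeof buf, f ) ) > 0 ) {
        for ( std::size_t i = 0; i < n; ++i )
            if ( buf[i] == '\n' )
                offsets.push_back( pos + static_cast<long>( i ) );
        pos += static_cast<long>( n );
    }
    std::fclose( f );
    return offsets;   // offsets.size() is the number of line terminators seen
}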
Reading large amounts of data from the hard drive is slow. There's no way around that.