Fast text file reading in C++

Note: this page is a translation of a popular Stack Overflow question and its answers, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me) at Stack Overflow.

Original question: http://stackoverflow.com/questions/17925051/
Asked by Arne
I am currently writing a program in C++ which involves reading lots of large text files. Each has ~400,000 lines, with 4000 or more characters per line in extreme cases. Just for testing, I read one of the files using ifstream and the implementation offered by cplusplus.com. It took around 60 seconds, which is far too long. Now I was wondering: is there a straightforward way to improve reading speed?
Edit: The code I am using is more or less this:
string tmpString;
ifstream txtFile(path);

if (txtFile.is_open())
{
    while (txtFile.good())
    {
        m_numLines++;
        getline(txtFile, tmpString);
    }
    txtFile.close();
}
Edit 2: The file I read is only 82 MB. I mainly mentioned that lines could reach 4000 characters because I thought that might be necessary to know for buffering.
Edit 3: Thank you all for your answers, but it seems there is not much room for improvement given my problem. I have to use getline, since I want to count the number of lines. Instantiating the ifstream as binary didn't make reading any faster either. I will try to parallelize it as much as I can; that should work at least.
Edit 4: So apparently there are some things I can do. Big thank you to sehe for putting so much time into this, I appreciate it a lot! =)
Answered by sehe
Updates: Be sure to check the (surprising) updates below the initial answer.
Memory mapped files have served me well [1]:
#include <boost/iostreams/device/mapped_file.hpp> // for mmap
#include <algorithm> // for std::find
#include <iostream>  // for std::cout
#include <cstring>   // for memchr
#include <cstdint>   // for uintmax_t

int main()
{
    boost::iostreams::mapped_file mmap("input.txt", boost::iostreams::mapped_file::readonly);
    auto f = mmap.const_data();
    auto l = f + mmap.size();

    uintmax_t m_numLines = 0;
    while (f && f != l)
        if ((f = static_cast<const char*>(memchr(f, '\n', l - f))))
            m_numLines++, f++;

    std::cout << "m_numLines = " << m_numLines << "\n";
}
This should be rather quick.
Update
In case it helps you test this approach, here's a version using mmap directly instead of using Boost: see it live on Coliru.
#include <algorithm>
#include <iostream>
#include <cstring>
#include <cstdint>
#include <cstdio>  // for perror
#include <cstdlib> // for exit

// for mmap:
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>

const char* map_file(const char* fname, size_t& length);

int main()
{
    size_t length;
    auto f = map_file("test.cpp", length);
    auto l = f + length;

    uintmax_t m_numLines = 0;
    while (f && f != l)
        if ((f = static_cast<const char*>(memchr(f, '\n', l - f))))
            m_numLines++, f++;

    std::cout << "m_numLines = " << m_numLines << "\n";
}

void handle_error(const char* msg) {
    perror(msg);
    exit(255);
}

const char* map_file(const char* fname, size_t& length)
{
    int fd = open(fname, O_RDONLY);
    if (fd == -1)
        handle_error("open");

    // obtain file size
    struct stat sb;
    if (fstat(fd, &sb) == -1)
        handle_error("fstat");
    length = sb.st_size;

    const char* addr = static_cast<const char*>(mmap(NULL, length, PROT_READ, MAP_PRIVATE, fd, 0u));
    if (addr == MAP_FAILED)
        handle_error("mmap");

    // TODO close fd at some point in time, call munmap(...)
    return addr;
}
Update
The last bit of performance I could squeeze out of this I found by looking at the source of GNU coreutils wc. To my surprise, the following (greatly simplified) code adapted from wc runs in about 84% of the time taken by the memory mapped file above:
// requires <fcntl.h> and <unistd.h>, plus handle_error() from the listing above
static uintmax_t wc(char const *fname)
{
    static const auto BUFFER_SIZE = 16 * 1024;
    int fd = open(fname, O_RDONLY);
    if (fd == -1)
        handle_error("open");

    /* Advise the kernel of our access pattern. */
    posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

    char buf[BUFFER_SIZE + 1];
    uintmax_t lines = 0;

    while (size_t bytes_read = read(fd, buf, BUFFER_SIZE))
    {
        if (bytes_read == (size_t)-1)
            handle_error("read failed");
        if (!bytes_read)
            break;

        for (char *p = buf; (p = (char*)memchr(p, '\n', (buf + bytes_read) - p)); ++p)
            ++lines;
    }

    return lines;
}
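A minimal driver for the function above might look like this (my own sketch, not part of the answer; it assumes wc() and handle_error() from the listings above, plus <iostream>):

int main(int argc, char** argv)
{
    // count newlines in each file named on the command line
    for (int i = 1; i < argc; ++i)
        std::cout << argv[i] << ": " << wc(argv[i]) << "\n";
}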
[1] See e.g. the benchmark here: How to parse space-separated floats in C++ quickly?
Answered by Louis Ricci
4000 * 400,000 = 1.6 GB. If your hard drive isn't an SSD, you're likely getting ~100 MB/s sequential read; that's 16 seconds just in I/O.
Since you don't elaborate on the specific code you're using, or on how you need to parse these files (do you need to read them line by line? does the system have a lot of RAM, so you could read the whole file into a large RAM buffer and then parse it?), there's little you can do to speed up the process.
Memory mapped files won't offer any performance improvement when reading a file sequentially. Perhaps manually scanning large chunks for newlines rather than using "getline" would offer an improvement.
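For illustration, here's a minimal sketch of that chunked approach, scanning fixed-size blocks with memchr instead of calling getline per line (the 1 MiB buffer size and the file name are placeholder assumptions, not from the answer):

#include <fstream>
#include <iostream>
#include <vector>
#include <cstring>
#include <cstdint>

int main()
{
    std::ifstream in("input.txt", std::ios::binary);
    std::vector<char> buf(1 << 20); // 1 MiB chunk; size is an arbitrary choice
    uintmax_t lines = 0;

    while (in)
    {
        in.read(buf.data(), buf.size());
        std::streamsize got = in.gcount(); // may be short on the last chunk

        // count newlines in this chunk
        for (const char *p = buf.data(), *end = p + got;
             (p = static_cast<const char*>(memchr(p, '\n', end - p))); ++p)
            ++lines;
    }
    std::cout << "lines = " << lines << "\n";
}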
EDIT: After doing some learning (thanks @sehe), here's the memory mapped solution I would likely use.
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <errno.h>

int main() {
    const char* fName = "big.txt";
    struct stat sb;
    long cntr = 0;
    int fd, lineLen;
    char *data;
    char *line;

    // map the file
    fd = open(fName, O_RDONLY);
    fstat(fd, &sb);
    data = (char*)mmap((caddr_t)0, sb.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    char *base = data; // keep the base address for munmap()

    // get lines
    while (cntr < sb.st_size) {
        lineLen = 0;
        line = data;
        // find the end of the current line (check bounds before dereferencing)
        while (cntr < sb.st_size && *data != '\n') {
            data++;
            cntr++;
            lineLen++;
        }
        // step past the '\n' so the scan keeps making progress
        data++;
        cntr++;
        /***** PROCESS LINE *****/
        // ... processLine(line, lineLen);
    }

    munmap(base, sb.st_size);
    close(fd);
    return 0;
}
Answered by user2434119
Neil Kirk, unfortunately I cannot reply to your comment (not enough reputation), but I did a performance test on ifstream and stringstream, and the performance, reading a text file line by line, is exactly the same.
std::stringstream stream;
std::string line;
while(std::getline(stream, line)) {
}
This takes 1426ms on a 106MB file.
std::ifstream stream;
std::string line;
while (stream.good()) {
    getline(stream, line);
}
This takes 1433ms on the same file.
The following code is faster:
const int MAX_LENGTH = 524288;
char* line = new char[MAX_LENGTH]; // remember to delete[] line when done
while (stream.getline(line, MAX_LENGTH) && strlen(line) > 0) {
}
This takes 884ms on the same file. It is just a little tricky since you have to set the maximum size of your buffer (i.e. maximum length for each line in the input file).
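For reproducing such measurements, a simple harness based on <chrono> could look like this (a sketch; the file name and the loop under test are placeholders, not from the answer):

#include <chrono>
#include <fstream>
#include <iostream>
#include <string>

int main()
{
    std::ifstream stream("input.txt");
    std::string line;

    auto t0 = std::chrono::steady_clock::now();
    while (std::getline(stream, line)) {
        // ... per-line work under test ...
    }
    auto t1 = std::chrono::steady_clock::now();

    std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count()
              << " ms\n";
}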
Answered by Jo So
As someone with a little background in competitive programming, I can tell you: at least for simple things like integer parsing, the main cost in C is locking the file streams (which is done by default for multi-threading). Use the unlocked_stdio versions instead (fgetc_unlocked(), fread_unlocked()). For C++, the common lore is to use std::ios::sync_with_stdio(false), but I don't know if it's as fast as unlocked_stdio.
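As a rough illustration of the C++ side, a line-count loop with synchronization disabled might look like this (a sketch; the std::cin.tie(nullptr) call is a common companion tweak added here, not something the answer mentions):

#include <iostream>
#include <string>
#include <cstdint>

int main()
{
    std::ios::sync_with_stdio(false); // detach iostreams from C stdio
    std::cin.tie(nullptr);            // don't flush std::cout before each read

    std::string line;
    uintmax_t lines = 0;
    while (std::getline(std::cin, line))
        ++lines;
    std::cout << lines << "\n";
}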
For reference, here is my standard integer parsing code. It's a lot faster than scanf, as I said mainly due to not locking the stream. For me it was as fast as the best hand-coded mmap or custom buffered versions I'd used previously, without the insane maintenance debt.
// getchar_unlocked() is POSIX; requires <stdio.h>
int readint(void)
{
    int n, c;
    n = getchar_unlocked() - '0';
    while ((c = getchar_unlocked()) > ' ')
        n = 10 * n + c - '0';
    return n;
}
(Note: This one only works if there is precisely one non-digit character between any two integers).
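For instance, a hypothetical driver that reads a count followed by that many integers could look like this (my own usage sketch, not from the answer; it assumes well-formed input as the note describes, and readint() from above):

#include <stdio.h>

int main(void)
{
    int n = readint();  // number of values to follow
    long long sum = 0;
    for (int i = 0; i < n; ++i)
        sum += readint();
    printf("%lld\n", sum);
    return 0;
}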
And of course avoid memory allocation if possible...
Answered by Shumail
Use random file access, or open the file in binary mode (see the sketch below). For sequential reading, the file size matters, but it still depends on what you are reading.
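For completeness, opening the stream in binary mode is just a constructor flag, as in the sketch below (whether it helps is workload-dependent; the asker reported it made no difference here):

#include <fstream>

int main()
{
    // std::ios::binary disables newline translation on platforms that perform it
    std::ifstream txtFile("input.txt", std::ios::in | std::ios::binary);
    // ... read as usual ...
}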