C++: reading input files, the fastest way possible?
Original question: http://stackoverflow.com/questions/6755111/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
read input files, fastest way possible?
Asked by Kiarash
I have numerous text files of data in the form of float numbers. I'm looking for the fastest way to read them in C++. I can change the file to binary if that's the fastest.
It would be great if you could give me a hint or refer me to a website with a complete explanation. I don't know whether there is any library that does this quickly. Even a pointer to open source software that does it would be helpful.
Answered by Matteo Italia
Having a binary file is the fastest option. Not only can you read it directly into an array with a raw istream::read in a single operation (which is very fast), but you can even map the file into memory if your OS supports it; you can use open/mmap on POSIX systems, CreateFile/CreateFileMapping/MapViewOfFile on Windows, or even the Boost cross-platform solution (thanks @Cory Nelson for pointing it out).
Quick & dirty examples, assuming the file contains the raw representation of some floats:
"Normal" read:
#include <cstddef>
#include <fstream>
#include <vector>
// ...
// Open the stream in binary mode so the bytes are read untranslated
std::ifstream is("input.dat", std::ios::binary);
// Determine the file length
is.seekg(0, std::ios_base::end);
std::size_t size=is.tellg();
is.seekg(0, std::ios_base::beg);
// Create a vector to store the data
std::vector<float> v(size/sizeof(float));
// Load the data (whole floats only, in case size is not a multiple of sizeof(float))
is.read((char*) v.data(), v.size()*sizeof(float));
// Close the file
is.close();
Using a memory-mapped file (via Boost.Interprocess):
#include <cstddef>
#include <boost/interprocess/file_mapping.hpp>
#include <boost/interprocess/mapped_region.hpp>
using namespace boost::interprocess;
// ....
// Create the file mapping
file_mapping fm("input.dat", read_only);
// Map the file in memory
mapped_region region(fm, read_only);
// Get the address where the file has been mapped (read-only, hence const)
const float * addr = static_cast<const float *>(region.get_address());
std::size_t elements = region.get_size()/sizeof(float);
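For comparison, the open/mmap route mentioned above might look roughly like this on a POSIX system. This is a minimal sketch and not part of the original answer: the MappedFloats struct and the map_floats/unmap_floats names are hypothetical, and it assumes the file holds raw floats in the machine's native representation.

```cpp
#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Hypothetical helper type: a read-only view of a file of raw floats.
struct MappedFloats {
    const float* data = nullptr;  // start of the mapping, or nullptr on failure
    std::size_t count = 0;        // number of whole floats in the file
    std::size_t bytes = 0;        // mapping length, needed later by munmap
};

// Map `path` read-only. Returns an empty view if anything fails
// or if the file is empty (mmap rejects zero-length mappings).
MappedFloats map_floats(const char* path) {
    MappedFloats m;
    int fd = open(path, O_RDONLY);
    if (fd < 0) return m;
    struct stat st;
    if (fstat(fd, &st) == 0 && st.st_size > 0) {
        std::size_t len = static_cast<std::size_t>(st.st_size);
        void* p = mmap(nullptr, len, PROT_READ, MAP_PRIVATE, fd, 0);
        if (p != MAP_FAILED) {
            m.data = static_cast<const float*>(p);
            m.bytes = len;
            m.count = len / sizeof(float);
        }
    }
    close(fd);  // the mapping stays valid after the descriptor is closed
    return m;
}

// Release the mapping and reset the view.
void unmap_floats(MappedFloats& m) {
    if (m.data) munmap(const_cast<float*>(m.data), m.bytes);
    m = MappedFloats();
}
```

A caller would use map_floats("input.dat"), read m.data[0] through m.data[m.count - 1], and call unmap_floats when finished. Pages are loaded on demand, so the cost of touching the data is paid as it is accessed rather than up front.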
Answered by Thomas Matthews
Your bottleneck is in the I/O. You want the program to read as much data into memory as possible in the fewest I/O calls. For example, reading 256 numbers with one fread is faster than 256 freads of one number each.
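To make the comparison concrete, the sketch below reads the same floats once with a single fread call and once with one fread per value. The function names are made up for this example, and the file is assumed to contain raw native floats.

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

// One fread call for all `count` values.
std::vector<float> read_bulk(const char* path, std::size_t count) {
    std::vector<float> v(count);
    std::FILE* f = std::fopen(path, "rb");
    if (!f) return std::vector<float>();
    std::size_t got = std::fread(v.data(), sizeof(float), count, f);
    std::fclose(f);
    v.resize(got);  // keep only what was actually read
    return v;
}

// Up to `count` separate fread calls of one value each: same result,
// but the per-call overhead is paid for every number.
std::vector<float> read_one_by_one(const char* path, std::size_t count) {
    std::vector<float> v;
    std::FILE* f = std::fopen(path, "rb");
    if (!f) return v;
    float x;
    while (v.size() < count && std::fread(&x, sizeof x, 1, f) == 1)
        v.push_back(x);
    std::fclose(f);
    return v;
}
```

Because stdio is buffered, the one-by-one version does not necessarily issue a system call per value; much of the difference is library call overhead, so it is worth timing both on your own data.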
If you can, format the data file to match the target platform's internal floating point representation, or at least your program's representation. This reduces the overhead of translating textual representation to internal representation.
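One way to act on this is a one-off conversion step that parses the text once and stores the values in raw form. The sketch below is only an illustration: text_to_binary is a hypothetical name, and it assumes the text file contains whitespace-separated numbers.

```cpp
#include <cstddef>
#include <fstream>
#include <vector>

// Hypothetical one-off converter: parse whitespace-separated floats from a
// text file and dump them as raw native floats.
// Returns the number of values written, or 0 if writing failed.
std::size_t text_to_binary(const char* text_path, const char* bin_path) {
    std::ifstream in(text_path);
    std::vector<float> values;
    float x;
    while (in >> x) values.push_back(x);  // stops at EOF or the first bad token

    std::ofstream out(bin_path, std::ios::binary | std::ios::trunc);
    if (!values.empty())
        out.write(reinterpret_cast<const char*>(values.data()),
                  static_cast<std::streamsize>(values.size() * sizeof(float)));
    return out ? values.size() : 0;
}
```

The resulting file matches the float layout and byte order of the machine that wrote it, so it should be regenerated rather than copied when moving to a different platform.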
Bypass the OS and use the DMA controller to read in the file data, if possible. The DMA chip takes the burden of reading data into memory off the shoulders of the processor.
Compact your data file. The data file should occupy one contiguous set of sectors on the disk. This will reduce the amount of time spent seeking to different areas on the physical platters.
Have your program demand exclusive control over the disk resource and the processors. Block all other unimportant tasks; raise the priority of your program's execution.
Use multiple buffers to keep the disk drive spinning. A large portion of time is spent waiting for the hard drive to accelerate and decelerate. Your program can be processing the data while something else is storing the data into a buffer, which leads to ...
Multi-thread. Create one thread to read in the data and alert the processing task when the buffer is not empty.
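The multiple-buffer and reader-thread ideas can be sketched together as follows. This is a hedged illustration rather than the answerer's code: sum_floats_threaded, its parameters, and the choice of summing as the "processing" step are all assumptions made for the example, and the file is assumed to hold raw native floats.

```cpp
#include <condition_variable>
#include <cstddef>
#include <fstream>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical example: a reader thread fills chunks of raw floats while the
// calling thread processes (here: sums) the chunks that are already available.
double sum_floats_threaded(const char* path, std::size_t chunk_floats = 1 << 16,
                           std::size_t max_queued = 2) {
    if (max_queued == 0) max_queued = 1;   // the reader needs at least one slot
    std::mutex mtx;
    std::condition_variable cv;
    std::queue<std::vector<float>> ready;  // filled buffers awaiting processing
    bool done = false;                     // set once the reader has finished

    std::thread reader([&] {
        std::ifstream in(path, std::ios::binary);
        while (in) {
            std::vector<float> chunk(chunk_floats);
            in.read(reinterpret_cast<char*>(chunk.data()),
                    static_cast<std::streamsize>(chunk.size() * sizeof(float)));
            chunk.resize(static_cast<std::size_t>(in.gcount()) / sizeof(float));
            if (chunk.empty()) break;
            std::unique_lock<std::mutex> lock(mtx);
            // Wait while all buffers are full, so memory use stays bounded.
            cv.wait(lock, [&] { return ready.size() < max_queued; });
            ready.push(std::move(chunk));
            cv.notify_all();
        }
        std::lock_guard<std::mutex> lock(mtx);
        done = true;
        cv.notify_all();
    });

    double total = 0.0;
    for (;;) {
        std::vector<float> chunk;
        {
            std::unique_lock<std::mutex> lock(mtx);
            cv.wait(lock, [&] { return !ready.empty() || done; });
            if (ready.empty()) break;          // reader finished, nothing left
            chunk = std::move(ready.front());
            ready.pop();
            cv.notify_all();                   // a buffer slot is free again
        }
        for (float f : chunk) total += f;      // processing happens outside the lock
    }
    reader.join();
    return total;
}
```

With max_queued set to 2 this behaves like classic double buffering: one chunk is being processed while the reader fills the next. Whether it helps depends on how expensive the processing is relative to the read.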
These should keep you busy for a while. All other optimizations will result in negligible performance gains. (Such as accessing the hard drive controller directly to transfer into one of your buffers.)
Answered by Brian Ng
Also pay attention to the compile mode. I tried parsing a file with 1M lines. Debug mode took 50 seconds to parse the data and append it to my container. Release mode was at least ten times faster, at about 4 seconds. The code below reads the whole file before using istringstream to parse the data as 2D points (,).
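#include <cstdlib>
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>
using namespace std;
// ...
// _file_in is a std::string holding the input path, defined elsewhere
vector<float> in_data;
ifstream ifs(_file_in.c_str(), ios::binary);
// Determine the file length (assumes the file opened successfully)
ifs.seekg(0, ios::end);
streamoff length = ifs.tellg();
ifs.seekg(0, ios::beg);
// Read the whole file straight into the string; copying from a raw char
// buffer with operator= would require a terminating '\0' that read() does not add
string raw_data(static_cast<size_t>(length), '\0');
ifs.read(&raw_data[0], length);
ifs.close();
cout << "Size: " << raw_data.length()/1024/1024.0 << "Mb" << endl;
istringstream _sstr(raw_data);
string _line;
while (getline(_sstr, _line)) {
    istringstream _ss(_line);
    vector<float> record;
    // maybe using boost/Tokenizer is a good idea ...
    string s;
    while (getline(_ss, s, ','))
        record.push_back(atof(s.c_str()));
    // Skip blank lines so record[0] is never read from an empty vector
    if (!record.empty())
        in_data.push_back(record[0]);
}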
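If the per-line istringstream turns out to be a hot spot, one alternative is to scan the loaded buffer directly with strtof. The sketch below is not from the original answer: first_fields is a hypothetical helper that, like the loop above, keeps only the first comma-separated field of each line. Unlike the atof version, it skips lines whose first field is not a number instead of storing 0.

```cpp
#include <cstddef>
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical helper: return the first comma-separated field of every
// line in `text` that starts with a number, parsed as a float.
std::vector<float> first_fields(const std::string& text) {
    std::vector<float> out;
    const char* base = text.c_str();
    std::size_t pos = 0;
    while (pos < text.size()) {
        std::size_t eol = text.find('\n', pos);
        if (eol == std::string::npos) eol = text.size();
        if (eol > pos) {
            const char* begin = base + pos;
            char* end = nullptr;
            float value = std::strtof(begin, &end);
            // strtof skips leading whitespace, including newlines, so reject
            // any parse that ran past the end of the current line.
            if (end != begin && end <= base + eol) out.push_back(value);
        }
        pos = eol + 1;
    }
    return out;
}
```

Calling first_fields(raw_data) would replace the nested getline loops. It avoids constructing a stream and a temporary string per field, but whether that is a meaningful gain for a given file is something to measure in a release build.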