Declaration: this page reproduces a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/4340396/
Does the C++ standard mandate poor performance for iostreams, or am I just dealing with a poor implementation?
Asked by Ben Voigt
Every time I mention slow performance of C++ standard library iostreams, I get met with a wave of disbelief. Yet I have profiler results showing large amounts of time spent in iostream library code (full compiler optimizations), and switching from iostreams to OS-specific I/O APIs and custom buffer management does give an order of magnitude improvement.
What extra work is the C++ standard library doing, is it required by the standard, and is it useful in practice? Or do some compilers provide implementations of iostreams that are competitive with manual buffer management?
Benchmarks
To get matters moving, I've written a couple of short programs to exercise the iostreams internal buffering:
- putting binary data into an ostringstream: http://ideone.com/2PPYw
- putting binary data into a char[] buffer: http://ideone.com/Ni5ct
- putting binary data into a vector<char> using back_inserter: http://ideone.com/Mj2Fi
- NEW: vector<char> with a simple iterator: http://ideone.com/9iitv
- NEW: putting binary data directly into a stringbuf: http://ideone.com/qc9QA
- NEW: vector<char> simple iterator plus bounds check: http://ideone.com/YyrKy
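The linked snippets are the definitive versions; as a rough sketch only, the ostringstream and char[] variants boil down to inner loops of the following shape (the iteration count is illustrative, not the one used on ideone):

```cpp
// Rough sketch of the kind of inner loop the benchmarks time
// (illustrative only -- the linked ideone snippets are the real tests).
#include <sstream>
#include <cstring>

int main()
{
    const int N = 1000000;   // number of ints appended per pass (illustrative)

    // Variant 1: append binary data through an ostringstream
    std::ostringstream os;
    for (int i = 0; i < N; ++i)
        os.write(reinterpret_cast<const char*>(&i), sizeof(i));

    // Variant 2: append the same data into a preallocated raw buffer
    char* raw = new char[N * sizeof(int)];
    char* p = raw;
    for (int i = 0; i < N; ++i) {
        std::memcpy(p, &i, sizeof(i));
        p += sizeof(i);
    }
    delete[] raw;
    return 0;
}
```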
Note that the ostringstream and stringbuf versions run fewer iterations because they are so much slower.
On ideone, the ostringstream is about 3 times slower than std::copy + back_inserter + std::vector, and about 15 times slower than memcpy into a raw buffer. This feels consistent with before-and-after profiling when I switched my real application to custom buffering.
These are all in-memory buffers, so the slowness of iostreams can't be blamed on slow disk I/O, too much flushing, synchronization with stdio, or any of the other things people use to excuse observed slowness of the C++ standard library iostream.
It would be nice to see benchmarks on other systems and commentary on things common implementations do (such as gcc's libstdc++, Visual C++, Intel C++) and how much of the overhead is mandated by the standard.
Rationale for this test
A number of people have correctly pointed out that iostreams are more commonly used for formatted output. However, they are also the only modern API provided by the C++ standard for binary file access. But the real reason for doing performance tests on the internal buffering applies to the typical formatted I/O: if iostreams can't keep the disk controller supplied with raw data, how can they possibly keep up when they are responsible for formatting as well?
Benchmark Timing
All these are per iteration of the outer (k) loop.
On ideone (gcc-4.3.4, unknown OS and hardware):
- ostringstream: 53 milliseconds
- stringbuf: 27 ms
- vector<char> and back_inserter: 17.6 ms
- vector<char> with ordinary iterator: 10.6 ms
- vector<char> iterator and bounds check: 11.4 ms
- char[]: 3.7 ms
On my laptop (Visual C++ 2010 x86, cl /Ox /EHsc, Windows 7 Ultimate 64-bit, Intel Core i7, 8 GB RAM):
- ostringstream: 73.4 milliseconds, 71.6 ms
- stringbuf: 21.7 ms, 21.3 ms
- vector<char> and back_inserter: 34.6 ms, 34.4 ms
- vector<char> with ordinary iterator: 1.10 ms, 1.04 ms
- vector<char> iterator and bounds check: 1.11 ms, 0.87 ms, 1.12 ms, 0.89 ms, 1.02 ms, 1.14 ms
- char[]: 1.48 ms, 1.57 ms
Visual C++ 2010 x86, with Profile-Guided Optimization (cl /Ox /EHsc /GL /c, link /ltcg:pgi, run, link /ltcg:pgo, measure):
- ostringstream: 61.2 ms, 60.5 ms
- vector<char> with ordinary iterator: 1.04 ms, 1.03 ms
Same laptop, same OS, using cygwin gcc 4.3.4, g++ -O3:
- ostringstream: 62.7 ms, 60.5 ms
- stringbuf: 44.4 ms, 44.5 ms
- vector<char> and back_inserter: 13.5 ms, 13.6 ms
- vector<char> with ordinary iterator: 4.1 ms, 3.9 ms
- vector<char> iterator and bounds check: 4.0 ms, 4.0 ms
- char[]: 3.57 ms, 3.75 ms
Same laptop, Visual C++ 2008 SP1, cl /Ox /EHsc:
- ostringstream: 88.7 ms, 87.6 ms
- stringbuf: 23.3 ms, 23.4 ms
- vector<char> and back_inserter: 26.1 ms, 24.5 ms
- vector<char> with ordinary iterator: 3.13 ms, 2.48 ms
- vector<char> iterator and bounds check: 2.97 ms, 2.53 ms
- char[]: 1.52 ms, 1.25 ms
Same laptop, Visual C++ 2010 64-bit compiler:
- ostringstream: 48.6 ms, 45.0 ms
- stringbuf: 16.2 ms, 16.0 ms
- vector<char> and back_inserter: 26.3 ms, 26.5 ms
- vector<char> with ordinary iterator: 0.87 ms, 0.89 ms
- vector<char> iterator and bounds check: 0.99 ms, 0.99 ms
- char[]: 1.25 ms, 1.24 ms
EDIT: Ran all twice to see how consistent the results were. Pretty consistent IMO.
NOTE: On my laptop, since I can spare more CPU time than ideone allows, I set the number of iterations to 1000 for all methods. This means that ostringstream and vector reallocation, which takes place only on the first pass, should have little impact on the final results.
EDIT: Oops, found a bug in the vector-with-ordinary-iterator test: the iterator wasn't being advanced, so there were too many cache hits. I had been wondering how vector<char> was outperforming char[]. It didn't make much difference though; vector<char> is still faster than char[] under VC++ 2010.
Conclusions
Buffering of output streams requires three steps each time data is appended:
- Check that the incoming block fits the available buffer space.
- Copy the incoming block.
- Update the end-of-data pointer.
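As a rough illustration only (a hypothetical buffer type, not code from any standard library), the three steps look like this:

```cpp
// Hypothetical minimal append routine illustrating the three steps.
#include <cstring>
#include <cstddef>

struct OutBuffer {
    char* begin;   // start of storage
    char* end;     // one past the last byte written (end-of-data pointer)
    char* limit;   // one past the end of available storage
};

bool append(OutBuffer& buf, const char* data, std::size_t n)
{
    if (static_cast<std::size_t>(buf.limit - buf.end) < n)   // 1. check the block fits
        return false;   // a file streambuf would flush and reuse here; a stringbuf grows instead
    std::memcpy(buf.end, data, n);                            // 2. copy the incoming block
    buf.end += n;                                             // 3. advance the end-of-data pointer
    return true;
}
```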
The latest code snippet I posted, "vector<char> simple iterator plus bounds check", not only does this, it also allocates additional space and moves the existing data when the incoming block doesn't fit. As Clifford pointed out, buffering in a file I/O class wouldn't have to do that; it would just flush the current buffer and reuse it. So this should be an upper bound on the cost of buffering output. And it's exactly what is needed to make a working in-memory buffer.
So why is stringbuf 2.5x slower on ideone, and at least 10 times slower when I test it? It isn't being used polymorphically in this simple micro-benchmark, so that doesn't explain it.
Answered by beldaz
Not answering the specifics of your question so much as the title: the 2006 Technical Report on C++ Performance has an interesting section on IOStreams (p.68). Most relevant to your question is in Section 6.1.2 ("Execution Speed"):
Since certain aspects of IOStreams processing are distributed over multiple facets, it appears that the Standard mandates an inefficient implementation. But this is not the case — by using some form of preprocessing, much of the work can be avoided. With a slightly smarter linker than is typically used, it is possible to remove some of these inefficiencies. This is discussed in §6.2.3 and §6.2.5.
Since the report was written in 2006 one would hope that many of the recommendations would have been incorporated into current compilers, but perhaps this is not the case.
As you mention, facets may not feature in write() (but I wouldn't assume that blindly). So what does feature? Running GProf on your ostringstream code compiled with GCC gives the following breakdown:
- 44.23% in std::basic_streambuf<char>::xsputn(char const*, int)
- 34.62% in std::ostream::write(char const*, int)
- 12.50% in main
- 6.73% in std::ostream::sentry::sentry(std::ostream&)
- 0.96% in std::string::_M_replace_safe(unsigned int, unsigned int, char const*, unsigned int)
- 0.96% in std::basic_ostringstream<char>::basic_ostringstream(std::_Ios_Openmode)
- 0.00% in std::fpos<int>::fpos(long long)
So the bulk of the time is spent in xsputn, which eventually calls std::copy() after lots of checking and updating of cursor positions and buffers (have a look in c++\bits\streambuf.tcc for the details).
My take on this is that you've focused on the worst-case situation. All the checking that is performed would be a small fraction of the total work done if you were dealing with reasonably large chunks of data. But your code is shifting data four bytes at a time, and incurring all the extra costs each time. Clearly one would avoid doing so in a real-life situation - consider how negligible the penalty would have been if write was called on an array of 1m ints instead of 1m times on one int. And in a real-life situation one would really appreciate the important features of IOStreams, namely their memory-safe and type-safe design. Such benefits come at a price, and you've written a test which makes these costs dominate the execution time.
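To make that contrast concrete, here is a hedged sketch (the helper names are mine, not from the benchmarks) comparing a per-int write with one bulk write of the whole array:

```cpp
// Sketch contrasting per-element writes with one bulk write (names are illustrative).
#include <ostream>
#include <vector>
#include <cstddef>

void per_element(std::ostream& os, const std::vector<int>& v)
{
    // One sentry construction and one tiny xsputn per int: the fixed
    // per-call overhead dominates the four bytes actually copied.
    for (std::size_t i = 0; i < v.size(); ++i)
        os.write(reinterpret_cast<const char*>(&v[i]), sizeof(int));
}

void bulk(std::ostream& os, const std::vector<int>& v)
{
    // The same per-call overhead is paid once for the whole block.
    if (!v.empty())
        os.write(reinterpret_cast<const char*>(&v[0]), v.size() * sizeof(int));
}
```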
Answered by Ben Voigt
I'm rather disappointed in the Visual Studio users out there, who rather had a gimme on this one:
- In the Visual Studio implementation of ostream, the sentry object (which is required by the standard) enters a critical section protecting the streambuf (which is not required). This doesn't seem to be optional, so you pay the cost of thread synchronization even for a local stream used by a single thread, which has no need for synchronization.
This hurts code that uses ostringstream to format messages pretty severely. Using the stringbuf directly avoids the use of sentry, but the formatted insertion operators can't work directly on streambufs. For Visual C++ 2010, the critical section is slowing down ostringstream::write by a factor of three vs the underlying stringbuf::sputn call.
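A minimal sketch of that workaround (function names are illustrative): write through the stringbuf with sputn rather than through the ostream, at the cost of losing the formatted insertion operators:

```cpp
// Sketch of bypassing the sentry by writing to the stringbuf directly.
#include <sstream>

void through_stream(std::ostringstream& os, const char* data, std::streamsize len)
{
    os.write(data, len);   // constructs a sentry; under VC++ 2010 this also takes a critical section
}

void through_buf(std::stringbuf& sb, const char* data, std::streamsize len)
{
    sb.sputn(data, len);   // no sentry, no lock -- but no formatted insertion operators either
}
```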
Looking at beldaz's profiler data on newlib, it seems clear that gcc's sentry doesn't do anything crazy like this. ostringstream::write under gcc only takes about 50% longer than stringbuf::sputn, but stringbuf itself is much slower than under VC++. And both still compare very unfavorably to using a vector<char> for I/O buffering, although not by the same margin as under VC++.
Answered by Roddy
The problem you see is all in the overhead around each call to write(). Each level of abstraction that you add (char[] -> vector -> string -> ostringstream) adds a few more function call/returns and other housekeeping guff that - if you call it a million times - adds up.
I modified two of the examples on ideone to write ten ints at a time. The ostringstream time went from 53 ms to 6 ms (almost a 10x improvement) while the char loop improved from 3.7 ms to 1.5 ms - useful, but only by a factor of two.
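A rough sketch of that modification, assuming a batch size of ten and a count that divides evenly (the exact modified snippets may differ):

```cpp
// Sketch of the "ten ints at a time" change: batch values locally,
// then issue one write() per block instead of one per int.
#include <sstream>

void write_batched(std::ostringstream& os, int first, int count)   // assumes count % 10 == 0
{
    int block[10];
    for (int i = 0; i < count; i += 10) {
        for (int j = 0; j < 10; ++j)
            block[j] = first + i + j;
        os.write(reinterpret_cast<const char*>(block), sizeof(block));   // one call per ten ints
    }
}
```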
If you're that concerned about performance then you need to choose the right tool for the job. ostringstream is useful and flexible, but there's a penalty for using it the way you're trying to. char[] is harder work, but the performance gains can be great (remember that gcc will probably inline the memcpys for you as well).
In short, ostringstream isn't broken, but the closer you get to the metal the faster your code will run. Assembler still has advantages for some folk.
Answered by Clifford
To get better performance you have to understand how the containers you are using work. In your char[] array example, the array of the required size is allocated in advance. In your vector and ostringstream example you are forcing the objects to repeatedly allocate and reallocate and possibly copy data many times as the object grows.
With std::vector this is easily resolved by initialising the size of the vector to the final size, as you did with the char array; instead you rather unfairly cripple the performance by resizing to zero! That is hardly a fair comparison.
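A sketch of the presized-vector approach (the helper name and size are illustrative; a real test would size the buffer to whatever it is about to write):

```cpp
// Sketch: size the vector up front, as the char[] test effectively does,
// so the timed loop never reallocates.
#include <vector>
#include <cstring>

void fill_presized(std::vector<char>& buf, int n_ints)
{
    if (n_ints <= 0) return;
    buf.resize(n_ints * sizeof(int));   // pay for the allocation once, before the hot loop
    char* p = &buf[0];
    for (int i = 0; i < n_ints; ++i) {
        std::memcpy(p, &i, sizeof(int));
        p += sizeof(int);
    }
}
```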
With respect to ostringstream, preallocating the space is not possible; I would suggest that it is an inappropriate use. The class has far greater utility than a simple char array, but if you don't need that utility, then don't use it, because you will pay the overhead in any case. Instead it should be used for what it is good for - formatting data into a string. C++ provides a wide range of containers and an ostringstream is amongst the least appropriate for this purpose.
In the case of the vector and ostringstream you get protection from buffer overrun; you don't get that with a char array, and that protection does not come for free.