Linux C++ memory allocation mechanism performance comparison (tcmalloc vs. jemalloc)

Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverflow.
Original question: http://stackoverflow.com/questions/7852731/

Asked by Shayan Pooya
I have an application which allocates lots of memory and I am considering using a better memory allocation mechanism than malloc.
My main options are jemalloc and tcmalloc. Are there any benefits to using one of them over the other?
There is a good comparison between some mechanisms (including the author's proprietary mechanism, lockless) at http://locklessinc.com/benchmarks.shtml and it mentions some pros and cons of each of them.
Given that both of the mechanisms are active and constantly improving, does anyone have any insight or experience with the relative performance of these two?
Accepted answer by Matthieu M.
If I remember correctly, the main difference was with multi-threaded projects.
Both libraries try to reduce contention on memory acquisition by having threads pick memory from different caches, but they use different strategies:
- jemalloc (used by Facebook) maintains a cache per thread
- tcmalloc (from Google) maintains a pool of caches, and threads develop a "natural" affinity for a cache, but may change
This led, once again if I remember correctly, to an important difference in terms of thread management.
- jemalloc is faster if threads are static, for example using pools
- tcmalloc is faster when threads are created/destructed
There is also the problem that since jemalloc spins up new caches to accommodate new thread ids, a sudden spike in thread count will leave you with (mostly) empty caches in the subsequent calm phase.
As a result, I would recommend tcmalloc in the general case, and reserve jemalloc for very specific usages (low variation in the number of threads during the lifetime of the application).
Answered by SunfiShie
There's a pretty good discussion about allocators here:
http://www.reddit.com/r/programming/comments/7o8d9/tcmalloca_faster_malloc_than_glibcs_open_sourced/
Answered by Martin
Your post does not mention threading, but before considering mixing C and C++ allocation methods, I would investigate the concept of a memory pool. Boost has a good one.
Answered by Basile Starynkevitch
You could also consider using the Boehm conservative garbage collector. Basically, you replace every malloc in your source code with GC_malloc (etc.), and you don't bother calling free. Boehm's GC doesn't allocate memory more quickly than malloc (it is about the same, or can be 30% slower), but it has the advantage of dealing with useless memory zones automatically, which might improve your program (and certainly eases coding, since you no longer care about free). And Boehm's GC can also be used as a C++ allocator.
If you really think that malloc is too slow (but you should benchmark; most malloc-s take less than a microsecond), and if you fully understand the allocating behavior of your program, you might replace some malloc-s with your special allocator (which could, for instance, get memory from the kernel in big chunks using mmap and manage memory by yourself). But I believe doing that is a pain. In C++ you have the allocator concept and std::allocator_traits, with most standard container templates accepting such an allocator (see also std::allocator), e.g. the optional second template argument to std::vector, etc.
As others suggested, if you believe malloc is a bottleneck, you could allocate data in chunks (or using arenas), or just in an array.
Sometimes, implementing a specialized copying garbage collector (for some of your data) could help. Consider perhaps MPS.
But don't forget that premature optimization is evil; please benchmark & profile your application to understand exactly where time is lost.
Answered by Alexey
I have recently considered tcmalloc for a project at work. This is what I observed:
- Greatly improved performance for heavy usage of malloc in a multithreaded setting. I used it with a tool at work and the performance improved almost twofold. The reason is that in this tool a few threads were performing allocations of small objects in a critical loop. Using glibc, the performance suffers because of, I think, lock contention between malloc/free calls in different threads.
- Unfortunately, tcmalloc increases the memory footprint. The tool I mentioned above would consume two or three times more memory (as measured by the maximum resident set size). The increased footprint is a no-go for us since we are actually looking for ways to reduce memory footprint.
In the end I decided not to use tcmalloc and instead optimized the application code directly: this means removing the allocations from the inner loops to avoid the malloc/free lock contention. (For the curious, I used a form of compression rather than memory pools.)
The lesson for you would be that you should carefully measure your application with typical workloads. If you can afford the additional memory usage, tcmalloc could be great for you. If not, tcmalloc is still useful for seeing what you would gain by avoiding frequent cross-thread calls to the memory allocator.
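For this kind of measurement, both allocators can usually be tried without recompiling by preloading them over glibc's malloc. The exact library paths vary by distribution; the paths below are just typical examples:

```shell
# Try tcmalloc for one run, no rebuild needed (path is distro-dependent):
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtcmalloc.so ./your_app

# Same idea with jemalloc:
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so ./your_app
```

This makes it cheap to compare both throughput and the resident-set-size trade-off on your real workload.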
Answered by rogerdpack
Be aware that, according to the 'nedmalloc' homepage, modern OS allocators are actually pretty fast now:
"Windows 7, Linux 3.x, FreeBSD 8, Mac OS X 10.6 all contain state-of-the-art allocators and no third party allocator is likely to significantly improve on them in real world results"
http://www.nedprod.com/programs/portable/nedmalloc
So you might be able to get away with just recommending your users upgrade or something like it :)