boost::flat_map and its performance compared to map and unordered_map

Tags: c++, boost, map

Asked by naumcho

It is common knowledge in programming that memory locality improves performance a lot due to cache hits. I recently found out about boost::flat_map, which is a vector-based implementation of a map. It doesn't seem to be nearly as popular as your typical map/unordered_map, so I haven't been able to find any performance comparisons. How does it compare, and what are the best use cases for it?

Thanks!

Answered by v.oddou

I have run a benchmark on different data structures very recently at my company, so I feel I need to drop a word about it. It is very complicated to benchmark something correctly.

Benchmarking

On the web we rarely find (if ever) a well-engineered benchmark. Until today I had only found benchmarks that were done the journalist way (pretty quickly, sweeping dozens of variables under the carpet).

1) You need to consider cache warming

Most people running benchmarks are afraid of timer discrepancy, so they run their code thousands of times and measure the total time; they are just careful to use the same number of iterations for every operation, and then consider the results comparable.

The truth is, in the real world this makes little sense, because your cache will not be warm, and your operation will likely be called just once. Therefore you need to benchmark using RDTSC, and time each operation by calling it once only. Intel has published a paper describing how to use RDTSC (using a cpuid instruction to flush the pipeline, and calling it at least 3 times at the beginning of the program to stabilize it).

2) RDTSC accuracy measurement

I also recommend doing this:

#include <intrin.h>     // __cpuid, __rdtsc (MSVC intrinsics)
#include <algorithm>    // std::min, std::max
#include <climits>      // ULLONG_MAX
#include <cstdio>       // printf

typedef unsigned long long u64;

u64 g_correctionFactor;  // number of clocks to offset after each measurement to remove the overhead of the measurer itself.
u64 g_accuracy;

static u64 const errormeasure = ~((u64)0);

#ifdef _MSC_VER
#pragma intrinsic(__rdtsc)
inline u64 GetRDTSC()
{
    int a[4];
    __cpuid(a, 0x80000000);  // serializing instruction: flush the OOO instruction pipeline
    return __rdtsc();
}

inline void WarmupRDTSC()
{
    int a[4];
    __cpuid(a, 0x80000000);  // warm up cpuid.
    __cpuid(a, 0x80000000);
    __cpuid(a, 0x80000000);

    // measure the measurer's overhead with the measurer (crazy, heh..)
    u64 minDiff = ULLONG_MAX;
    u64 maxDiff = 0;   // this is going to help calculate our PRECISION ERROR MARGIN
    for (int i = 0; i < 80; ++i)
    {
        u64 tick1 = GetRDTSC();
        u64 tick2 = GetRDTSC();
        minDiff = std::min(minDiff, tick2 - tick1);   // make many takes, keep the smallest ever seen.
        maxDiff = std::max(maxDiff, tick2 - tick1);
    }
    g_correctionFactor = minDiff;

    printf("Correction factor %llu clocks\n", g_correctionFactor);

    g_accuracy = maxDiff - minDiff;
    printf("Measurement Accuracy (in clocks) : %llu\n", g_accuracy);
}
#endif

This is a discrepancy measurer, and it takes the minimum of all measured values, to avoid getting a value around -10^18 (one of the first negative values in a 64-bit representation) from time to time.

Notice the use of intrinsics rather than inline assembly. First, inline assembly is rarely supported by compilers nowadays. But much worse, the compiler creates a full ordering barrier around inline assembly, because it cannot statically analyze the inside, so this is a problem when benchmarking real-world code, especially when calling things just once. An intrinsic is suited here, because it doesn't break the compiler's free reordering of instructions.

3) Parameters

The last problem is that people usually test too few variations of the scenario. A container's performance is affected by:

  1. the allocator
  2. the size of the contained type
  3. the cost of the contained type's copy, assignment, move, and construction operations
  4. the number of elements in the container (the size of the problem)
  5. whether the type's operations from point 3 are trivial
  6. whether the type is a POD

Point 1 is important because containers allocate from time to time, and it matters a lot whether they allocate using the CRT "new" or some user-defined operation, like pool allocation, a freelist, or other...

(for people interested in pt 1, join the mystery thread on gamedev about system allocator performance impact)

Point 2 is because some containers (say A) will lose time copying things around, and the bigger the type, the bigger the overhead. The problem is that when comparing to another container B, A may win over B for small types and lose for larger types.

Point 3 is the same as point 2, except it multiplies the cost by some weighting factor.

Point 4 is a question of big-O mixed with cache issues. Some bad-complexity containers can largely outperform low-complexity containers for a small number of elements (like map vs. vector: the vector's cache locality is good, but map fragments the memory). And then at some crossing point they lose, because the contained overall size starts to "leak" into main memory and cause cache misses, plus the fact that the asymptotic complexity starts to be felt.

Point 5 is about compilers being able to elide operations that are empty or trivial at compile time. This can greatly optimize some operations, because the containers are templated, so each type will have its own performance profile.

Point 6 is the same as point 5: PODs can benefit from the fact that copy construction is just a memcpy, and some containers can have a specific implementation for these cases, using partial template specializations or SFINAE to select algorithms according to the traits of T.

About the flat map

Apparently the flat map is a sorted-vector wrapper, like Loki's AssocVector, but with some supplementary modernizations that come with C++11, exploiting move semantics to accelerate insertion and deletion of single elements.

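For reference, a minimal usage sketch (assuming Boost is available; the container lives in the boost::container namespace):

#include <boost/container/flat_map.hpp>
#include <cstdio>

int main()
{
    // Backed by one contiguous sorted vector: cache-friendly lookups,
    // but each insert shifts elements to keep the key ordering.
    boost::container::flat_map<int, const char*> fm;
    fm[3] = "three";
    fm[1] = "one";
    fm[2] = "two";

    for (const auto& kv : fm)  // iterates in key order: 1, 2, 3
        std::printf("%d -> %s\n", kv.first, kv.second);
}
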
This is still an ordered container. Most people usually don't need the ordering part; hence the existence of unordered...

Have you considered that maybe you need a flat_unordered_map? Which would be something like google::sparse_map or something like that: an open-address hash map.

The problem with open-address hash maps is that at rehash time they have to copy everything over to the new, extended flat land, whereas a standard unordered map just has to recreate the hash index, while the allocated data stays where it is. The disadvantage, of course, is that the memory is fragmented like hell.

The criterion for a rehash in an open-address hash map is when the capacity exceeds the size of the bucket vector multiplied by the load factor.

A typical load factor is 0.8, so you need to care about it. If you can pre-size your hash map before filling it, always pre-size to intended_filling * (1/0.8) + epsilon; this guarantees never having to spuriously rehash and recopy everything during filling.

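A small sketch of that arithmetic (my illustration; the 0.8 load factor is the assumption above, and on the closed-address side std::unordered_map::reserve does the equivalent math for you):

#include <cstddef>
#include <unordered_map>

// Slots to pre-allocate in an open-address table so that filling it with
// intended_filling elements never crosses the rehash threshold.
std::size_t presize_for(std::size_t intended_filling, float load_factor = 0.8f)
{
    return static_cast<std::size_t>(intended_filling / load_factor) + 1;  // +1 is the epsilon
}

void example()
{
    std::unordered_map<int, int> m;
    m.reserve(10000);  // closed-address: the library derives bucket count from max_load_factor
    // an open-address map would need presize_for(10000) slots reserved by hand
}
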
The advantage of closed-address maps (std::unordered..) is that you don't have to care about those parameters.

But boost::flat_map is an ordered vector; therefore, it will always have log(N) asymptotic complexity, which is worse than the open-address hash map (amortized constant time). You should consider that as well.

Benchmark results

This is a test involving different maps (with an int key and __int64/somestruct as the value) and std::vector.

tested types information:

typeid=__int64 .  sizeof=8 . ispod=yes
typeid=struct MediumTypePod .  sizeof=184 . ispod=yes

Insertion

EDIT:

My previous results included a bug: they actually tested ordered insertion, which exhibited very fast behavior for the flat maps.
I left those results further down this page because they are interesting.
This is the correct test:

[chart: random insert 100]

[chart: random insert 10000]

I have checked the implementation: there is no such thing as a deferred sort implemented in the flat maps here. Each insertion sorts on the fly, so this benchmark exhibits the asymptotic tendencies:

map : O(N * log(N))
hashmaps : O(N)
vector and flatmaps : O(N * N)

Warning: hereafter, the 2 tests for std::map and both flat_maps are buggy and actually test ordered insertion (vs. random insertion for the other containers; yes, it's confusing, sorry):

[chart: mixed insert of 100 elements without reservation]

We can see that ordered insertion results in back-pushing and is extremely fast. However, from the non-charted results of my benchmark, I can also say that this is not near the absolute optimum for a back-insertion. At 10k elements, perfect back-insertion optimality is obtained on a pre-reserved vector, which gives us 3 million cycles; here we observe 4.8M for the ordered insertion into the flat_map (therefore 160% of the optimum).

[chart: mixed insert of 10000 elements without reservation]

Analysis: remember this is 'random insert' for the vector, so the massive 1 billion cycles comes from having to shift half of the data (on average) upward, one element at a time, at each insertion.

Random search of 3 elements (clocks renormalized to 1)

in size = 100

[chart: rand search within container of 100 elements]

in size = 10000

[chart: rand search within container of 10000 elements]

Iteration

over size 100 (only MediumPod type)

[chart: iteration over 100 medium pods]

over size 10000 (only MediumPod type)

[chart: iteration over 10000 medium pods]

Final grain of salt

In the end, I wanted to come back to "Benchmarking §3 Pt1" (the system allocator). In a recent experiment around the performance of an open-address hash map I developed, I measured a performance gap of more than 3000% between Windows 7 and Windows 8 on some std::unordered_map use cases (discussed here).
This makes me want to warn the reader about the above results (they were obtained on Win7): your mileage may vary.

best regards

Answered by Ylisar

From the docs it seems this is analogous to Loki::AssocVector, of which I'm a fairly heavy user. Since it's based on a vector, it has the characteristics of a vector; that is to say:

  • Iterators get invalidated whenever size grows beyond capacity.
  • When it grows beyond capacity, it needs to reallocate and move objects over, i.e. insertion isn't guaranteed constant time, except for the special case of inserting at the end when capacity > size.
  • Lookup is faster than std::map due to cache locality; it's a binary search, which otherwise has the same performance characteristics as std::map.
  • It uses less memory because it isn't a linked binary tree.
  • It never shrinks unless you forcibly tell it to (since that triggers reallocation).

The best use is when you know the number of elements in advance (so you can reserve upfront), or when insertion/removal is rare but lookup is frequent. Iterator invalidation makes it a bit cumbersome in some use cases, so the two aren't interchangeable in terms of program correctness.

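As a small sketch of the reserve-upfront pattern (my illustration, assuming boost::container::flat_map, which exposes a vector-like reserve()):

#include <boost/container/flat_map.hpp>

void build_lookup_table()
{
    boost::container::flat_map<int, double> table;
    table.reserve(1000);            // one allocation upfront: no reallocation while filling

    for (int i = 0; i < 1000; ++i)
        table.emplace(i, i * 0.5);  // keys arrive in order, so each insert lands at the end

    // from here on, frequent lookups benefit from the contiguous, cache-friendly layout
    double v = table.at(42);
    (void)v;
}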