Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same CC BY-SA terms and attribute it to the original authors (not me): StackOverflow.
Original URL: http://stackoverflow.com/questions/11614106/
Is gcc std::unordered_map implementation slow? If so - why?
Asked by Markus Pilman
We are developing highly performance-critical software in C++, where we need a concurrent hash map, so we implemented one. We then wrote a benchmark to figure out how much slower our concurrent hash map is compared with std::unordered_map.

But std::unordered_map seems to be incredibly slow... So this is our micro-benchmark (for the concurrent map we spawned a new thread to make sure that locking does not get optimized away, and note that I never insert 0 because I also benchmark with google::dense_hash_map, which needs a null value):
boost::random::mt19937 rng;
boost::random::uniform_int_distribution<> dist(std::numeric_limits<uint64_t>::min(), std::numeric_limits<uint64_t>::max());
std::vector<uint64_t> vec(SIZE);
for (int i = 0; i < SIZE; ++i) {
    uint64_t val = 0;
    while (val == 0) {
        val = dist(rng);
    }
    vec[i] = val;
}

std::unordered_map<int, long double> map;
auto begin = std::chrono::high_resolution_clock::now();
for (int i = 0; i < SIZE; ++i) {
    map[vec[i]] = 0.0;
}
auto end = std::chrono::high_resolution_clock::now();
auto elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(end - begin);
std::cout << "inserts: " << elapsed.count() << std::endl;

std::random_shuffle(vec.begin(), vec.end());
begin = std::chrono::high_resolution_clock::now();
long double val;
for (int i = 0; i < SIZE; ++i) {
    val = map[vec[i]];
}
end = std::chrono::high_resolution_clock::now();
elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(end - begin);
std::cout << "get: " << elapsed.count() << std::endl;
(EDIT: the whole source code can be found here: http://pastebin.com/vPqf7eya)

The result for std::unordered_map is:
inserts: 35126
get : 2959
For google::dense_hash_map:
inserts: 3653
get : 816
For our hand-backed concurrent map (which does locking, although the benchmark is single-threaded - but in a separately spawned thread):
inserts: 5213
get : 2594
If I compile the benchmark program without pthread support and run everything in the main thread, I get the following results for our hand-backed concurrent map:
inserts: 4441
get : 1180
I compile with the following command:
g++-4.7 -O3 -DNDEBUG -I/tmp/benchmap/sparsehash-2.0.2/src/ -std=c++11 -pthread main.cc
So especially inserts on std::unordered_map seem to be extremely expensive - 35 seconds vs. 3-5 seconds for the other maps. The lookup time also seems to be quite high.
My question: why is this? I read another question on StackOverflow where someone asked why std::tr1::unordered_map is slower than his own implementation. There the highest-rated answer states that std::tr1::unordered_map needs to implement a more complicated interface. But I cannot see this argument: we use a bucket approach in our concurrent_map, and std::unordered_map uses a bucket approach too (google::dense_hash_map does not, but then std::unordered_map should be at least as fast as our hand-backed, concurrency-safe version). Apart from that, I cannot see anything in the interface that forces a feature which makes the hash map perform badly...
So my question: is it true that std::unordered_map really is very slow? If no: what is wrong? If yes: what is the reason for that?
And my main question: why is inserting a value into a std::unordered_map so terribly expensive (even if we reserve enough space at the beginning, it does not perform much better - so rehashing seems not to be the problem)?
EDIT:
First of all: yes, the presented benchmark is not flawless - this is because we played around a lot with it and it is just a hack (for example, using a uint64 distribution to generate ints would in practice not be a good idea, excluding 0 in a loop is kind of stupid, etc...).
At the moment most comments explain that I can make the unordered_map faster by preallocating enough space for it. In our application this is just not possible: we are developing a database management system and need a hash map to store some data during a transaction (for example, locking information). So this map can hold anything from 1 entry (the user just makes one insert and commits) to billions of entries (if full table scans happen). It is just impossible to preallocate enough space here (and just allocating a lot in the beginning would consume too much memory).
Furthermore, I apologize that I did not state my question clearly enough: I am not really interested in making unordered_map fast (using Google's dense hash map works fine for us), I just don't really understand where this huge performance difference comes from. It cannot be just preallocation (even with enough preallocated memory, the dense map is an order of magnitude faster than unordered_map, and our hand-backed concurrent map starts with an array of size 64 - so a smaller one than unordered_map).
So what is the reason for this bad performance of std::unordered_map? Or, asked differently: could one write an implementation of the std::unordered_map interface which is standard-conforming and (nearly) as fast as Google's dense hash map? Or is there something in the standard that forces the implementer to choose an inefficient way to implement it?
EDIT 2:
By profiling I see that a lot of time is spent on integer divisions. std::unordered_map uses prime numbers for the array size, while the other implementations use powers of two. Why does std::unordered_map use prime numbers? To perform better if the hash is bad? For good hashes it imho makes no difference.
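To make the cost difference concrete, here is a small illustrative sketch (the constants are arbitrary examples, not libstdc++'s internals): a prime bucket count forces a real integer division per operation, while a power-of-two count reduces the same step to a single bit mask. The flip side is that masking throws away the high bits of the hash, so power-of-two tables depend on a well-mixed hash function, whereas a prime modulus spreads even poor hashes reasonably well.

#include <cstdint>
#include <cstdio>

// Map a hash value to a bucket index.
std::uint64_t bucket_prime(std::uint64_t hash) {
    return hash % 100003;        // prime bucket count: needs an integer division (tens of cycles)
}

std::uint64_t bucket_pow2(std::uint64_t hash) {
    return hash & (131072 - 1);  // 2^17 buckets: a single AND instruction
}

int main() {
    std::uint64_t h = 0x9e3779b97f4a7c15ULL;  // arbitrary example hash value
    std::printf("%llu %llu\n",
                static_cast<unsigned long long>(bucket_prime(h)),
                static_cast<unsigned long long>(bucket_pow2(h)));
}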
EDIT 3:
These are the numbers for std::map:
inserts: 16462
get : 16978
Sooooooo: why are inserts into a std::map faster than inserts into a std::unordered_map... I mean WAT? std::map has worse locality (tree vs. array), needs to make more allocations (per insert, vs. per rehash plus ~1 for each collision) and, most importantly, has a different algorithmic complexity (O(log n) vs. O(1))!
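For scale (simple arithmetic based on the claims above, not a measurement): with SIZE = 10^7, the tree has depth of about log2(10^7) ≈ 23, so each std::map insert does roughly 23 key comparisons plus one node allocation, while each unordered_map insert should do one hash, one bucket-index computation, and a short chain walk - which is exactly why these numbers look backwards.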
Accepted answer by Markus Pilman
I found the reason: it is a problem of gcc-4.7!!
With gcc-4.7
inserts: 37728
get : 2985
With gcc-4.6
inserts: 2531
get : 1565
So std::unordered_map in gcc-4.7 is broken (or my installation, which is an installation of gcc-4.7.0 on Ubuntu - and another installation which is gcc 4.7.1 on Debian testing).
I will submit a bug report... Until then: DO NOT use std::unordered_map with gcc 4.7!
Answered by jxh
I am guessing that you have not properly sized your unordered_map, as Ylisar suggested. When chains grow too long in unordered_map, the g++ implementation will automatically rehash to a larger hash table, and this would be a big drag on performance. If I remember correctly, unordered_map defaults to (the smallest prime larger than) 100.
I didn't have chrono on my system, so I timed with times().
#include <sys/times.h>   // times()
#include <unistd.h>      // sysconf()
#include <iostream>

template <typename TEST>
void time_test (TEST t, const char *m) {
    struct tms start;
    struct tms finish;
    long ticks_per_second;

    times(&start);
    t();
    times(&finish);
    ticks_per_second = sysconf(_SC_CLK_TCK);
    std::cout << "elapsed: "
              << ((finish.tms_utime - start.tms_utime
                   + finish.tms_stime - start.tms_stime)
                  / (1.0 * ticks_per_second))
              << " " << m << std::endl;
}
I used a SIZE of 10000000, and had to change things a bit for my version of boost. Also note that I pre-sized the hash table to match SIZE/DEPTH, where DEPTH is an estimate of the length of the bucket chain due to hash collisions.
Edit: Howard points out to me in comments that the max load factor for unordered_map is 1. So, the DEPTH controls how many times the code will rehash.
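A back-of-the-envelope sketch of that relationship (this assumes the table roughly doubles on each rehash; libstdc++ actually steps through a table of primes, so the exact count differs slightly):

#include <cmath>
#include <cstdio>

int main() {
    double size = 10000000.0;        // number of unique insertions (SIZE)
    double depth = 3.0;              // initial table is SIZE/DEPTH buckets
    double max_load_factor = 1.0;    // unordered_map's default
    // Buckets must grow from size/depth up to size/max_load_factor:
    double rehashes = std::ceil(std::log2((size / max_load_factor) / (size / depth)));
    std::printf("approx rehashes: %.0f\n", rehashes);  // ~2 for DEPTH == 3
}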
#include <boost/random.hpp>
#include <algorithm>
#include <cstdint>
#include <limits>
#include <unordered_map>
#include <vector>

#define SIZE 10000000
#define DEPTH 3

std::vector<uint64_t> vec(SIZE);
boost::mt19937 rng;
boost::uniform_int<uint64_t> dist(std::numeric_limits<uint64_t>::min(),
                                  std::numeric_limits<uint64_t>::max());
std::unordered_map<int, long double> map(SIZE/DEPTH);

void
test_insert () {
    for (int i = 0; i < SIZE; ++i) {
        map[vec[i]] = 0.0;
    }
}

void
test_get () {
    long double val;
    for (int i = 0; i < SIZE; ++i) {
        val = map[vec[i]];
    }
}

int main () {
    for (int i = 0; i < SIZE; ++i) {
        uint64_t val = 0;
        while (val == 0) {
            val = dist(rng);
        }
        vec[i] = val;
    }
    time_test(test_insert, "inserts");
    std::random_shuffle(vec.begin(), vec.end());
    time_test(test_get, "get");
}
Edit:
I modified the code so that I could change DEPTH more easily.
#ifndef DEPTH
#define DEPTH 10000000
#endif
So, by default, the worst size for the hash table is chosen.
elapsed: 7.12 inserts, elapsed: 2.32 get, -DDEPTH=10000000
elapsed: 6.99 inserts, elapsed: 2.58 get, -DDEPTH=1000000
elapsed: 8.94 inserts, elapsed: 2.18 get, -DDEPTH=100000
elapsed: 5.23 inserts, elapsed: 2.41 get, -DDEPTH=10000
elapsed: 5.35 inserts, elapsed: 2.55 get, -DDEPTH=1000
elapsed: 6.29 inserts, elapsed: 2.05 get, -DDEPTH=100
elapsed: 6.76 inserts, elapsed: 2.03 get, -DDEPTH=10
elapsed: 2.86 inserts, elapsed: 2.29 get, -DDEPTH=1
My conclusion is that there is no significant performance difference for any initial hash table size other than making it equal to the entire expected number of unique insertions. Also, I don't see the order-of-magnitude performance difference that you are observing.
Answered by Christian Leon
I have run your code on a 64-bit / AMD / 4-core (2.1 GHz) computer and it gave me the following results:
MinGW-W64 4.9.2:
Using std::unordered_map:
inserts: 9280
get: 3302
Using std::map:
inserts: 23946
get: 24824
VC 2015 with all the optimization flags I know:
Using std::unordered_map:
inserts: 7289
get: 1908
Using std::map:
inserts: 19222
get: 19711
I have not tested the code using GCC, but I think it may be comparable to the performance of VC, so if that is true, then GCC 4.9's std::unordered_map is still broken.
[EDIT]
So yes, as someone said in the comments, there is no reason to think that the performance of GCC 4.9.x would be comparable to VC performance. When I have the chance I will test the code on GCC.
My answer is just to establish some kind of knowledge base for the other answers.