C++: What's the most efficient way to erase duplicates and sort a vector?

Question

Asked by Kyle Ryan

I need to take a C++ vector with potentially a lot of elements, erase duplicates, and sort it.

I currently have the below code, but it doesn't work.

vec.erase(
      std::unique(vec.begin(), vec.end()),
      vec.end());
std::sort(vec.begin(), vec.end());

How can I correctly do this?

Additionally, is it faster to erase the duplicates first (as in the code above) or to perform the sort first? If I do perform the sort first, is the vector guaranteed to remain sorted after std::unique is executed?

Or is there another (perhaps more efficient) way to do all this?

Answer 1

Answered by Nate Kohl

I agree with R. Pate and Todd Gardner; a std::set might be a good idea here. Even if you're stuck using vectors, if you have enough duplicates, you might be better off creating a set to do the dirty work.

Let's compare three approaches:

Just using vector, sort + unique

sort( vec.begin(), vec.end() );
vec.erase( unique( vec.begin(), vec.end() ), vec.end() );

Convert to set (manually)

set<int> s;
unsigned size = vec.size();
for( unsigned i = 0; i < size; ++i ) s.insert( vec[i] );
vec.assign( s.begin(), s.end() );

Convert to set (using a constructor)

set<int> s( vec.begin(), vec.end() );
vec.assign( s.begin(), s.end() );

Here's how these perform as the number of duplicates changes:

[Figure: comparison of vector and set approaches]

Summary: when the number of duplicates is large enough, it's actually faster to convert to a set and then dump the data back into a vector.

And for some reason, doing the set conversion manually seems to be faster than using the set constructor -- at least on the toy random data that I used.

Answer 2

Answered by alexk7

I redid Nate Kohl's profiling and got different results. For my test case, directly sorting the vector is always more efficient than using a set. I added a new more efficient method, using an unordered_set.

Keep in mind that the unordered_set method only works if you have a good hash function for the type you need uniqued and sorted. For ints, this is easy! (Many standard library implementations provide a default hash for ints that is simply the identity function.) Also, don't forget to sort at the end, since unordered_set is, well, unordered :)
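For a type without a standard hash, the unordered_set route needs a hash functor and operator==, plus operator< for the final sort. The following is a minimal sketch under those assumptions; `Point`, `PointHash`, and `dedupe_and_sort` are hypothetical names for illustration, and the hash combiner is only one common choice, not a tuned one.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <unordered_set>
#include <vector>

// Hypothetical record type, not from the original post.
struct Point {
    int x;
    int y;
    bool operator==(const Point& o) const { return x == o.x && y == o.y; }
    bool operator<(const Point& o) const { return x != o.x ? x < o.x : y < o.y; }
};

// Illustrative hash combiner; hash quality affects performance, not correctness.
struct PointHash {
    std::size_t operator()(const Point& p) const {
        std::size_t h1 = std::hash<int>{}(p.x);
        std::size_t h2 = std::hash<int>{}(p.y);
        return h1 ^ (h2 + 0x9e3779b9u + (h1 << 6) + (h1 >> 2));
    }
};

// Deduplicate through the hash set, then sort, since the set has no order.
void dedupe_and_sort(std::vector<Point>& vec) {
    std::unordered_set<Point, PointHash> seen(vec.begin(), vec.end());
    vec.assign(seen.begin(), seen.end());
    std::sort(vec.begin(), vec.end());
}
```

Whether this beats a plain sort + unique for such a type depends on the hash and the data, so it is worth measuring.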

I did some digging inside the set and unordered_set implementations and discovered that the constructor actually constructs a new node for every element before checking its value to determine whether it should be inserted (in the Visual Studio implementation, at least).

Here are the 5 methods:

f1: Just using vector, sort + unique

sort( vec.begin(), vec.end() );
vec.erase( unique( vec.begin(), vec.end() ), vec.end() );

f2: Convert to set (using a constructor)

set<int> s( vec.begin(), vec.end() );
vec.assign( s.begin(), s.end() );

f3: Convert to set (manually)

set<int> s;
for (int i : vec)
    s.insert(i);
vec.assign( s.begin(), s.end() );

f4: Convert to unordered_set (using a constructor)

unordered_set<int> s( vec.begin(), vec.end() );
vec.assign( s.begin(), s.end() );
sort( vec.begin(), vec.end() );

f5: Convert to unordered_set (manually)

unordered_set<int> s;
for (int i : vec)
    s.insert(i);
vec.assign( s.begin(), s.end() );
sort( vec.begin(), vec.end() );

I did the test with a vector of 100,000,000 ints chosen randomly in the ranges [1,10], [1,1000], and [1,100000].

The results (in seconds, smaller is better):

range         f1       f2       f3       f4      f5
[1,10]      1.6821   7.6804   2.8232   6.2634  0.7980
[1,1000]    5.0773  13.3658   8.2235   7.6884  1.9861
[1,100000]  8.7955  32.1148  26.5485  13.3278  3.9822
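Timings like these vary with compiler, standard library, and hardware, so it may be worth rerunning them locally. Below is a rough sketch of a harness for f1 and f5; `make_input` and `time_seconds` are hypothetical helpers, and the seed and generator are assumptions, not the setup used for the table above.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <random>
#include <unordered_set>
#include <vector>

// Builds n random ints in [1, max_value] (seed chosen arbitrarily).
std::vector<int> make_input(std::size_t n, int max_value, unsigned seed = 42) {
    std::mt19937 gen(seed);
    std::uniform_int_distribution<int> dist(1, max_value);
    std::vector<int> v(n);
    for (int& x : v) x = dist(gen);
    return v;
}

// f1: sort + unique on the vector itself.
void f1(std::vector<int>& vec) {
    std::sort(vec.begin(), vec.end());
    vec.erase(std::unique(vec.begin(), vec.end()), vec.end());
}

// f5: manual insertion into an unordered_set, then sort.
void f5(std::vector<int>& vec) {
    std::unordered_set<int> s;
    for (int i : vec) s.insert(i);
    vec.assign(s.begin(), s.end());
    std::sort(vec.begin(), vec.end());
}

// Times one call of f on a private copy of the input, in seconds.
template <typename F>
double time_seconds(F f, std::vector<int> vec) {
    auto start = std::chrono::steady_clock::now();
    f(vec);
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

For example, `time_seconds(f1, make_input(1000000, 1000))` times one run of f1. Build with optimizations enabled and repeat each measurement several times before drawing conclusions.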

Answer 3

Answered by jskinner

std::unique only removes duplicate elements if they're neighbours: you have to sort the vector first for it to work as you intend.

std::unique is defined to be stable, so the vector will still be sorted after running unique on it.
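To make the difference concrete, here is a small sketch that contrasts calling std::unique on unsorted input with sorting first. The function names `unique_only` and `sorted_unique` are made up for this illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Runs unique without sorting: only runs of adjacent equal elements collapse.
std::vector<int> unique_only(std::vector<int> v) {
    v.erase(std::unique(v.begin(), v.end()), v.end());
    return v;
}

// Sorts first, so all equal elements are adjacent when unique runs.
std::vector<int> sorted_unique(std::vector<int> v) {
    std::sort(v.begin(), v.end());
    v.erase(std::unique(v.begin(), v.end()), v.end());
    // unique keeps the relative order of the survivors, so v stays sorted.
    assert(std::is_sorted(v.begin(), v.end()));
    return v;
}
```

With the input {3, 1, 3, 3, 2, 1}, `unique_only` yields {3, 1, 3, 2, 1}, while `sorted_unique` yields {1, 2, 3}.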

Answer 4

Answered by Todd Gardner

I'm not sure what you are using this for, so I can't say this with 100% certainty, but normally when I think of a "sorted, unique" container, I think of a std::set. It might be a better fit for your use case:

std::set<Foo> foos(vec.begin(), vec.end()); // both sorted & unique already
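Note that std::set needs an ordering for its element type. The sketch below assumes a hypothetical `Foo` with two fields and supplies operator< via std::tie; the real `Foo` is not shown in the answer, so the fields and the helper `to_sorted_unique` are illustrative only.

```cpp
#include <set>
#include <string>
#include <tuple>
#include <vector>

// Hypothetical element type; the answer does not define Foo.
struct Foo {
    std::string name;
    int id;
};

// Strict weak ordering used by std::set to sort and detect duplicates.
bool operator<(const Foo& a, const Foo& b) {
    return std::tie(a.name, a.id) < std::tie(b.name, b.id);
}

// Builds a sorted, duplicate-free set from the vector's elements.
std::set<Foo> to_sorted_unique(const std::vector<Foo>& vec) {
    return std::set<Foo>(vec.begin(), vec.end());
}
```

Two elements count as duplicates here when neither compares less than the other, so the comparison decides what "unique" means.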

Otherwise, sorting prior to calling unique (as the other answers pointed out) is the way to go.

Answer 5

Answered by David Seiler

std::unique only works on consecutive runs of duplicate elements, so you'd better sort first. However, it is stable, so your vector will remain sorted.

Answer 6

Answered by DShook

Here's a template to do it for you:

template<typename T>
void removeDuplicates(std::vector<T>& vec)
{
    std::sort(vec.begin(), vec.end());
    vec.erase(std::unique(vec.begin(), vec.end()), vec.end());
}

Call it like:

removeDuplicates<int>(vectorname);
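The explicit `<int>` is optional, since the compiler can deduce T from the argument. A small self-contained sketch follows, repeating the template so it compiles on its own; the function `example` and its data are made up for illustration.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Same template as above, repeated so this snippet is self-contained.
template <typename T>
void removeDuplicates(std::vector<T>& vec) {
    std::sort(vec.begin(), vec.end());
    vec.erase(std::unique(vec.begin(), vec.end()), vec.end());
}

// Illustrative caller: T is deduced as std::string.
std::vector<std::string> example() {
    std::vector<std::string> words{"pear", "apple", "pear", "fig"};
    removeDuplicates(words);
    return words;  // {"apple", "fig", "pear"}
}
```

The template works for any T that supports operator< and operator==.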

Answer 7

Answered by DShook

Efficiency is a complicated concept. There are time vs. space considerations, as well as general measurements (where you only get vague answers such as O(n)) vs. specific ones (e.g. bubble sort can be much faster than quicksort, depending on input characteristics).

If you have relatively few duplicates, then sort followed by unique and erase seems the way to go. If you had relatively many duplicates, creating a set from the vector and letting it do the heavy lifting could easily beat it.

Don't just concentrate on time efficiency either. Sort+unique+erase operates in O(1) space, while the set construction operates in O(n) space. And neither directly lends itself to a map-reduce parallelization (for really huge datasets).

Answer 8

Answered by David Johnstone

You need to sort it before you call unique because unique only removes duplicates that are next to each other.

edit: 38 seconds...

Answer 9

Answered by Peter

unique only removes consecutive duplicate elements (which is necessary for it to run in linear time), so you should perform the sort first. The vector will remain sorted after the call to unique.

Answer 10

Answered by yury

If you do not want to change the order of elements, then you can try this solution:

template <class T>
void RemoveDuplicatesInVector(std::vector<T> & vec)
{
    // Tracks values already seen; insert().second is false for a repeat.
    std::set<T> values;
    vec.erase(std::remove_if(vec.begin(), vec.end(),
                             [&](const T & value) { return !values.insert(value).second; }),
              vec.end());
}
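If T is hashable, the same order-preserving idea can be sketched with std::unordered_set instead of std::set. This is a swapped-in variant, not the answer's code: `RemoveDuplicatesKeepOrder` is a hypothetical name, and it assumes std::hash<T> and operator== exist for T.

```cpp
#include <unordered_set>
#include <utility>
#include <vector>

// Keeps the first occurrence of each value and preserves the original order.
// Assumes T has std::hash<T> and operator==.
template <class T>
void RemoveDuplicatesKeepOrder(std::vector<T>& vec) {
    std::unordered_set<T> seen;
    seen.reserve(vec.size());
    auto out = vec.begin();
    for (auto it = vec.begin(); it != vec.end(); ++it) {
        // insert() copies the value into the set; .second is true for new values.
        if (seen.insert(*it).second) {
            if (out != it) *out = std::move(*it);
            ++out;
        }
    }
    vec.erase(out, vec.end());
}
```

An explicit loop is used here so that each element is visited exactly once and in order, without relying on how an algorithm applies a stateful predicate.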
