C++ 在集合中查找重复元素并将它们分组的快速算法是什么?

Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must likewise follow CC BY-SA and attribute it to the original authors (not me): StackOverflow. Original: http://stackoverflow.com/questions/1332527/

Date: 2020-08-27 19:35:41  Source: igfitidea

what are the fast algorithms to find duplicate elements in a collection and group them?

c++, algorithm, elements, duplicate-data

Asked by t.g.

Say you have a collection of elements. How can you pick out those with duplicates and put them into groups with the least amount of comparison? Preferably in C++, but the algorithm is more important than the language. For example, given {E1,E2,E3,E4,E4,E2,E6,E4,E3}, I wish to extract {E2,E2}, {E3,E3}, {E4,E4,E4}. What data structure and algorithm will you choose? Please also include the cost of setting up the data structure, say, if it's a pre-sorted one like std::multimap.


Updates


To make things clearer, as suggested. There's one constraint: the elements must be compared by themselves to be certain they are duplicates.


So hashes do not apply, because in effect they shift the comparison from heavy elements (e.g. chunks of data) to light elements (integers); they reduce some comparisons, but do not do away with them, and in the end we are back to our original problem once we are inside one collision bucket.


Pretend you have a bunch of potential duplicate files of GBs each, which bear the same hash value under every hash algorithm human beings know. Now you are going to spot the real duplicates.


No, it can't be a real-life problem (even MD5 is enough to generate unique hashes for real-life files). But just pretend, so that we can focus on finding the data structure + algorithm that involves the least amount of comparison.




What I am doing is:


  1. represent the elements in an STL std::list data structure (in that (1) its element deletion is cheaper than, say, a vector's, and (2) its insertion is cheaper, requiring no sort.)

  2. pop out one element and compare it with the rest; if a duplicate is found, pull it out of the list. Once the end of the list is reached, one group of duplicates has been found, if any.

  3. repeat the above 2 steps until the list is empty.


It needs N-1 comparisons in the best case, but (N-1)! in the worst case.


What are the better alternatives?




My code, using the method explained above:


// algorithm to consume the std::list container,
// supports: list<path_type>,list< pair<std::string, paths_type::const_iterater>>
template<class T>
struct consume_list
{
    groups_type operator()(list<T>& l)
    {
        // remove spurious identicals and group the rest
        // algorithm:  
        // 1. compare the first element with the remaining elements, 
        //    pick out all duplicated files including the first element itself.
        // 2. start over again with the shrinked list
        //     until the list contains one or zero elements.

        groups_type sub_groups;           
        group_type one_group; 
        one_group.reserve(1024);

        while(l.size() > 1)
        {
            T front(l.front());
            l.pop_front();

            item_predicate<T> ep(front);
        typename list<T>::iterator it     = l.begin(); 
        typename list<T>::iterator it_end = l.end();
            while(it != it_end)
            {
                if(ep.equals(*it))
                {
                    one_group.push_back(ep.extract_path(*(it))); // single it out
                    it = l.erase(it);
                }
                else
                {
                    it++;
                }
            }

            // save results
            if(!one_group.empty())
            {
                // save
                one_group.push_back(ep.extract_path(front));                    
                sub_groups.push_back(one_group);

                // clear, memory allocation not freed
                one_group.clear(); 
            }            
        }
        return sub_groups;
    }        
}; 


// type for item-item comparison within a stl container, e.g.  std::list 
template <class T>
struct item_predicate{};

// specialization for type path_type      
template <>
struct item_predicate<path_type>
{
public:
    item_predicate(const path_type& base)/*init list*/            
    {}
public:
    bool equals(const path_type& comparee)
    {
        bool  result;
        /* time-consuming operations here*/
        return result;
    }

    const path_type& extract_path(const path_type& p)
    {
        return p;
    }
private:
    // class members
}; 




Thanks for the answers below; however, they seem to be misled by my example into thinking this is about integers. In fact the elements are type-agnostic (not necessarily integers, strings, or any other PODs), and the equality predicates are self-defined, that is, the comparison can be very heavy.


So maybe my question should be: which data structure + algorithm involves the fewest comparisons?


Using a pre-sorted container like std::multiset or std::multimap is not better, according to my test, since


  1. the sorting while inserting already does the comparisons,
  2. the subsequent adjacent-find does comparisons again,
  3. these data structures prefer less-than operations to equality operations: they perform two less-than comparisons (a < b, then b < a) to establish that two elements are equivalent.

I do not see how it can save comparisons.




One more thing ignored by some answers below: I need to differentiate the duplicate groups from one another, not just keep them in the container.




Conclusion


After all the discussion, there seem to be 3 ways


  1. my original naive method as explained above
  2. start with a linear container like std::vector, sort it, and then locate the equal ranges
  3. start with an associative container like std::map<Type, vector<duplicates>>, picking out the duplicates during the setup of the associative container, as suggested by Charles Bailey.

I've coded a sample to test all the methods as posted below.


The number of duplicates and how they are distributed may influence the best choice.


  • Method 1 is best when the duplicates fall heavily at the front, and worst when they fall at the end. Sorting does not change the distribution, only the direction.
  • Method 3 has the most even performance
  • Method 2 is never the best choice

Thanks to all who participated in the discussion.


One output with 20 sample items from the code below.


Test with [ 20 10 6 5 4 3 2 2 2 2 1 1 1 1 1 1 1 1 1 1 ]

and [ 1 1 1 1 1 1 1 1 1 1 2 2 2 2 3 4 5 6 10 20 ] respectively

using std::vector -> sort() -> adjacent_find():

comparisons: [ '<' = 139, '==' = 23 ]

comparisons: [ '<' = 38, '==' = 23 ]

using std::list -> sort() -> shrink list:

comparisons: [ '<' = 50, '==' = 43 ]

comparisons: [ '<' = 52, '==' = 43 ]

using std::list -> shrink list:

comparisons: [ '<' = 0, '==' = 121 ]

comparisons: [ '<' = 0, '==' = 43 ]

using std::vector -> std::map<Type, vector<Type>>:

comparisons: [ '<' = 79, '==' = 0 ]

comparisons: [ '<' = 53, '==' = 0 ]

using std::vector -> std::multiset -> adjacent_find():

comparisons: [ '<' = 79, '==' = 7 ]

comparisons: [ '<' = 53, '==' = 7 ]

Code


// compile with VC++10: cl.exe /EHsc

#include <vector>
#include <deque>
#include <list>
#include <map>
#include <set>
#include <algorithm>
#include <iostream>
#include <sstream>

#include <boost/foreach.hpp>
#include <boost/tuple/tuple.hpp>
#include <boost/format.hpp>

using namespace std;

struct Type
{
    Type(int i) : m_i(i){}

    bool operator<(const Type& t) const
    {
        ++number_less_than_comparison;
        return m_i < t.m_i;
    }

    bool operator==(const Type& t) const
    {
        ++number_equal_comparison;    
        return m_i == t.m_i;
    }
public:
    static void log(const string& operation)
    {
        cout 
        << "comparison during " <<operation << ": [ "
        << "'<'  = " << number_less_than_comparison
        << ", "
        << "'==' = " << number_equal_comparison
        << " ]\n";

        reset();
    }

    int to_int() const
    {
        return m_i;
    }
private:
    static void reset()
    {
        number_less_than_comparison = 0;
        number_equal_comparison = 0;      
    }

public:
    static size_t number_less_than_comparison;
    static size_t number_equal_comparison;    
private:
    int m_i;
};

size_t Type::number_less_than_comparison = 0;
size_t Type::number_equal_comparison = 0;  

ostream& operator<<(ostream& os, const Type& t) 
{
    os << t.to_int();
    return os;
}

template< class Container >
struct Test
{    
    void recursive_run(size_t n)
    { 
        for(size_t i = 48; i < n; ++i)
        {
            run(i);
        }    
    }

    void run(size_t i)
    {
        cout << 
        boost::format("\n\nTest %1% sample elements\nusing method%2%:\n") 
        % i 
        % Description();

        generate_sample(i);
        sort();
        locate();   

        generate_reverse_sample(i);
        sort();
        locate(); 
    }

private:    
    void print_me(const string& when)
    {
        std::stringstream ss;
        ss << when <<" = [ ";
        BOOST_FOREACH(const typename Container::value_type& v, m_container)
        {
            ss << v << " ";
        }
        ss << "]\n";    
        cout << ss.str();
    }

    void generate_sample(size_t n)
    {
        m_container.clear();
        for(size_t i = 1; i <= n; ++i)
        {
            m_container.push_back(Type(n/i));    
        }
        print_me("init value");
        Type::log("setup");
    }

    void generate_reverse_sample(size_t n)
    {
        m_container.clear();
        for(size_t i = 0; i < n; ++i)
        {
            m_container.push_back(Type(n/(n-i)));     
        }
        print_me("init value(reverse order)");
        Type::log("setup");
    }    

    void sort()
    {    
        sort_it();

        Type::log("sort");
        print_me("after sort");

    }

    void locate()
    {
        locate_duplicates();

        Type::log("locate duplicate");
    }
protected:
    virtual string Description() = 0;
    virtual void sort_it() = 0;
    virtual void locate_duplicates() = 0;
protected:
    Container m_container;    
};

struct Vector : Test<vector<Type> >
{    
    string Description()
    {
        return "std::vector<Type> -> sort() -> adjacent_find()";
    } 

private:           
    void sort_it()
    {    
        std::sort(m_container.begin(), m_container.end()); 
    }

    void locate_duplicates()
    {
        using std::adjacent_find;
        typedef vector<Type>::iterator ITR;
        typedef vector<Type>::value_type  VALUE;

        typedef boost::tuple<VALUE, ITR, ITR> TUPLE;
        typedef vector<TUPLE> V_TUPLE;

        V_TUPLE results;

        ITR itr_begin(m_container.begin());
        ITR itr_end(m_container.end());       
        ITR itr(m_container.begin()); 
        ITR itr_range_begin(m_container.begin());  

        while(itr_begin != itr_end)
        {     
            // find the start of one equal range
            itr = adjacent_find(
            itr_begin, 
            itr_end, 
                []  (VALUE& v1, VALUE& v2)
                {
                    return v1 == v2;
                }
            );
            if(itr_end == itr) break; // end of container

            itr_range_begin = itr; // start of this equal range

            // find the end of one equal range
            VALUE start = *itr; 
            while(itr != itr_end)
            {
                if(!(*itr == start)) break;                
                itr++;
            }

            results.push_back(TUPLE(start, itr_range_begin, itr));

            // prepare for next iteration
            itr_begin = itr;
        }  
    }
};

struct List : Test<list<Type> >
{
    List(bool sorted) : m_sorted(sorted){}

    string Description()
    {
        return m_sorted ? "std::list -> sort() -> shrink list" : "std::list -> shrink list";
    }
private:    
    void sort_it()
    {
        if(m_sorted) m_container.sort(); // std::list has its own sort(); std::sort needs random access
    }

    void locate_duplicates()
    {       
        typedef list<Type>::value_type VALUE;
        typedef list<Type>::iterator ITR;

        typedef vector<VALUE>  GROUP;
        typedef vector<GROUP>  GROUPS;

        GROUPS sub_groups;
        GROUP one_group; 

        while(m_container.size() > 1)
        {
            VALUE front(m_container.front());
            m_container.pop_front();

            ITR it     = m_container.begin(); 
            ITR it_end = m_container.end();
            while(it != it_end)
            {
                if(front == (*it))
                {
                    one_group.push_back(*it); // single it out
                    it = m_container.erase(it); // shrink list by one
                }
                else
                {
                    it++;
                }
            }

            // save results
            if(!one_group.empty())
            {
                // save
                one_group.push_back(front);                    
                sub_groups.push_back(one_group);

                // clear, memory allocation not freed
                one_group.clear(); 
            }            
        }
    }        

private:
    bool m_sorted;
};

struct Map : Test<vector<Type>>
{    
    string Description()
    {
        return "std::vector -> std::map<Type, vector<Type>>" ;
    }
private:    
    void sort_it() {}

    void locate_duplicates()
    {
        typedef map<Type, vector<Type> > MAP;
        typedef MAP::iterator ITR;

        MAP local_map;

        BOOST_FOREACH(const vector<Type>::value_type& v, m_container)
        {
            pair<ITR, bool> mit; 
            mit = local_map.insert(make_pair(v, vector<Type>(1, v)));   
            if(!mit.second) (mit.first->second).push_back(v); 
         }

        // drop entries without duplicates (a single occurrence only);
        // erase via post-increment so the iterator stays valid
        ITR itr(local_map.begin());
        while(itr != local_map.end())
        {
            if(itr->second.size() < 2)
                local_map.erase(itr++);
            else
                ++itr;
        }
    }        
};

struct Multiset :  Test<vector<Type>>
{
    string Description()
    {
        return "std::vector -> std::multiset<Type> -> adjacent_find()" ;
    }
private:
    void sort_it() {}

    void locate_duplicates()
    {   
        using std::adjacent_find;

        typedef multiset<Type> SET;
        typedef SET::iterator ITR;
        typedef SET::value_type  VALUE;

        typedef boost::tuple<VALUE, ITR, ITR> TUPLE;
        typedef vector<TUPLE> V_TUPLE;

        V_TUPLE results;

        SET local_set;
        BOOST_FOREACH(const vector<Type>::value_type& v, m_container)
        {
            local_set.insert(v);
        }

        ITR itr_begin(local_set.begin());
        ITR itr_end(local_set.end());       
        ITR itr(local_set.begin()); 
        ITR itr_range_begin(local_set.begin());  

        while(itr_begin != itr_end)
        {     
            // find the start of one equal range
            itr = adjacent_find(
            itr_begin, 
            itr_end, 
            []  (const VALUE& v1, const VALUE& v2)
                {
                    return v1 == v2;
                }
            );
            if(itr_end == itr) break; // end of container

            itr_range_begin = itr; // start of this equal range

            // find the end of one equal range
            VALUE start = *itr; 
            while(itr != itr_end)
            {
                if(!(*itr == start)) break;                
                itr++;
            }

            results.push_back(TUPLE(start, itr_range_begin, itr));

            // prepare for next iteration
            itr_begin = itr;
        }  
    } 
};

int main()
{
    const size_t N = 20;

    Vector().run(N);
    List(true).run(N);
    List(false).run(N);
    Map().run(N);
    Multiset().run(N);
}

Accepted answer by CB Bailey

You could use a map from a representative element to a list/vector/deque of other elements. This requires relatively fewer comparisons in insertion into the container and means that you can iterate through the resulting groups without having to perform any comparisons.


This example always inserts the first representative element into the mapped deque storage as it makes the subsequent iteration through the group logically simple, but if this duplication proves a problem then it would be easy to perform the push_back only if (!ins_pair.second).


typedef std::map<Type, std::deque<Type> > Storage;

void Insert(Storage& s, const Type& t)
{
    std::pair<Storage::iterator, bool> ins_pair( s.insert(std::make_pair(t, std::deque<Type>())) );
    ins_pair.first->second.push_back(t);
}
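A minimal sketch of that variant (int stands in for the heavy element type, and GroupDuplicates is an assumed convenience wrapper, not part of the original answer):

```cpp
#include <cassert>
#include <deque>
#include <map>
#include <utility>
#include <vector>

// Hypothetical element type; the real Type carries a heavy comparison.
typedef int Elem;
typedef std::map<Elem, std::deque<Elem> > Storage;

// Variant of Insert above: store an element only when it is a duplicate,
// so each deque holds just the extra copies beyond the representative key.
void InsertDuplicatesOnly(Storage& s, const Elem& t)
{
    std::pair<Storage::iterator, bool> ins_pair(
        s.insert(std::make_pair(t, std::deque<Elem>())));
    if (!ins_pair.second)               // key existed already -> t is a duplicate
        ins_pair.first->second.push_back(t);
}

// Assumed helper: build the storage from a whole sequence.
Storage GroupDuplicates(const std::vector<Elem>& v)
{
    Storage s;
    for (std::size_t i = 0; i < v.size(); ++i)
        InsertDuplicatesOnly(s, v[i]);
    return s;
}
```

With this variant, a group is a duplicate group exactly when its deque is non-empty.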

Iterating through the groups is then (relatively) simple and cheap:


void Iterate(const Storage& s)
{
    for (Storage::const_iterator i = s.begin(); i != s.end(); ++i)
    {
        for (std::deque<Type>::const_iterator j = i->second.begin(); j != i->second.end(); ++j)
        {
            // do something with *j
        }
    }
}

I performed some experiments for comparison and object counts. In a test with 100000 objects in random order forming 50000 groups (i.e. an average of 2 objects per group) the above method cost the following number of comparisons and copies:


1630674 comparisons, 443290 copies

(I tried bringing the number of copies down, but only really managed to at the expense of comparisons which seem to be the higher cost operation in your scenario.)


Using a multimap, and retaining the previous element during the final iteration to detect the group transitions, costs this:


1756208 comparisons, 100000 copies

Using a single list, popping the front element, and performing a linear search for other group members costs:


1885879088 comparisons, 100000 copies

Yes, that's ~1.9b comparisons compared to ~1.6m for my best method. To get the list method to perform anywhere near an optimal number of comparisons it would have to be sorted and this is going to cost a similar number of comparisons as building an inherently ordered container would in the first place.


Edit


I took your posted code and ran the implied algorithm (I had to make some assumptions about the code, as there are some assumed definitions) over the same test data set as I used before, and I counted:


1885879088 comparisons, 420939 copies

i.e. exactly the same number of comparisons as my dumb list algorithm, but with more copies. I think that means we are using essentially similar algorithms for this case. I can't see any evidence of an alternative sort order, but it looks like you want a list of the groups which contain more than one equivalent element. This can be achieved simply in my Iterate function by adding an if (i->second.size() > 1) clause.


I still can't see any evidence that building a sorted container such as this map of deques isn't a good (even if not optimal) strategy.


Answer by derobert

Yes, you can do much better.


  1. Sort them (O(n) for simple integers, O(n*log n) in general), then duplicates are guaranteed to be adjacent, making finding them quick O(n)

  2. Use a hash table, also O(n). For each item, (a) check if it's already in the hash table; if so, it's a duplicate; if not, put it in the hash table.

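A minimal sketch of the hash-table approach (method 2 above), assuming the elements are hashable; int stands in for the element type, and hash_groups is a hypothetical helper name:

```cpp
#include <cassert>
#include <unordered_map>
#include <vector>

// Each bucket collects every occurrence of a value in expected O(1) time;
// buckets of size > 1 are the duplicate groups.
std::vector<std::vector<int> > hash_groups(const std::vector<int>& input)
{
    std::unordered_map<int, std::vector<int> > buckets;
    for (std::size_t i = 0; i < input.size(); ++i)
        buckets[input[i]].push_back(input[i]);

    std::vector<std::vector<int> > groups;
    for (auto it = buckets.begin(); it != buckets.end(); ++it)
        if (it->second.size() > 1)     // keep only real duplicate groups
            groups.push_back(it->second);
    return groups;
}
```

Note that this sidesteps the asker's constraint (element-by-element comparison only); it applies when a usable hash exists.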

edit


The method you're using seems to do O(N^2) comparisons:


for i = 0; i < length; ++i           // will do length times
    for j = i+1; j < length; ++j     // will do length-i times
        compare

So for length 5 you do 4+3+2+1=10 compares; for 6 you do 15, etc. (N^2)/2 - N/2 to be exact. N*log(N) is smaller, for any reasonably high value of N.


How big is N in your case?


As far as reducing hash collisions goes, the best way is to get a better hash function :-D. Assuming that is not possible, if you can make a variant (e.g., a different modulus), you may be able to do a nested hash.


Answer by Nathan

1. Sort the array O(n log n) at worst - mergesort/heapsort/binary tree sort etc


2. Compare neighbors and pull the matches out O(n)

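These two steps might be sketched as follows, with int standing in for the element type and sorted_groups as a hypothetical helper:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Sort, then scan once: after sorting, equal elements are adjacent,
// so each run of length > 1 is pulled out as one duplicate group.
std::vector<std::vector<int> > sorted_groups(std::vector<int> v)
{
    std::sort(v.begin(), v.end());                // O(n log n)
    std::vector<std::vector<int> > groups;
    std::size_t i = 0;
    while (i < v.size())
    {
        std::size_t j = i + 1;
        while (j < v.size() && v[j] == v[i]) ++j; // extend the current run
        if (j - i > 1)                            // run of duplicates
            groups.push_back(std::vector<int>(v.begin() + i, v.begin() + j));
        i = j;
    }
    return groups;
}
```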

Answer by Alex Martelli

Keep a hash-table based structure from value to count; if your C++ implementation doesn't offer std::hash_map (not really part of the C++ standard so far!-), use Boost or grab a version off the web. One pass over the collection (i.e., O(N)) lets you do a value->count mapping; one more pass over the hash table (<= O(N), clearly) to identify values with a count > 1 and emit them appropriately. Overall O(N), which is not the case for your suggestion.

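A sketch of this two-pass idea using std::unordered_map (the standardized successor to hash_map); duplicated_values is a hypothetical helper name:

```cpp
#include <cassert>
#include <unordered_map>
#include <vector>

// First pass builds value -> count; second pass over the table
// emits the values seen more than once.
std::vector<int> duplicated_values(const std::vector<int>& input)
{
    std::unordered_map<int, int> counts;
    for (std::size_t i = 0; i < input.size(); ++i)
        ++counts[input[i]];                      // first pass: O(N)

    std::vector<int> dups;
    for (auto it = counts.begin(); it != counts.end(); ++it)
        if (it->second > 1)                      // second pass: <= O(N)
            dups.push_back(it->first);
    return dups;
}
```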

Answer by Martin v. Löwis

If it is known to be a list of integers, and if it is known that they are all between A and B (e.g. A=0, B=9), make an array of B-A elements, and create B-A containers.


In the very specific case (list of plain integers), I propose that you merely count them, since you cannot distinguish different integers, anyway:


for(int i = 0; i < countOfNumbers; i++)
    counter[number[i]]++;

If they are distinguishable, create an array of lists, and add them to the respective list.


If they are not numbers, use a std::map or std::hash_map, mapping keys to lists of values.

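For the distinguishable case, the array-of-lists idea might look like this (bucket_groups is a hypothetical helper; B is the assumed exclusive upper bound on the values):

```cpp
#include <cassert>
#include <vector>

// One bucket per possible value in [0, B); buckets holding more than
// one entry are the duplicate groups. Insertion is O(1) per element.
std::vector<std::vector<int> > bucket_groups(const std::vector<int>& input, int B)
{
    std::vector<std::vector<int> > buckets(B);
    for (std::size_t i = 0; i < input.size(); ++i)
        buckets[input[i]].push_back(input[i]);

    std::vector<std::vector<int> > groups;
    for (int v = 0; v < B; ++v)
        if (buckets[v].size() > 1)
            groups.push_back(buckets[v]);
    return groups;
}
```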

Answer by hhafez

Have you tried sorting? For example using an algorithm like quick sort? If the performance is good enough for you then that would be an easy approach.


Answer by ttvd

Most people mentioning hash/unordered-map solutions are assuming O(1) insertion and query time; however, the worst case could be O(N). Additionally, these estimates ignore the cost of hashing the objects.


Personally, I would insert the objects into a binary tree (O(log n) insertion for each) and keep a counter at each node. This would yield O(n log n) construction time and an O(n) traversal to identify all duplicates.

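A sketch of this idea, using std::map (typically a red-black tree) as the binary tree with a counter per node; tree_counts is a hypothetical helper name:

```cpp
#include <cassert>
#include <map>
#include <utility>
#include <vector>

// O(log n) per insertion, O(n log n) construction, then an O(n)
// in-order traversal reports every value with a count above 1.
std::vector<std::pair<int, std::size_t> > tree_counts(const std::vector<int>& input)
{
    std::map<int, std::size_t> tree;
    for (std::size_t i = 0; i < input.size(); ++i)
        ++tree[input[i]];                        // counter kept at each node

    std::vector<std::pair<int, std::size_t> > dups;
    for (std::map<int, std::size_t>::iterator it = tree.begin();
         it != tree.end(); ++it)
        if (it->second > 1)
            dups.push_back(*it);
    return dups;
}
```

Note this relies on operator< rather than hashing, which matches the asker's compare-only constraint.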

Answer by lavinio

The simplest is probably to just sort the list, and then to iterate over it looking for dups.


If you know something about the data, more efficient algorithms are possible.


For example, if you knew the list was large, and only contained integers from 1..n where n is fairly small, you could use a pair of boolean arrays (or a bitmap), and do something like this:


// assumes values lie in [0, MAXIMUM_VALUE)
std::vector<bool> once(MAXIMUM_VALUE, false);
std::vector<bool> many(MAXIMUM_VALUE, false);
for (std::size_t i = 0; i < values.size(); ++i)
    if (once[values[i]])
        many[values[i]] = true;
    else
        once[values[i]] = true;

Now, many[] records which values were seen more than once.


Answer by jalf

If I've understood the question correctly, then this is the simplest solution I can think of:


std::vector<T> input;
typedef std::vector<T>::iterator iter;

std::vector<std::pair<iter, iter> > output;

sort(input.begin(), input.end()); // O(n lg n) comparisons

for (iter cur = input.begin(); cur != input.end();){
  std::pair<iter, iter> range = equal_range(cur, input.end(), *cur); // find all elements equal to the current one O(lg n) comparisons
  cur = range.second;
  output.push_back(range);
}

Total running time: O(n lg n). You have one sorting pass, O(n lg n), and then a second pass where O(lg n) comparisons are performed for each group (so this is done at most n times, also yielding O(n lg n)).

Note that this depends on the input being a vector. Only random access iterators have logarithmic complexity in the second pass. Bidirectional iterators would be linear.


This doesn't rely on hashing (as requested), and it preserves all the original elements (rather than just returning one element for each group, along with a count of how often it occurred).


Of course, a number of smaller constant-factor optimizations are possible. output.reserve(input.size()) on the output vector would be a good idea, to avoid reallocation. input.end() is also evaluated much more often than necessary, and could easily be cached.


Depending on how big the groups are assumed to be, equal_range may not be the most efficient choice. I assume it does a binary search to get logarithmic complexity, but if each group is only a couple of elements, a simple linear scan would have been faster. In any case, the initial sorting dominates the cost.


Answer by Alexandre Rademaker

Just to note that I had the same problem during normalization of a triple store that I am working with. I implemented method 3, summarized by Charles Bailey, in Common Lisp, using the hash-table feature of Allegro Common Lisp.


The function "agent-equal?" is used to test when two agents in the TS are the same. The function "merge-nodes" merges the nodes on each cluster. In the code below, the "..." was used to remove the not so important parts.


(defun agent-equal? (a b)
 (let ((aprops (car (get-triples-list :s a :p !foaf:name)))
       (bprops (car (get-triples-list :s b :p !foaf:name))))
   (upi= (object aprops) (object bprops))))

(defun process-rdf (out-path filename)
 (let* (...
        (table (make-hash-table :test 'agent-equal?)))
  (progn 
   ...
   (let ((agents (mapcar #'subject 
          (get-triples-list :o !foaf:Agent :limit nil))))
     (progn 
       (dolist (a agents)
          (if (gethash a table)
            (push a (gethash a table))
            (setf (gethash a table) (list a))))
       (maphash #'merge-nodes table)
           ...
           )))))