
Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverFlow. Original URL: http://stackoverflow.com/questions/497338/

Time: 2020-09-11 01:19:26  Source: igfitidea

Efficient list intersection algorithm

Tags: algorithm, list, set-intersection

Question

Given two lists (not necessarily sorted), what is the most efficient non-recursive algorithm to find the intersection of those lists?

Answer by Frank

You could put all elements of the first list into a hash set. Then, iterate the second one and, for each of its elements, check the hash to see if it exists in the first list. If so, output it as an element of the intersection.
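
The approach described above can be sketched in a few lines. This is only an illustration in Python, not code from the answer; the function name `intersect_hash` is made up for the example, and it assumes the list elements are hashable.

```python
def intersect_hash(first, second):
    """Return the elements of `second` that also appear in `first`."""
    seen = set(first)  # hash set holding every element of the first list
    result = []
    for item in second:
        # Average-case O(1) membership test against the hash set.
        if item in seen:
            result.append(item)
    return result


print(intersect_hash([3, 1, 4, 1, 5], [5, 9, 2, 6, 3]))
```

With average-case constant-time lookups this runs in roughly O(n + m) for lists of sizes n and m. Note that a value repeated in the second list is emitted once per occurrence; wrap the result in a set if that matters.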

Answer by Aneil Mallavarapu

You might want to take a look at Bloom filters. They are bit vectors that give a probabilistic answer whether an element is a member of a set. Set intersection can be implemented with a simple bitwise AND operation. If you have a large number of null intersections, the Bloom filter can help you eliminate those quickly. You'll still have to resort to one of the other algorithms mentioned here to compute the actual intersection, however. http://en.wikipedia.org/wiki/Bloom_filter
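
To make the idea concrete, here is a minimal Bloom filter sketch in Python. Everything in it is an assumption for illustration: the class name `BloomFilter`, the helper `may_intersect`, the filter size, and the use of SHA-256 to derive bit positions are not from the answer. Both filters must be built with the same size and hash count for the bitwise AND to mean anything.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter backed by a Python int used as a bit vector."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, item):
        # Derive num_hashes bit positions by salting the item with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        # False means "definitely absent"; True means "possibly present".
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


def may_intersect(a, b):
    """If the AND of the bit vectors is zero, the sets are surely disjoint."""
    return (a.bits & b.bits) != 0
```

A zero result from `may_intersect` proves the intersection is empty, because any shared element would set the same bits in both filters. A non-zero result only says an intersection is possible, so an exact algorithm still has to run afterwards.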

Answer by Tom Ritter

Without hashing, I suppose you have two options:

  • The naive way is to compare each element of one list to every element of the other: O(n^2).
  • Another way is to sort the lists first, then iterate over them: 2 * O(n lg n) + 2 * O(n), which is O(n lg n) overall.
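
Both options can be sketched briefly. The Python below is an illustration only; the names `intersect_naive` and `intersect_sorted` are invented for the example, and the elements are assumed to be comparable with `<`.

```python
def intersect_naive(first, second):
    """Compare every element of `first` with every element of `second`."""
    # `x in second` is a linear scan of a list, so this is O(n * m).
    return [x for x in first if x in second]


def intersect_sorted(first, second):
    """Sort copies of both lists, then walk them in lockstep."""
    a, b = sorted(first), sorted(second)  # O(n lg n) + O(m lg m)
    i = j = 0
    result = []
    while i < len(a) and j < len(b):  # O(n + m) merge-style scan
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1  # a[i] is too small to match anything left in b
        else:
            j += 1  # b[j] is too small to match anything left in a
    return result
```

The sorted version returns its result in ascending order, while the naive version keeps the order of the first list.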

Answer by zvrba

From the eviews features list, it seems that it supports complex merges and joins (if this is 'join' as in DB terminology, it will compute an intersection). Now dig through your documentation :-)

Additionally, eviews has its own user forum - why not ask there?

Answer by khaja

With set 1, build a balanced binary search tree, which takes O(n log n) for n elements. Then iterate over set 2 and search the BST for each of its m elements, which takes m × O(log n). The total is O(n log n) + O(m log n) = O((n + m) log n).
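
Python has no built-in balanced BST, so the sketch below swaps in a sorted array searched with `bisect_left` from the standard `bisect` module. That is a different data structure from the one the answer names, but it has the same costs for this use: O(n log n) to build and O(log n) per lookup. The function name `intersect_binary_search` is made up for the example.

```python
from bisect import bisect_left


def intersect_binary_search(set1, set2):
    """Look up each element of set2 in a sorted copy of set1."""
    index = sorted(set1)  # O(n log n); stands in for building the BST
    result = []
    for item in set2:  # m lookups at O(log n) each
        pos = bisect_left(index, item)
        if pos < len(index) and index[pos] == item:
            result.append(item)
    return result


print(intersect_binary_search([8, 3, 10, 1], [1, 2, 3, 11]))
```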

Answer by quasar

In C++, the following can be tried using an STL map:

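#include <cstddef>
#include <map>
#include <vector>

// Returns the elements of s2 that also occur in s1, in s2's order.
std::vector<int> set_intersection(const std::vector<int>& s1, const std::vector<int>& s2) {
    std::vector<int> ret;
    std::map<int, bool> store;
    // Mark every value of s1 as present.
    for (std::size_t i = 0; i < s1.size(); i++) {
        store[s1[i]] = true;
    }
    // Use find() rather than operator[], which would insert missing keys.
    for (std::size_t i = 0; i < s2.size(); i++) {
        if (store.find(s2[i]) != store.end()) ret.push_back(s2[i]);
    }
    return ret;
}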

Answer by Ayman Farhat

Here is another possible solution I came up with; it takes O(n log n) time and needs no extra storage. You can check it out here: https://gist.github.com/4455373

Here is how it works: Assuming that the sets do not contain any repetition, merge all the sets into one and sort it. Then loop through the merged set and, on each iteration, look at the window of n elements starting at the current index i, where n is the number of sets in the universe. What we look for as we loop is a run of n equal values.

If the value at the start of that window equals the value at its end, the element at i is repeated n times, which is equal to the total number of sets. Since there are no repetitions within any set, every set must contain that value, so we add it to the intersection. Then we advance the index past the rest of that window, because none of those indexes can start another repeating sequence.
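
A rough Python sketch of this merge-sort-and-scan idea follows. It is not the code from the linked gist; the name `intersect_many` is invented, and for simplicity it builds a new merged list and advances by one position on a mismatch. It relies on the stated assumption that no individual set contains duplicates.

```python
def intersect_many(sets):
    """Intersect several duplicate-free collections by sorting their union."""
    n = len(sets)
    if n == 0:
        return []
    merged = sorted(x for s in sets for x in s)  # O(N log N) for N total items
    result = []
    i = 0
    while i + n <= len(merged):
        # In sorted order, equal ends of a window of n items mean n equal values.
        if merged[i] == merged[i + n - 1]:
            result.append(merged[i])
            i += n  # skip the whole run; it cannot start another match
        else:
            i += 1
    return result


print(intersect_many([[1, 2, 3], [2, 3, 4], [3, 2, 9]]))
```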

Answer by Wolf Garbe

Using skip pointers and SSE instructions can improve list intersection efficiency.
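
As a rough illustration of the skip-pointer half of that suggestion (SSE instructions are not shown), the Python sketch below intersects two sorted, duplicate-free lists and jumps ahead by about sqrt(length) positions whenever the jump target is still not larger than the value being searched for. The function name `intersect_with_skips` and the choice to allow a jump from any position are simplifications assumed for this example.

```python
import math


def intersect_with_skips(a, b):
    """Intersect two sorted, duplicate-free lists using fixed-length skips."""
    skip_a = max(1, math.isqrt(len(a)))
    skip_b = max(1, math.isqrt(len(b)))
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            # Jump only if the target does not overshoot b[j].
            if i + skip_a < len(a) and a[i + skip_a] <= b[j]:
                i += skip_a
            else:
                i += 1
        else:
            if j + skip_b < len(b) and b[j + skip_b] <= a[i]:
                j += skip_b
            else:
                j += 1
    return result
```

The skips pay off when one list is much longer than the other, since long stretches of the longer list are passed over without comparing each element.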

Answer by Wookai

First, sort both lists using quicksort: O(n*log(n)). Then, compare the lists by browsing the lowest values first, and add the common values. For example, in Lua:

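function findIntersection(l1, l2)
    -- Both lists must already be sorted in ascending order.
    local i, j = 1, 1
    local intersect = {}

    while i <= #l1 and j <= #l2 do
        if l1[i] == l2[j] then
            -- Record the common value before advancing both indices.
            table.insert(intersect, l1[i])
            i, j = i + 1, j + 1
        elseif l1[i] > l2[j] then
            -- Advance in the list whose current value is smaller.
            j = j + 1
        else
            i = i + 1
        end
    end

    return intersect
end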

which is O(max(n, m)), where n and m are the sizes of the lists.

EDIT: quicksort is recursive, as said in the comments, but it looks like there are non-recursive implementations.

Answer by Imran

Why not implement your own simple hash table or hash set? Avoiding an O(n log n) intersection is worth it if your lists are as large as you say.

Since you know a bit about your data beforehand, you should be able to choose a good hash function.
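
A hand-rolled hash set along these lines could look like the Python sketch below. It is only an illustration: the class `SimpleHashSet`, the function `intersect_custom_hash`, the separate-chaining layout, and the bucket count are all assumptions, and `hash_fn` is the place where knowledge about the data would be plugged in.

```python
class SimpleHashSet:
    """Fixed-size hash set with separate chaining (illustrative only)."""

    def __init__(self, num_buckets=1024, hash_fn=hash):
        self.buckets = [[] for _ in range(num_buckets)]
        self.hash_fn = hash_fn

    def _bucket(self, item):
        # Python's % yields a non-negative index for a positive modulus.
        return self.buckets[self.hash_fn(item) % len(self.buckets)]

    def add(self, item):
        bucket = self._bucket(item)
        if item not in bucket:
            bucket.append(item)

    def __contains__(self, item):
        return item in self._bucket(item)


def intersect_custom_hash(first, second, hash_fn=hash):
    """Return the elements of `second` found in a hash set built from `first`."""
    table = SimpleHashSet(num_buckets=max(1, 2 * len(first)), hash_fn=hash_fn)
    for item in first:
        table.add(item)
    return [item for item in second if item in table]
```

For example, if the values were known to be multiples of 10, passing `hash_fn=lambda x: x // 10` would spread them evenly across the buckets.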
