Note: this page reproduces a popular StackOverflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/19203868/

Faster algorithm to find unique element between two arrays?

Tags: java, arrays, algorithm

Asked by William Gaul

EDIT: For anyone new to this question, I have posted an answer clarifying what was going on. The accepted answer is the one I feel best answers my question as originally posted, but for further details please see my answer.

NOTE: This problem was originally pseudocode and used lists. I have adapted it to Java and arrays. So while I'd love to see any solutions that use Java-specific tricks (or tricks in any language for that matter!), just remember that the original problem is language-independent.

The Problem

Let's say that there are two unsorted integer arrays a and b, with element repetition allowed. They are identical (with respect to contained elements) except one of the arrays has an extra element. As an example:

int[] a = {6, 5, 6, 3, 4, 2};
int[] b = {5, 7, 6, 6, 2, 3, 4};

Design an algorithm that takes as input these two arrays and outputs the single unique integer (in the above case, 7).

The Solution (So Far)

I came up with this:

public static int getUniqueElement(int[] a, int[] b) {
    int ret = 0;
    for (int i = 0; i < a.length; i++) {
        ret ^= a[i];
    }
    for (int i = 0; i < b.length; i++) {
        ret ^= b[i];
    }
    return ret;
}

The "official" solution presented in class:

public static int getUniqueElement(int[] a, int[] b) {
    int ret = 0;
    for (int i = 0; i < a.length; i++) {
        ret += a[i];
    }
    for (int i = 0; i < b.length; i++) {
        ret -= b[i];
    }
    return Math.abs(ret);
}

So, both are conceptually doing the same thing. And given that a is of length m and b is of length n, both solutions have a running time of O(m + n).
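One caveat with the sum-based version: Math.abs discards the sign, so it would report 7 when the extra element is actually -7. A minimal sketch of a sign-aware variant follows, assuming exactly one array is longer by one element. The class name SumDifference is only illustrative and not part of the original post.

```java
final class SumDifference {
    // Returns the extra element, whichever array holds it.
    // int overflow wraps around, but the final difference is still exact
    // because the true answer always fits in an int.
    static int getUniqueElement(int[] a, int[] b) {
        int diff = 0;
        for (int x : a) diff += x;
        for (int x : b) diff -= x;
        // diff is sum(a) - sum(b): it equals the extra element if a is the
        // longer array, and its negation otherwise.
        return a.length > b.length ? diff : -diff;
    }

    public static void main(String[] args) {
        int[] a = {6, 5, 6, 3, 4, 2};
        int[] b = {5, 7, 6, 6, 2, 3, 4};
        System.out.println(getUniqueElement(a, b));                           // 7
        System.out.println(getUniqueElement(new int[]{1}, new int[]{1, -7})); // -7
    }
}
```

The XOR version does not have this sign problem, since XOR never needs to know which array is longer.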

The Question

I later got to talking with my teacher and he hinted that there was an even faster way of doing it. Honestly I don't see how; to find out whether an element is unique it seems you'd have to at least look at every element. And that's at least O(m + n)...right?

So is there a faster way? And if so, what is it?

Accepted answer by Shashank

This is probably the fastest you can do it in Java, using HotLick's suggestion in the comments. It assumes that b.length == a.length + 1, so b is the larger array with the extra "unique" element.

public static int getUniqueElement(int[] a, int[] b) {
    int ret = 0;
    int i;
    for (i = 0; i < a.length; i++) {
        ret = ret ^ a[i] ^ b[i];
    }
    return ret ^ b[i];
}

Even if the assumption cannot be made, you can easily expand it to include the case where either a or b can be the larger array with the unique element. It's still O(m+n) though and only loop/assignment overhead is reduced.
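A possible way to write that extension, as a sketch only: pick the shorter and longer array up front, then run the same single loop. It still assumes the lengths differ by exactly one, and the class name PairwiseXor is made up for illustration.

```java
final class PairwiseXor {
    // XORs both arrays in one loop over the shorter length, then folds in
    // the single leftover element of the longer array.
    // Assumes the two lengths differ by exactly one.
    static int getUniqueElement(int[] a, int[] b) {
        int[] shorter = a.length < b.length ? a : b;
        int[] longer = a.length < b.length ? b : a;
        int ret = 0;
        int i;
        for (i = 0; i < shorter.length; i++) {
            ret ^= shorter[i] ^ longer[i];
        }
        // Here i == shorter.length, which is the last index of longer.
        return ret ^ longer[i];
    }
}
```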

Edit:

Due to details of language implementation, this is still (surprisingly) the fastest way to do it in CPython.

def getUniqueElement1(A, B):
    ret = 0
    for a in A: ret = ret ^ a
    for b in B: ret = ret ^ b
    return ret

I have tested this with the timeit module and found some interesting results. It turns out that the longhand ret = ret ^ a is indeed faster in Python than the shorthand ret ^= a. Also, iterating over the elements of a list is much, much faster than iterating over the indexes and then making subscript operations in Python. That is why this code is much faster than my previous method, where I tried to copy Java.

I guess the moral of the story is that there is no correct answer because the question is bogus anyways. As the OP noted in another answer below, it turns out you can't really go any faster than O(m+n) on this and his teacher was just pulling his leg. Thus the problem reduces to finding the fastest way to iterate over all elements in the two arrays and accumulating the XOR of all of them. And this means it's entirely dependent on language implementation, and you have to do some testing and playing around to get the true "fastest" solution in whatever implementation you are using, because the overall algorithm will not change.

Answer by Peter Lawrey

You can store the count of each value in a collection such as an array or hash map, which takes O(n). Then you can check the values of the other collection and stop as soon as you know you have a mismatch. This could mean you only search half of the second array on average.
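A rough sketch of this counting idea in Java, assuming exactly one array is longer by one element. The class name CountingLookup and the choice of HashMap are illustrative, not taken from the answer.

```java
import java.util.HashMap;
import java.util.Map;

final class CountingLookup {
    // Assumes exactly one array is longer by one element.
    static int getUniqueElement(int[] a, int[] b) {
        int[] shorter = a.length < b.length ? a : b;
        int[] longer = a.length < b.length ? b : a;

        // Count how often each value occurs in the shorter array: O(n).
        Map<Integer, Integer> counts = new HashMap<>();
        for (int x : shorter) {
            counts.merge(x, 1, Integer::sum);
        }

        // Consume counts while scanning the longer array; the first value
        // with no remaining count is the extra element.
        for (int x : longer) {
            Integer remaining = counts.get(x);
            if (remaining == null || remaining == 0) {
                return x; // early exit: the rest of the array is never read
            }
            counts.put(x, remaining - 1);
        }
        throw new IllegalArgumentException("no extra element found");
    }
}
```

The early exit only helps the second scan. The first array is still read in full, and in the worst case so is the second, so the bound stays linear and the map costs extra memory.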

Answer by A. I. Breveleri

This is a little bit faster:

public static int getUniqueElement(int[] a, int[] b) {
    int ret = 0;
    int i;
    for (i = 0; i < a.length; i++) {
        ret += (a[i] - b[i]);
    }
    return Math.abs(ret - b[i]);
}

It's O(m), but the order doesn't tell the whole story. The loop part of the "official" solution has about 3 * m + 3 * n operations, and the slightly faster solution has 4 * m.

(Counting the loop "i++" and "i < a.length" as one operation each).

-Al.

Answer by Edwin Buck

Assuming only one element was added, and the arrays were identical to start with, you could hit O(log(base 2) n).

The rationale is that any array can be binary-searched in O(log n). Except that in this case you are not searching for a value in an ordered array; you are searching for the first non-matching element. In such a circumstance a[n] == b[n] means that you are too low, and a[n] != b[n] means that you might be too high, unless a[n-1] == b[n-1].

The rest is basic binary search. Check the middle element, decide which division must have the answer, and do a sub-search on that division.
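A sketch of that search, under assumptions that go beyond the original problem: the longer array is the shorter one with a single element inserted, the order is otherwise unchanged, and the shorter array has no two equal neighbouring elements (otherwise the match/mismatch pattern is not monotone and the search can go wrong). The name FirstMismatchSearch is hypothetical.

```java
final class FirstMismatchSearch {
    // Assumptions (not guaranteed by the original problem): `longer` equals
    // `shorter` with one element inserted, order otherwise preserved, and
    // `shorter` has no two equal neighbouring elements.
    static int getUniqueElement(int[] shorter, int[] longer) {
        int lo = 0;
        int hi = shorter.length; // hi == shorter.length means "inserted at the end"
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (shorter[mid] == longer[mid]) {
                lo = mid + 1; // still in the matching prefix: too low
            } else {
                hi = mid;     // at or past the first mismatch
            }
        }
        return longer[lo];
    }
}
```

For the unsorted arrays in the question these assumptions do not hold, so this only illustrates the special case the answer describes.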

Answer by Neeraj

I think this is similar to the matching nuts and bolts problem.

You could possibly achieve this in O(n log n). Not sure if that's smaller than O(n+m) in this case.

Answer by Ken Kin

Let's say that there are two unsorted integer arrays a and b, with element repetition allowed. They are identical (with respect to contained elements) except one of the arrays has an extra element.

You may note that I emphasised two points in your original question, and I'm adding an extra assumption that the values are non-zero.

In C#, you can do this:

int[, , , , ,] a=new int[6, 5, 6, 3, 4, 2];
int[, , , , , ,] b=new int[5, 7, 6, 6, 2, 3, 4];
Console.WriteLine(b.Length/a.Length);

See? Whatever the extra element is, you will always know it by simply dividing their lengths.

With these statements, we are not storing the given series of integers as values to arrays, but as their dimensions.

Whatever the shorter series of integers is, the longer one has exactly one extra integer. So regardless of the order of the integers, the two multi-dimensional arrays would have identical total size without the extra one. The size of the longer array is the size of the shorter array times the extra dimension, so dividing the longer size by the shorter size gives the extra integer.

This solution works only for this particular case, as I quoted from your question. You might want to port it to Java.
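Java has no counterpart to C#'s rectangular arrays with a total Length, so a literal port is not possible. As a rough sketch of the same arithmetic, the products can be computed explicitly, which makes the hidden linear cost visible. This assumes every value is non-zero and uses BigInteger to sidestep overflow; the class name ProductRatio is invented for illustration.

```java
import java.math.BigInteger;

final class ProductRatio {
    // Assumes every value is non-zero and exactly one series has one extra value.
    static int getUniqueElement(int[] a, int[] b) {
        BigInteger pa = product(a);
        BigInteger pb = product(b);
        // Dividing the larger series' product by the smaller one's leaves the extra value.
        BigInteger ratio = a.length > b.length ? pa.divide(pb) : pb.divide(pa);
        return ratio.intValueExact();
    }

    // Multiplies all values together; this loop still touches every element.
    private static BigInteger product(int[] values) {
        BigInteger p = BigInteger.ONE;
        for (int v : values) {
            p = p.multiply(BigInteger.valueOf(v));
        }
        return p;
    }
}
```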

This is just a trick, as I thought the question itself is a trick. We definitely will not consider it as a solution for production.

Answer by Hans Hohenfeld

There simply is no faster algorithm. The ones presented in the question are in O(n). Any arithmetic "trick" to solve this will require at least each element of both arrays to be read once, so we stay in O(n) (or worse).

Any search strategy that is in a proper subset of O(n) (like O(log n)) will require sorted arrays or some other prebuilt sorted structure (binary tree, hash). All sorting algorithms known to mankind are at least O(n*log n) (Quicksort, Hashsort) on average, which is worse than O(n).

Therefore, from a mathematical point of view, there is no faster algorithm. There might be some code optimizations, but they won't matter on a large scale, as runtime will grow linearly with the length of the array(s).

Answer by William Gaul

Alright here we go...apologies to anyone expecting a faster solution. It turns out my teacher was having a little fun with me and I completely missed the point of what he was saying.

I should begin by clarifying what I meant by:

he hinted that there was an even faster way of doing it

The gist of our conversation was this: he said that my XOR approach was interesting, and we talked for a while about how I arrived at my solution. He asked me whether I thought my solution was optimal. I said I did (for the reasons I mentioned in my question). Then he asked me, "Are you sure?" with a look on his face I can only describe as "smug". I was hesitant but said yeah. He asked me if I could think of a better way to do it. I was pretty much like, "You mean there's a faster way?" but instead of giving me a straight answer he told me to think about it. I said I would.

So I thought about it, sure that my teacher knew something I didn't. And after not coming up with anything for a day, I came here.

What my teacher actually wanted me to do was defend my solution as being optimal, not try to find a better solution. As he put it: creating a nice algorithm is the easy part, the hard part is proving it works (and that it's the best). He thought it was quite funny that I spent so much time in Find-A-Better-Way Land instead of working out a simple proof of O(n) that would have taken considerably less time (we ended up doing so, see below if you're interested).

So I guess, big lesson learned here. I'll be accepting Shashank Gupta's answer because I think that it does manage to answer the original question, even though the question was flawed.

I'll leave you guys with a neat little Python one-liner I found while typing the proof. It's not any more efficient but I like it:

from functools import reduce  # required on Python 3; reduce is a builtin on Python 2

def getUniqueElement(a, b):
    return reduce(lambda x, y: x^y, a + b)
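For comparison, roughly the same reduction can be written with Java streams. This is only a sketch of an equivalent, not something from the original answer, and the class name StreamXor is made up. Like the Python version, it is not expected to be any faster than the plain loops.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

final class StreamXor {
    // Concatenates both arrays as a stream and folds them with XOR,
    // mirroring reduce(lambda x, y: x ^ y, a + b).
    static int getUniqueElement(int[] a, int[] b) {
        return IntStream.concat(Arrays.stream(a), Arrays.stream(b))
                        .reduce(0, (x, y) -> x ^ y);
    }
}
```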

A Very Informal "Proof"

Let's start with the original two arrays from the question, a and b:

int[] a = {6, 5, 6, 3, 4, 2};
int[] b = {5, 7, 6, 6, 2, 3, 4};

We'll say here that the shorter array has length n, then the longer array must have length n + 1. The first step to proving linear complexity is to append the arrays together into a third array (we'll call it c):

int[] c = {6, 5, 6, 3, 4, 2, 5, 7, 6, 6, 2, 3, 4};

which has length 2n + 1. Why do this? Well, now we have another problem entirely: finding the element that occurs an odd number of times in c (from here on "odd number of times" and "unique" are taken to mean the same thing). This is actually a pretty popular interview question and is apparently where my teacher got the idea for his problem, so now my question has some practical significance. Hooray!
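To make the "odd number of times" reading concrete, here is a small sketch that tracks parity with a set instead of XOR. It assumes exactly one value in c occurs an odd number of times; the names OddOccurrence and findOddOccurrence are illustrative only.

```java
import java.util.HashSet;
import java.util.Set;

final class OddOccurrence {
    // Returns the value that occurs an odd number of times in c,
    // assuming exactly one such value exists.
    static int findOddOccurrence(int[] c) {
        Set<Integer> oddSoFar = new HashSet<>();
        for (int x : c) {
            // add() returns false if x was already present; in that case
            // remove it, so the set always holds the values seen an odd
            // number of times so far.
            if (!oddSoFar.add(x)) {
                oddSoFar.remove(x);
            }
        }
        if (oddSoFar.size() != 1) {
            throw new IllegalArgumentException("expected exactly one odd-count value");
        }
        return oddSoFar.iterator().next();
    }
}
```

XOR does the same parity bookkeeping bit by bit in a single int, which is why the XOR solutions need no extra memory.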

Let's assume there is an algorithm faster than O(n), such as O(log n). What this means is that it will only access some of the elements of c. For example, an O(log n) algorithm might only have to check log(13) ~ 4 of the elements in our example array to determine the unique element. Our question is, is this possible?

First let's see if we can get away with removing any of the elements (by "removing" I mean not having to access it). How about if we remove 2 elements, so that our algorithm only checks a subarray of c with length 2n - 1? This is still linear complexity, but if we can do that then maybe we can improve upon it even further.

So, let's choose two elements of c completely at random to remove. There are actually several things that could happen here, which I'll summarize into cases:

// Case 1: Remove two identical elements
{6, 5, 6, 3, 4, 2, 5, 7, 2, 3, 4};

// Case 2: Remove the unique element and one other element
{6, 6, 3, 4, 2, 5, 6, 6, 2, 3, 4};

// Case 3: Remove two different elements, neither of which are unique
{6, 5, 6, 4, 2, 5, 7, 6, 6, 3, 4};

What does our array now look like? In the first case, 7 is still the unique element. In the second case there is a new unique element, 5. And in the third case there are now 3 unique elements...yeah it's a total mess there.

Now our question becomes: can we determine the unique element of c just by looking at this subarray? In the first case we see that 7 is the unique element of the subarray, but we can't be sure it is also the unique element of c; the two removed elements could have just as well been 7 and 1. A similar argument applies for the second case. In case 3, with 3 unique elements we have no way of telling which two are non-unique in c.

It becomes clear that even with 2n - 1 accesses, there is just not enough information to solve the problem. And so the optimal solution is a linear one.
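The Case 1 ambiguity can be checked directly. The sketch below takes the 11 elements the algorithm is allowed to see and completes them in two different ways; both completions are consistent with what was seen, yet they have different unique elements. The helper names xorAll and withTwoMore are made up for this illustration.

```java
import java.util.Arrays;

final class HiddenElementsDemo {
    // XOR of every value; equals the single odd-count element when one exists.
    static int xorAll(int[] values) {
        int ret = 0;
        for (int v : values) ret ^= v;
        return ret;
    }

    // Returns a copy of `seen` with two more values appended.
    static int[] withTwoMore(int[] seen, int x, int y) {
        int[] full = Arrays.copyOf(seen, seen.length + 2);
        full[seen.length] = x;
        full[seen.length + 1] = y;
        return full;
    }

    public static void main(String[] args) {
        // Case 1 subarray: the part of c that the algorithm actually reads.
        int[] seen = {6, 5, 6, 3, 4, 2, 5, 7, 2, 3, 4};
        System.out.println(xorAll(withTwoMore(seen, 6, 6))); // 7
        System.out.println(xorAll(withTwoMore(seen, 7, 1))); // 1
    }
}
```

Since the algorithm never reads the two hidden elements, it must give the same answer in both situations, so it is wrong in at least one of them.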

Of course, a real proof would use induction and not use proof-by-example, but I'll leave that to someone else :)

Answer by Yves Daoust

Caution, it is wrong to use the O(n + m) notation. There is but one size parameter which is n (in the asymptotic sense, n and n+1 are equal). You should just say O(n). [For m > n+1, the problem is different and more challenging.]

As pointed out by others, this is optimal, as you must read all values.

All you can do is reduce the asymptotic constant. There is little room for improvement, as the obvious solutions are already very efficient. The single loop in (10) is probably hard to beat. Unrolling it a bit should improve (slightly) by avoiding a branch.
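As an illustration of the unrolling idea, here is a sketch of a four-way unrolled version of the single-loop XOR. It assumes b.length == a.length + 1, as the accepted answer does, and the class name UnrolledXor is hypothetical. A JIT compiler may already unroll the simple loop on its own, so any gain would need to be measured.

```java
final class UnrolledXor {
    // Assumes b.length == a.length + 1, as in the accepted answer.
    static int getUniqueElement(int[] a, int[] b) {
        int n = a.length;
        int ret = 0;
        int i = 0;
        // Main loop: four pairs per iteration, so one loop test per four pairs.
        for (; i + 3 < n; i += 4) {
            ret ^= a[i] ^ b[i]
                 ^ a[i + 1] ^ b[i + 1]
                 ^ a[i + 2] ^ b[i + 2]
                 ^ a[i + 3] ^ b[i + 3];
        }
        // Tail: the remaining zero to three pairs.
        for (; i < n; i++) {
            ret ^= a[i] ^ b[i];
        }
        // The last element of b has no partner in a.
        return ret ^ b[n];
    }
}
```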

If your goal is sheer performance, then you should turn to non-portable solutions such as vectorization (using the AVX instructions, 8 ints at a time) and parallelization on multicores or GPGPU. In good old dirty C on a 64-bit processor, you could map the data to an array of 64-bit ints and xor the elements two pairs at a time ;)
