
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must follow the same CC BY-SA license and attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/7691971/

Date: 2020-10-22 03:32:58  Source: igfitidea

Easiest way to decide if List contains duplicates?

scala

Asked by Jus12

One way is this


list.distinct.size != list.size

Is there any better way? It would have been nice to have a containsDuplicates method


Accepted answer by timday

Assuming "better" means "faster", see the alternative approaches benchmarked in this question, which seems to show some quicker methods (although note that distinct uses a HashSet and is already O(n)). YMMV of course, depending on the specific test case, Scala version etc. Probably any significant improvement over the "distinct.size" approach would come from an early out as soon as a duplicate is found, but how much of a speed-up is actually obtained would depend strongly on how common duplicates actually are in your use-case.


If you mean "better" in that you want to write list.containsDuplicates instead of containsDuplicates(list), use an implicit:


implicit def enhanceWithContainsDuplicates[T](s:List[T]) = new {
  def containsDuplicates = (s.distinct.size != s.size)
}

assert(List(1,2,2,3).containsDuplicates)
assert(!List("a","b","c").containsDuplicates)
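On Scala 2.10 and later, an implicit class is the usual way to write this enrichment; unlike the structural type above, the method call is resolved statically rather than through runtime reflection. A minimal sketch (the class name RichList is illustrative):

```scala
// Implicit class version: same containsDuplicates semantics as above,
// but without the reflective call that a structural type incurs.
implicit class RichList[T](s: List[T]) {
  def containsDuplicates: Boolean = s.distinct.size != s.size
}

assert(List(1, 2, 2, 3).containsDuplicates)
assert(!List("a", "b", "c").containsDuplicates)
```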

Answer by paradigmatic

You can also write:


list.toSet.size != list.size

But the result will be the same because distinct is already implemented with a Set. In both cases the time complexity should be O(n): you must traverse the list, and Set insertion is O(1).

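If the early out mentioned in the accepted answer is wanted without hand-writing the recursion, a mutable HashSet combined with forall stops at the first duplicate while staying O(n). A sketch (hasDuplicates is an illustrative name):

```scala
import scala.collection.mutable

// Stops at the first duplicate: Set.add returns false when the element
// was already present, and forall short-circuits on the first false.
def hasDuplicates[A](xs: List[A]): Boolean = {
  val seen = mutable.HashSet.empty[A]
  !xs.forall(seen.add)
}

assert(hasDuplicates(List(1, 1, 2, 3)))
assert(!hasDuplicates(List(1, 2, 3)))
```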

Answer by huynhjl

I think this would stop as soon as a duplicate was found and is probably more efficient than doing distinct.size - since I assume distinct keeps a set as well:


@annotation.tailrec
def containsDups[A](list: List[A], seen: Set[A] = Set[A]()): Boolean =
  list match {
    case x :: xs => if (seen.contains(x)) true else containsDups(xs, seen + x)
    case _ => false
  }

containsDups(List(1,1,2,3))
// Boolean = true

containsDups(List(1,2,3))
// Boolean = false

I realize you asked for easy, and I don't know that this version is, but finding a duplicate is also finding whether there is an element that has been seen before:


def containsDups[A](list: List[A]): Boolean =  {
  list.iterator.scanLeft(Set[A]())((set, a) => set + a) // incremental sets
    .zip(list.iterator)
    .exists{ case (set, a) => set contains a }
}
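A note on why the zip lines up correctly: scanLeft emits its initial (empty) set before consuming anything, so element i is paired with the set of the elements strictly before it. A self-contained restatement of the same idea (under a different name, containsDupsScan, to keep it standalone):

```scala
// Pair each element with the set of everything seen before it, and
// check whether any element is already in its "seen so far" set.
def containsDupsScan[A](list: List[A]): Boolean =
  list.iterator
    .scanLeft(Set.empty[A])(_ + _) // Set(), Set(x0), Set(x0, x1), ...
    .zip(list.iterator)
    .exists { case (seen, a) => seen.contains(a) }

assert(containsDupsScan(List(1, 1, 2, 3)))
assert(!containsDupsScan(List(1, 2, 3)))
```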

Answer by user unknown

@annotation.tailrec 
def containsDuplicates [T] (s: Seq[T]) : Boolean = 
  if (s.size < 2) false else 
    s.tail.contains (s.head) || containsDuplicates (s.tail)

I didn't measure this, and think it is similar to huynhjl's solution, but a bit simpler to understand.


It returns early if a duplicate is found, so I looked into the source of Seq.contains to see whether it also returns early - it does.


In SeqLike, 'contains (e)' is defined as 'exists (_ == e)', and exists is defined in TraversableLike:


def exists (p: A => Boolean): Boolean = {
  var result = false
  breakable {
    for (x <- this)
      if (p (x)) { result = true; break }
  }
  result
}

I'm curious how to speed things up with parallel collections on multiple cores, but I guess early returns are a general problem there: another thread will keep running because it doesn't know that the solution has already been found.


Answer by chaotic3quilibrium

Summary: I've written a very efficient function which returns both List.distinct and a List consisting of each element which appeared more than once, along with the index at which the duplicate appeared.


Note: This answer is a straight copy of the answer on a related question.


Details: If you need a bit more information about the duplicates themselves, like I did, I have written a more general function which iterates across a List (as ordering was significant) exactly once and returns a Tuple2 consisting of the original List deduped (all duplicates after the first are removed; i.e. the same as invoking distinct) and a second List showing each duplicate and the Int index at which it occurred within the original List.


Here's the function:


def filterDupes[A](items: List[A]): (List[A], List[(A, Int)]) = {
  def recursive(remaining: List[A], index: Int, accumulator: (List[A], List[(A, Int)])): (List[A], List[(A, Int)]) =
    if (remaining.isEmpty)
      accumulator
    else
      recursive(
          remaining.tail
        , index + 1
        , if (accumulator._1.contains(remaining.head))
            (accumulator._1, (remaining.head, index) :: accumulator._2)
          else
            (remaining.head :: accumulator._1, accumulator._2)
      )
  val (distinct, dupes) = recursive(items, 0, (Nil, Nil))
  (distinct.reverse, dupes.reverse)
}

Below is an example which might make it a bit more intuitive. Given this List of String values:


val withDupes =
  List("a.b", "a.c", "b.a", "b.b", "a.c", "c.a", "a.c", "d.b", "a.b")

...and then performing the following:


val (deduped, dupeAndIndexs) =
  filterDupes(withDupes)

...the results are:


deduped: List[String] = List(a.b, a.c, b.a, b.b, c.a, d.b)
dupeAndIndexs: List[(String, Int)] = List((a.c,4), (a.c,6), (a.b,8))

And if you just want the duplicates, you simply map across dupeAndIndexs and invoke distinct:


val dupesOnly =
  dupeAndIndexs.map(_._1).distinct

...or all in a single call:


val dupesOnly =
  filterDupes(withDupes)._2.map(_._1).distinct

...or if a Set is preferred, skip distinct and invoke toSet...


val dupesOnly2 =
  dupeAndIndexs.map(_._1).toSet

...or all in a single call:


val dupesOnly2 =
  filterDupes(withDupes)._2.map(_._1).toSet

This is a straight copy of the filterDupes function out of my open source Scala library, ScalaOlio. It's located at org.scalaolio.collection.immutable.List_._.


Answer by Hosam Aly

If you're trying to check for duplicates in a test then ScalaTest can be helpful.


import org.scalatest.Inspectors._
import org.scalatest.Matchers._
forEvery(list.distinct) { item =>
  withClue(s"value $item, the number of occurrences") {
    list.count(_ == item) shouldBe 1
  }
}

// example:
scala> val list = List(1,2,3,4,3,2)
list: List[Int] = List(1, 2, 3, 4, 3, 2)

scala> forEvery(list) { item => withClue(s"value $item, the number of occurrences") { list.count(_ == item) shouldBe 1 } }
org.scalatest.exceptions.TestFailedException: forEvery failed, because:
  at index 1, value 2, the number of occurrences 2 was not equal to 1 (<console>:19),
  at index 2, value 3, the number of occurrences 2 was not equal to 1 (<console>:19)
in List(1, 2, 3, 4)
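Outside of a test suite, a plain groupBy can produce a similar report of the duplicated values and their counts without pulling in ScalaTest. A sketch (duplicateCounts is an illustrative name):

```scala
// Map each duplicated element to its number of occurrences; elements
// that appear exactly once are dropped by the collect filter.
def duplicateCounts[A](xs: List[A]): Map[A, Int] =
  xs.groupBy(identity).collect { case (k, vs) if vs.size > 1 => k -> vs.size }

assert(duplicateCounts(List(1, 2, 3, 4, 3, 2)) == Map(2 -> 2, 3 -> 2))
assert(duplicateCounts(List(1, 2, 3)).isEmpty)
```

Unlike the early-exit approaches above, this always traverses the whole list, but it tells you which values were duplicated and how often.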