获取 Scala Iterable 的前 n 个元素的最简单方法

Question

提问by Stefan Endrullis

Is there a simple and efficient solution to determine the top n elements of a Scala Iterable? I mean something like

是否有一个简单有效的解决方案来确定 Scala Iterable 的前 n 个元素？我的意思是像

iter.toList.sortBy(_.myAttr).take(2)

but without having to sort all elements when only the top 2 are of interest. Ideally I'm looking for something like

但是当只对前 2 个元素感兴趣时不必对所有元素进行排序。理想情况下，我正在寻找类似的东西

iter.top(2, _.myAttr)

see also: Solution for the top element using an Ordering: In Scala, how to use Ordering[T] with List.min or List.max and keep code readable

另请参阅：使用排序的顶部元素的解决方案：在 Scala 中，如何使用带有 List.min 或 List.max 的 Ordering[T] 并保持代码可读

Update:

更新：

Thank you all for your solutions. Finally, I took the original solution of user unknownand adopted it to use Iterableand the pimp-my-librarypattern:

谢谢大家的解决方案。最后，我采用了user unknown的原始解决方案并采用了它Iterable和pimp-my-library模式：

implicit def iterExt[A](iter: Iterable[A]) = new {
  def top[B](n: Int, f: A => B)(implicit ord: Ordering[B]): List[A] = {
    def updateSofar (sofar: List [A], el: A): List [A] = {
      //println (el + " - " + sofar)

      if (ord.compare(f(el), f(sofar.head)) > 0)
        (el :: sofar.tail).sortBy (f)
      else sofar
    }

    val (sofar, rest) = iter.splitAt(n)
    (sofar.toList.sortBy (f) /: rest) (updateSofar (_, _)).reverse
  }
}

case class A(s: String, i: Int)
val li = List (4, 3, 6, 7, 1, 2, 9, 5).map(i => A(i.toString(), i))
println(li.top(3, _.i))

Answer 1

采纳答案by user unknown

My solution (bound to Int, but should be easily changed to Ordered (a few minutes please):

我的解决方案（绑定到 Int，但应该很容易更改为 Ordered（请几分钟）：

def top (n: Int, li: List [Int]) : List[Int] = {

  def updateSofar (sofar: List [Int], el: Int) : List [Int] = {
    // println (el + " - " + sofar)
    if (el < sofar.head) 
      (el :: sofar.tail).sortWith (_ > _) 
    else sofar
  }

  /* better readable:
    val sofar = li.take (n).sortWith (_ > _)
    val rest = li.drop (n)
    (sofar /: rest) (updateSofar (_, _)) */    
  (li.take (n). sortWith (_ > _) /: li.drop (n)) (updateSofar (_, _)) 
}

usage:

用法：

val li = List (4, 3, 6, 7, 1, 2, 9, 5)    
top (2, li)

For above list, take the first 2 (4, 3) as starting TopTen (TopTwo).
Sort them, such that the first element is the bigger one (if any).
repeatedly iterate through the rest of the list (li.drop(n)), and compare the current element with the maximum of the list of minimums; replace, if neccessary, and resort again.
Improvements:
- Throw away Int, and use ordered.
- Throw away (_ > _) and use a user-Ordering to allow BottomTen. (Harder: pick the middle 10 :) )
- Throw away List, and use Iterable instead

对于上面的列表，取前 2 (4, 3) 个作为起始 TopTen (TopTwo)。
对它们进行排序，以便第一个元素是较大的元素（如果有）。
重复遍历列表的其余部分 (li.drop(n))，并将当前元素与最小值列表的最大值进行比较；如有必要，更换并再次使用。
改进：
- 扔掉Int，使用ordered。
- 扔掉 (_ > _) 并使用用户排序来允许BottomTen。（更难：选择中间的 10 个 :) ）
- 扔掉 List，改用 Iterable

update (abstraction):

更新（抽象）：

def extremeN [T](n: Int, li: List [T])
  (comp1: ((T, T) => Boolean), comp2: ((T, T) => Boolean)):
     List[T] = {

  def updateSofar (sofar: List [T], el: T) : List [T] =
    if (comp1 (el, sofar.head)) 
      (el :: sofar.tail).sortWith (comp2 (_, _)) 
    else sofar

  (li.take (n) .sortWith (comp2 (_, _)) /: li.drop (n)) (updateSofar (_, _)) 
}

/*  still bound to Int:  
def top (n: Int, li: List [Int]) : List[Int] = {
  extremeN (n, li) ((_ < _), (_ > _))
}
def bottom (n: Int, li: List [Int]) : List[Int] = {
  extremeN (n, li) ((_ > _), (_ < _))
}
*/

def top [T] (n: Int, li: List [T]) 
  (implicit ord: Ordering[T]): Iterable[T] = {
  extremeN (n, li) (ord.lt (_, _), ord.gt (_, _))
}
def bottom [T] (n: Int, li: List [T])
  (implicit ord: Ordering[T]): Iterable[T] = {
  extremeN (n, li) (ord.gt (_, _), ord.lt (_, _))
}

top (3, li)
bottom (3, li)
val sl = List ("Haus", "Garten", "Boot", "Sumpf", "X", "y", "xkcd", "x11")
bottom (2, sl)

To replace List with Iterable seems to be a bit harder.

用 Iterable 替换 List 似乎有点困难。

As Daniel C. Sobral pointed out in the comments, a high nin topN can lead to much sorting work, so that it could be useful, to do a manual insertion sort instead of repeatedly sorting the whole list of top-n elements:

正如 Daniel C. Sobral 在评论中指出的那样，ntopN 中的高会导致大量排序工作，因此执行手动插入排序而不是重复排序整个 top-n 元素列表可能很有用：

def extremeN [T](n: Int, li: List [T])
  (comp1: ((T, T) => Boolean), comp2: ((T, T) => Boolean)):
     List[T] = {

  def sortedIns (el: T, list: List[T]): List[T] = 
    if (list.isEmpty) List (el) else 
    if (comp2 (el, list.head)) el :: list else 
      list.head :: sortedIns (el, list.tail)

  def updateSofar (sofar: List [T], el: T) : List [T] =
    if (comp1 (el, sofar.head)) 
      sortedIns (el, sofar.tail)
    else sofar

  (li.take (n) .sortWith (comp2 (_, _)) /: li.drop (n)) (updateSofar (_, _)) 
}

top/bottom method and usage as above. For small groups of top/bottom Elements, the sorting is rarely called, a few times in the beginning, and then less and less often over time. For example, 70 times with top (10) of 10 000, and 90 times with top (10) of 100 000.

顶部/底部方法和用法同上。对于一小组顶部/底部元素，很少会调用排序，开始时会调用几次，然后随着时间的推移越来越少。例如，顶部 (10) 为 10 000 时为 70 次，顶部 (10) 为 100 000 时为 90 次。

Answer 2

回答by shellholic

Yet another version:

还有一个版本：

val big = (1 to 100000)

def maxes[A](n:Int)(l:Traversable[A])(implicit o:Ordering[A]) =
    l.foldLeft(collection.immutable.SortedSet.empty[A]) { (xs,y) =>
      if (xs.size < n) xs + y
      else {
        import o._
        val first = xs.firstKey
        if (first < y) xs - first + y
        else xs
      }
    }

println(maxes(4)(big))
println(maxes(2)(List("a","ab","c","z")))

Using the Setforce the list to have unique values:

使用Set强制列表具有唯一值：

def maxes2[A](n:Int)(l:Traversable[A])(implicit o:Ordering[A]) =
    l.foldLeft(List.empty[A]) { (xs,y) =>
      import o._
      if (xs.size < n) (y::xs).sort(lt _)
      else {
        val first = xs.head
        if (first < y) (y::(xs - first)).sort(lt _)
        else xs
      }
    }

Answer 3

回答by Tom Wang

Here's another solution that is simple and has pretty good performance.

这是另一个简单且性能非常好的解决方案。

def pickTopN[T](k: Int, iterable: Iterable[T])(implicit ord: Ordering[T]): Seq[T] {
  val q = collection.mutable.PriorityQueue[T](iterable.toSeq:_*)
  val end = Math.min(k, q.size)
  (1 to end).map(_ => q.dequeue())
}

The Big O is O(n + k log n), where k <= n. So the performance is linear for small kand at worst n log n.

大 O 是O(n + k log n)哪里k <= n。因此，对于 smallk和最坏的情况，性能是线性的n log n。

The solution can also be optimized to be O(k)for memory but O(n log k)for performance. The idea is to use a MinHeap to track only the top kitems at all times. Here's the solution.

该解决方案还可以针对O(k)内存O(n log k)和性能进行优化。这个想法是使用 MinHeap 始终只跟踪顶部的k项目。这是解决方案。

def pickTopN[A, B](n: Int, iterable: Iterable[A], f: A => B)(implicit ord: Ordering[B]): Seq[A] = {
  val seq = iterable.toSeq
  val q = collection.mutable.PriorityQueue[A](seq.take(n):_*)(ord.on(f).reverse) // initialize with first n

  // invariant: keep the top k scanned so far
  seq.drop(n).foreach(v => {
    q += v
    q.dequeue()
  })

  q.dequeueAll.reverse
}

Answer 4

回答by oxbow_lakes

You don't need to sort the entire collection in order to determine the top N elements. However, I don't believe that this functionality is supplied by the raw library, so you would have to roll you own, possibly using the pimp-my-librarypattern.

您不需要对整个集合进行排序来确定前 N 个元素。但是，我不相信这个功能是由原始库提供的，所以你必须自己动手，可能使用pimp-my-library模式。

For example, you can get the nth element of a collection as follows:

例如，您可以按如下方式获取集合的第 n 个元素：

  class Pimp[A, Repr <% TraversableLike[A, Repr]](self : Repr) {

    def nth(n : Int)(implicit ord : Ordering[A]) : A = {
      val trav : TraversableLike[A, Repr] = self
      var ltp : List[A] = Nil
      var etp : List[A] = Nil
      var mtp : List[A] = Nil
      trav.headOption match {
        case None      => error("Cannot get " + n + " element of empty collection")
        case Some(piv) =>
          trav.foreach { a =>
            val cf = ord.compare(piv, a)
            if (cf == 0) etp ::= a
            else if (cf > 0) ltp ::= a
            else mtp ::= a
          }
          if (n < ltp.length)
            new Pimp[A, List[A]](ltp.reverse).nth(n)(ord)
          else if (n < (ltp.length + etp.length))
            piv
          else
            new Pimp[A, List[A]](mtp.reverse).nth(n - ltp.length - etp.length)(ord)
      }
    }
  }

(This is not very functional; sorry)

（这不是很实用；抱歉）

It's then trivial to get the top nelements:

然后获取顶级n元素很简单：

def topN(n : Int)(implicit ord : Ordering[A], bf : CanBuildFrom[Repr, A, Repr]) ={
  val b = bf()
  val elem = new Pimp[A, Repr](self).nth(n)(ord)
  import util.control.Breaks._
  breakable {
    var soFar = 0
    self.foreach { tt =>
      if (ord.compare(tt, elem) < 0) {
         b += tt
         soFar += 1
      }
    }
    assert (soFar <= n)
    if (soFar < n) {
      self.foreach { tt =>
        if (ord.compare(tt, elem) == 0) {
          b += tt
          soFar += 1
        }
        if (soFar == n) break
      }
    }

  }
  b.result()
}

Unfortunately I'm having trouble getting this pimp to be discovered via this implicit:

不幸的是，我无法通过这种隐含的方式发现这个皮条客：

implicit def t2n[A, Repr <% TraversableLike[A, Repr]](t : Repr) : Pimp[A, Repr] 
  = new Pimp[A, Repr](t)

I get this:

我明白了：

scala> List(4, 3, 6, 7, 1, 2, 8, 5).topN(4)
<console>:9: error: could not find implicit value for evidence parameter of type (List[Int]) => scala.collection.TraversableLike[A,List[Int]]
   List(4, 3, 6, 7, 1, 2, 8, 5).topN(4)
       ^

However, the code actually works OK:

但是，代码实际上工作正常：

scala> new Pimp(List(4, 3, 6, 7, 1, 2, 8, 5)).topN(4)
res3: List[Int] = List(3, 1, 2, 4)

And

和

scala> new Pimp("ioanusdhpisjdmpsdsvfgewqw").topN(6)
res2: java.lang.String = adddfe

Answer 5

回答by thoredge

If the goal is to not sort the whole list then you could do something like this (of course it could be optimized a tad so that we don't change the list when the number clearly shouldn't be there):

如果目标是不对整个列表进行排序，那么您可以执行以下操作（当然可以稍微优化一下，以便在数字显然不应该存在时我们不会更改列表）：

List(1,6,3,7,3,2).foldLeft(List[Int]()){(l, n) => (n :: l).sorted.take(2)}

Answer 6

回答by michid

I implemented such an ranking algorithm recently in the Rankclass of Apache Hymanrabbit (in Java though). See the takemethod for the gist of it. The basic idea is to quicksort but terminate prematurely as soon as the top nelements have been found.

我最近在Apache Hymanrabbit的Rank类中实现了这样的排名算法（虽然是在 Java 中）。请参阅take方法以了解其要点。基本思想是快速排序，但一旦n找到顶部元素就过早终止。

Answer 7

回答by ponkin

Here is asymptotically O(n)solution.

这是渐近O(n)解。

def top[T](data: List[T], n: Int)(implicit ord: Ordering[T]): List[T] = {
    require( n < data.size)

    def partition_inner(shuffledData: List[T], pivot: T): List[T] = 
      shuffledData.partition( e => ord.compare(e, pivot) > 0 ) match {
          case (left, right) if left.size == n => left
          case (left, x :: rest) if left.size < n => 
            partition_inner(util.Random.shuffle(data), x)
          case (left @ y :: rest, right) if left.size > n => 
            partition_inner(util.Random.shuffle(data), y)
      }

     val shuffled = util.Random.shuffle(data)
     partition_inner(shuffled, shuffled.head)
}

scala> top(List.range(1,10000000), 5)

Due to recursion, this solution will take longer than some non-linear solutions above and can cause java.lang.OutOfMemoryError: GC overhead limit exceeded. But slightly more readable IMHO and functional style. Just for job interview ;).

由于递归，此解决方案将比上述一些非线性解决方案花费更长的时间，并可能导致java.lang.OutOfMemoryError: GC overhead limit exceeded. 但恕我直言和功能风格稍微更具可读性。仅用于面试;)。

What is more important, that this solution can be easily parallelized.

更重要的是，该解决方案可以轻松并行化。

def top[T](data: List[T], n: Int)(implicit ord: Ordering[T]): List[T] = {
    require( n < data.size)

    @tailrec
    def partition_inner(shuffledData: List[T], pivot: T): List[T] = 
      shuffledData.par.partition( e => ord.compare(e, pivot) > 0 ) match {
          case (left, right) if left.size == n => left.toList
          case (left, right) if left.size < n => 
            partition_inner(util.Random.shuffle(data), right.head)
          case (left, right) if left.size > n => 
            partition_inner(util.Random.shuffle(data), left.head)
      }

     val shuffled = util.Random.shuffle(data)
     partition_inner(shuffled, shuffled.head)
}

Answer 8

回答by Ravi Teja

An optimised solution using PriorityQueuewith Time Complexity of O(nlogk). In the approach given in the update, you are sorting the sofarlist every time which is not needed and below it is optimised by using PriorityQueue.

使用PriorityQueue时间复杂度为的优化解决方案O(nlogk)。在更新中给出的方法中，您sofar每次都对不需要的列表进行排序，并使用PriorityQueue.

import scala.language.implicitConversions
import scala.language.reflectiveCalls
import collection.mutable.PriorityQueue
implicit def iterExt[A](iter: Iterable[A]) = new {
    def top[B](n: Int, f: A => B)(implicit ord: Ordering[B]) : List[A] = {
        def updateSofar (sofar: PriorityQueue[A], el: A): PriorityQueue[A] = {
            if (ord.compare(f(el), f(sofar.head)) < 0){
                sofar.dequeue
                sofar.enqueue(el)
            }
            sofar
        }

        val (sofar, rest) = iter.splitAt(n)
        (PriorityQueue(sofar.toSeq:_*)( Ordering.by( (x :A) => f(x) ) ) /: rest) (updateSofar (_, _)).dequeueAll.toList.reverse
    }
}

case class A(s: String, i: Int)
val li = List (4, 3, 6, 7, 1, 2, 9, 5).map(i => A(i.toString(), i))
println(li.top(3, -_.i))

Answer 9

回答by huynhjl

For small values of nand large lists, getting the top nelements can be implemented by picking out the max element ntimes:

对于小值n和大列表，n可以通过选取最大元素n次数来获取顶部元素：

def top[T](n:Int, iter:Iterable[T])(implicit ord: Ordering[T]): Iterable[T] = {
  def partitionMax(acc: Iterable[T], it: Iterable[T]): Iterable[T]  = {
    val max = it.max(ord)
    val (nextElems, rest) = it.partition(ord.gteq(_, max))
    val maxElems = acc ++ nextElems
    if (maxElems.size >= n || rest.isEmpty) maxElems.take(n)
    else partitionMax(maxElems, rest)
  }
  if (iter.isEmpty) iter.take(0)
  else partitionMax(iter.take(0), iter)
}

This does not sort the entire list and takes an Ordering. I believe every method I call in partitionMaxis O(list size) and I only expect to call it ntimes at most, so the overall efficiency for small nwill be proportional to the size of the iterator.

这不会对整个列表进行排序，而是需要一个Ordering. 我相信我调用的每个方法partitionMax都是 O(list size) 并且我只希望n最多调用它次数，因此 small 的整体效率n将与迭代器的大小成正比。

scala> top(5, List.range(1,1000000))
res13: Iterable[Int] = List(999999, 999998, 999997, 999996, 999995)

scala> top(5, List.range(1,1000000))(Ordering[Int].on(- _))
res14: Iterable[Int] = List(1, 2, 3, 4, 5)

You could also add a branch for when ngets close to size of the iterable, and switch to iter.toList.sortBy(_.myAttr).take(n).

您还可以n在接近可迭代大小时添加一个分支，然后切换到iter.toList.sortBy(_.myAttr).take(n).

It does not return the type of collection provided, but you can look at How do I apply the enrich-my-library pattern to Scala collections?if this is a requirement.

它不返回提供的集合类型，但您可以查看如何将丰富我的库模式应用于 Scala 集合？如果这是一个要求。

获取 Scala Iterable 的前 n 个元素的最简单方法

提问by Stefan Endrullis

Update:

更新：

采纳答案by user unknown

update (abstraction):

更新（抽象）：

回答by shellholic

回答by Tom Wang

回答by oxbow_lakes

回答by thoredge

回答by michid

回答by ponkin

回答by Ravi Teja

回答by huynhjl

相关推荐

最近更新

标签

获取 Scala Iterable 的前 n 个元素的最简单方法

提问by Stefan Endrullis

Update:

更新：

采纳答案by user unknown

update (abstraction):

更新（抽象）：

回答by shellholic

回答by Tom Wang

回答by oxbow_lakes

回答by thoredge

回答by michid

回答by ponkin

回答by Ravi Teja

回答by huynhjl

相关推荐

scala SBT 停止运行而不退出

使用 Scala 进行函数式 try & catch

scala 流 vs 视图 vs 迭代器

scala 你如何在没有硬编码的情况下使用 Cake 模式进行依赖注入？

相关推荐

最近更新

标签