获取 Scala Iterable 的前 n 个元素的最简单方法
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/5674741/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
Simplest way to get the top n elements of a Scala Iterable
提问by Stefan Endrullis
Is there a simple and efficient solution to determine the top n elements of a Scala Iterable? I mean something like
是否有一个简单有效的解决方案来确定 Scala Iterable 的前 n 个元素?我的意思是像
iter.toList.sortBy(_.myAttr).take(2)
but without having to sort all elements when only the top 2 are of interest. Ideally I'm looking for something like
但是当只对前 2 个元素感兴趣时不必对所有元素进行排序。理想情况下,我正在寻找类似的东西
iter.top(2, _.myAttr)
see also: Solution for the top element using an Ordering: In Scala, how to use Ordering[T] with List.min or List.max and keep code readable
另请参阅:使用排序的顶部元素的解决方案:在 Scala 中,如何使用带有 List.min 或 List.max 的 Ordering[T] 并保持代码可读
Update:
更新:
Thank you all for your solutions. Finally, I took the original solution of user unknownand adopted it to use Iterableand the pimp-my-librarypattern:
谢谢大家的解决方案。最后,我采用了user unknown的原始解决方案并采用了它Iterable和pimp-my-library模式:
implicit def iterExt[A](iter: Iterable[A]) = new {
def top[B](n: Int, f: A => B)(implicit ord: Ordering[B]): List[A] = {
def updateSofar (sofar: List [A], el: A): List [A] = {
//println (el + " - " + sofar)
if (ord.compare(f(el), f(sofar.head)) > 0)
(el :: sofar.tail).sortBy (f)
else sofar
}
val (sofar, rest) = iter.splitAt(n)
(sofar.toList.sortBy (f) /: rest) (updateSofar (_, _)).reverse
}
}
case class A(s: String, i: Int)
val li = List (4, 3, 6, 7, 1, 2, 9, 5).map(i => A(i.toString(), i))
println(li.top(3, _.i))
采纳答案by user unknown
My solution (bound to Int, but should be easily changed to Ordered (a few minutes please):
我的解决方案(绑定到 Int,但应该很容易更改为 Ordered(请几分钟):
def top (n: Int, li: List [Int]) : List[Int] = {
def updateSofar (sofar: List [Int], el: Int) : List [Int] = {
// println (el + " - " + sofar)
if (el < sofar.head)
(el :: sofar.tail).sortWith (_ > _)
else sofar
}
/* better readable:
val sofar = li.take (n).sortWith (_ > _)
val rest = li.drop (n)
(sofar /: rest) (updateSofar (_, _)) */
(li.take (n). sortWith (_ > _) /: li.drop (n)) (updateSofar (_, _))
}
usage:
用法:
val li = List (4, 3, 6, 7, 1, 2, 9, 5)
top (2, li)
- For above list, take the first 2 (4, 3) as starting TopTen (TopTwo).
- Sort them, such that the first element is the bigger one (if any).
- repeatedly iterate through the rest of the list (li.drop(n)), and compare the current element with the maximum of the list of minimums; replace, if neccessary, and resort again.
- Improvements:
- Throw away Int, and use ordered.
- Throw away (_ > _) and use a user-Ordering to allow BottomTen. (Harder: pick the middle 10 :) )
- Throw away List, and use Iterable instead
- 对于上面的列表,取前 2 (4, 3) 个作为起始 TopTen (TopTwo)。
- 对它们进行排序,以便第一个元素是较大的元素(如果有)。
- 重复遍历列表的其余部分 (li.drop(n)),并将当前元素与最小值列表的最大值进行比较;如有必要,更换并再次使用。
- 改进:
- 扔掉Int,使用ordered。
- 扔掉 (_ > _) 并使用用户排序来允许BottomTen。(更难:选择中间的 10 个 :) )
- 扔掉 List,改用 Iterable
update (abstraction):
更新(抽象):
def extremeN [T](n: Int, li: List [T])
(comp1: ((T, T) => Boolean), comp2: ((T, T) => Boolean)):
List[T] = {
def updateSofar (sofar: List [T], el: T) : List [T] =
if (comp1 (el, sofar.head))
(el :: sofar.tail).sortWith (comp2 (_, _))
else sofar
(li.take (n) .sortWith (comp2 (_, _)) /: li.drop (n)) (updateSofar (_, _))
}
/* still bound to Int:
def top (n: Int, li: List [Int]) : List[Int] = {
extremeN (n, li) ((_ < _), (_ > _))
}
def bottom (n: Int, li: List [Int]) : List[Int] = {
extremeN (n, li) ((_ > _), (_ < _))
}
*/
def top [T] (n: Int, li: List [T])
(implicit ord: Ordering[T]): Iterable[T] = {
extremeN (n, li) (ord.lt (_, _), ord.gt (_, _))
}
def bottom [T] (n: Int, li: List [T])
(implicit ord: Ordering[T]): Iterable[T] = {
extremeN (n, li) (ord.gt (_, _), ord.lt (_, _))
}
top (3, li)
bottom (3, li)
val sl = List ("Haus", "Garten", "Boot", "Sumpf", "X", "y", "xkcd", "x11")
bottom (2, sl)
To replace List with Iterable seems to be a bit harder.
用 Iterable 替换 List 似乎有点困难。
As Daniel C. Sobral pointed out in the comments, a high nin topN can lead to much sorting work, so that it could be useful, to do a manual insertion sort instead of repeatedly sorting the whole list of top-n elements:
正如 Daniel C. Sobral 在评论中指出的那样,ntopN 中的高会导致大量排序工作,因此执行手动插入排序而不是重复排序整个 top-n 元素列表可能很有用:
def extremeN [T](n: Int, li: List [T])
(comp1: ((T, T) => Boolean), comp2: ((T, T) => Boolean)):
List[T] = {
def sortedIns (el: T, list: List[T]): List[T] =
if (list.isEmpty) List (el) else
if (comp2 (el, list.head)) el :: list else
list.head :: sortedIns (el, list.tail)
def updateSofar (sofar: List [T], el: T) : List [T] =
if (comp1 (el, sofar.head))
sortedIns (el, sofar.tail)
else sofar
(li.take (n) .sortWith (comp2 (_, _)) /: li.drop (n)) (updateSofar (_, _))
}
top/bottom method and usage as above. For small groups of top/bottom Elements, the sorting is rarely called, a few times in the beginning, and then less and less often over time. For example, 70 times with top (10) of 10 000, and 90 times with top (10) of 100 000.
顶部/底部方法和用法同上。对于一小组顶部/底部元素,很少会调用排序,开始时会调用几次,然后随着时间的推移越来越少。例如,顶部 (10) 为 10 000 时为 70 次,顶部 (10) 为 100 000 时为 90 次。
回答by shellholic
Yet another version:
还有一个版本:
val big = (1 to 100000)
def maxes[A](n:Int)(l:Traversable[A])(implicit o:Ordering[A]) =
l.foldLeft(collection.immutable.SortedSet.empty[A]) { (xs,y) =>
if (xs.size < n) xs + y
else {
import o._
val first = xs.firstKey
if (first < y) xs - first + y
else xs
}
}
println(maxes(4)(big))
println(maxes(2)(List("a","ab","c","z")))
Using the Setforce the list to have unique values:
使用Set强制列表具有唯一值:
def maxes2[A](n:Int)(l:Traversable[A])(implicit o:Ordering[A]) =
l.foldLeft(List.empty[A]) { (xs,y) =>
import o._
if (xs.size < n) (y::xs).sort(lt _)
else {
val first = xs.head
if (first < y) (y::(xs - first)).sort(lt _)
else xs
}
}
回答by Tom Wang
Here's another solution that is simple and has pretty good performance.
这是另一个简单且性能非常好的解决方案。
def pickTopN[T](k: Int, iterable: Iterable[T])(implicit ord: Ordering[T]): Seq[T] {
val q = collection.mutable.PriorityQueue[T](iterable.toSeq:_*)
val end = Math.min(k, q.size)
(1 to end).map(_ => q.dequeue())
}
The Big O is O(n + k log n), where k <= n. So the performance is linear for small kand at worst n log n.
大 O 是O(n + k log n)哪里k <= n。因此,对于 smallk和最坏的情况,性能是线性的n log n。
The solution can also be optimized to be O(k)for memory but O(n log k)for performance. The idea is to use a MinHeap to track only the top kitems at all times. Here's the solution.
该解决方案还可以针对O(k)内存O(n log k)和性能进行优化。这个想法是使用 MinHeap 始终只跟踪顶部的k项目。这是解决方案。
def pickTopN[A, B](n: Int, iterable: Iterable[A], f: A => B)(implicit ord: Ordering[B]): Seq[A] = {
val seq = iterable.toSeq
val q = collection.mutable.PriorityQueue[A](seq.take(n):_*)(ord.on(f).reverse) // initialize with first n
// invariant: keep the top k scanned so far
seq.drop(n).foreach(v => {
q += v
q.dequeue()
})
q.dequeueAll.reverse
}
回答by oxbow_lakes
You don't need to sort the entire collection in order to determine the top N elements. However, I don't believe that this functionality is supplied by the raw library, so you would have to roll you own, possibly using the pimp-my-librarypattern.
您不需要对整个集合进行排序来确定前 N 个元素。但是,我不相信这个功能是由原始库提供的,所以你必须自己动手,可能使用pimp-my-library模式。
For example, you can get the nth element of a collection as follows:
例如,您可以按如下方式获取集合的第 n 个元素:
class Pimp[A, Repr <% TraversableLike[A, Repr]](self : Repr) {
def nth(n : Int)(implicit ord : Ordering[A]) : A = {
val trav : TraversableLike[A, Repr] = self
var ltp : List[A] = Nil
var etp : List[A] = Nil
var mtp : List[A] = Nil
trav.headOption match {
case None => error("Cannot get " + n + " element of empty collection")
case Some(piv) =>
trav.foreach { a =>
val cf = ord.compare(piv, a)
if (cf == 0) etp ::= a
else if (cf > 0) ltp ::= a
else mtp ::= a
}
if (n < ltp.length)
new Pimp[A, List[A]](ltp.reverse).nth(n)(ord)
else if (n < (ltp.length + etp.length))
piv
else
new Pimp[A, List[A]](mtp.reverse).nth(n - ltp.length - etp.length)(ord)
}
}
}
(This is not very functional; sorry)
(这不是很实用;抱歉)
It's then trivial to get the top nelements:
然后获取顶级n元素很简单:
def topN(n : Int)(implicit ord : Ordering[A], bf : CanBuildFrom[Repr, A, Repr]) ={
val b = bf()
val elem = new Pimp[A, Repr](self).nth(n)(ord)
import util.control.Breaks._
breakable {
var soFar = 0
self.foreach { tt =>
if (ord.compare(tt, elem) < 0) {
b += tt
soFar += 1
}
}
assert (soFar <= n)
if (soFar < n) {
self.foreach { tt =>
if (ord.compare(tt, elem) == 0) {
b += tt
soFar += 1
}
if (soFar == n) break
}
}
}
b.result()
}
Unfortunately I'm having trouble getting this pimp to be discovered via this implicit:
不幸的是,我无法通过这种隐含的方式发现这个皮条客:
implicit def t2n[A, Repr <% TraversableLike[A, Repr]](t : Repr) : Pimp[A, Repr]
= new Pimp[A, Repr](t)
I get this:
我明白了:
scala> List(4, 3, 6, 7, 1, 2, 8, 5).topN(4)
<console>:9: error: could not find implicit value for evidence parameter of type (List[Int]) => scala.collection.TraversableLike[A,List[Int]]
List(4, 3, 6, 7, 1, 2, 8, 5).topN(4)
^
However, the code actually works OK:
但是,代码实际上工作正常:
scala> new Pimp(List(4, 3, 6, 7, 1, 2, 8, 5)).topN(4)
res3: List[Int] = List(3, 1, 2, 4)
And
和
scala> new Pimp("ioanusdhpisjdmpsdsvfgewqw").topN(6)
res2: java.lang.String = adddfe
回答by thoredge
If the goal is to not sort the whole list then you could do something like this (of course it could be optimized a tad so that we don't change the list when the number clearly shouldn't be there):
如果目标是不对整个列表进行排序,那么您可以执行以下操作(当然可以稍微优化一下,以便在数字显然不应该存在时我们不会更改列表):
List(1,6,3,7,3,2).foldLeft(List[Int]()){(l, n) => (n :: l).sorted.take(2)}
回答by michid
I implemented such an ranking algorithm recently in the Rankclass of Apache Hymanrabbit (in Java though). See the takemethod for the gist of it. The basic idea is to quicksort but terminate prematurely as soon as the top nelements have been found.
我最近在Apache Hymanrabbit的Rank类中实现了这样的排名算法(虽然是在 Java 中)。请参阅take方法以了解其要点。基本思想是快速排序,但一旦n找到顶部元素就过早终止。
回答by ponkin
Here is asymptotically O(n)solution.
这是渐近O(n)解。
def top[T](data: List[T], n: Int)(implicit ord: Ordering[T]): List[T] = {
require( n < data.size)
def partition_inner(shuffledData: List[T], pivot: T): List[T] =
shuffledData.partition( e => ord.compare(e, pivot) > 0 ) match {
case (left, right) if left.size == n => left
case (left, x :: rest) if left.size < n =>
partition_inner(util.Random.shuffle(data), x)
case (left @ y :: rest, right) if left.size > n =>
partition_inner(util.Random.shuffle(data), y)
}
val shuffled = util.Random.shuffle(data)
partition_inner(shuffled, shuffled.head)
}
scala> top(List.range(1,10000000), 5)
Due to recursion, this solution will take longer than some non-linear solutions above and can cause java.lang.OutOfMemoryError: GC overhead limit exceeded.
But slightly more readable IMHO and functional style. Just for job interview ;).
由于递归,此解决方案将比上述一些非线性解决方案花费更长的时间,并可能导致java.lang.OutOfMemoryError: GC overhead limit exceeded. 但恕我直言和功能风格稍微更具可读性。仅用于面试;)。
What is more important, that this solution can be easily parallelized.
更重要的是,该解决方案可以轻松并行化。
def top[T](data: List[T], n: Int)(implicit ord: Ordering[T]): List[T] = {
require( n < data.size)
@tailrec
def partition_inner(shuffledData: List[T], pivot: T): List[T] =
shuffledData.par.partition( e => ord.compare(e, pivot) > 0 ) match {
case (left, right) if left.size == n => left.toList
case (left, right) if left.size < n =>
partition_inner(util.Random.shuffle(data), right.head)
case (left, right) if left.size > n =>
partition_inner(util.Random.shuffle(data), left.head)
}
val shuffled = util.Random.shuffle(data)
partition_inner(shuffled, shuffled.head)
}
回答by Ravi Teja
An optimised solution using PriorityQueuewith Time Complexity of O(nlogk). In the approach given in the update, you are sorting the sofarlist every time which is not needed and below it is optimised by using PriorityQueue.
使用PriorityQueue时间复杂度为的优化解决方案O(nlogk)。在更新中给出的方法中,您sofar每次都对不需要的列表进行排序,并使用PriorityQueue.
import scala.language.implicitConversions
import scala.language.reflectiveCalls
import collection.mutable.PriorityQueue
implicit def iterExt[A](iter: Iterable[A]) = new {
def top[B](n: Int, f: A => B)(implicit ord: Ordering[B]) : List[A] = {
def updateSofar (sofar: PriorityQueue[A], el: A): PriorityQueue[A] = {
if (ord.compare(f(el), f(sofar.head)) < 0){
sofar.dequeue
sofar.enqueue(el)
}
sofar
}
val (sofar, rest) = iter.splitAt(n)
(PriorityQueue(sofar.toSeq:_*)( Ordering.by( (x :A) => f(x) ) ) /: rest) (updateSofar (_, _)).dequeueAll.toList.reverse
}
}
case class A(s: String, i: Int)
val li = List (4, 3, 6, 7, 1, 2, 9, 5).map(i => A(i.toString(), i))
println(li.top(3, -_.i))
回答by huynhjl
For small values of nand large lists, getting the top nelements can be implemented by picking out the max element ntimes:
对于小值n和大列表,n可以通过选取最大元素n次数来获取顶部元素:
def top[T](n:Int, iter:Iterable[T])(implicit ord: Ordering[T]): Iterable[T] = {
def partitionMax(acc: Iterable[T], it: Iterable[T]): Iterable[T] = {
val max = it.max(ord)
val (nextElems, rest) = it.partition(ord.gteq(_, max))
val maxElems = acc ++ nextElems
if (maxElems.size >= n || rest.isEmpty) maxElems.take(n)
else partitionMax(maxElems, rest)
}
if (iter.isEmpty) iter.take(0)
else partitionMax(iter.take(0), iter)
}
This does not sort the entire list and takes an Ordering. I believe every method I call in partitionMaxis O(list size) and I only expect to call it ntimes at most, so the overall efficiency for small nwill be proportional to the size of the iterator.
这不会对整个列表进行排序,而是需要一个Ordering. 我相信我调用的每个方法partitionMax都是 O(list size) 并且我只希望n最多调用它次数,因此 small 的整体效率n将与迭代器的大小成正比。
scala> top(5, List.range(1,1000000))
res13: Iterable[Int] = List(999999, 999998, 999997, 999996, 999995)
scala> top(5, List.range(1,1000000))(Ordering[Int].on(- _))
res14: Iterable[Int] = List(1, 2, 3, 4, 5)
You could also add a branch for when ngets close to size of the iterable, and switch to iter.toList.sortBy(_.myAttr).take(n).
您还可以n在接近可迭代大小时添加一个分支,然后切换到iter.toList.sortBy(_.myAttr).take(n).
It does not return the type of collection provided, but you can look at How do I apply the enrich-my-library pattern to Scala collections?if this is a requirement.
它不返回提供的集合类型,但您可以查看如何将丰富我的库模式应用于 Scala 集合?如果这是一个要求。

