How to randomly sample from a Scala list or array?
Notice: this page is a Chinese-English side-by-side translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): Stack Overflow
Original URL: http://stackoverflow.com/questions/32932229/
Asked by Carter
I want to randomly sample from a Scala list or array (not an RDD). The sample size can be much larger than the length of the list or array. How can I do this efficiently? The sample size can be very big, and the sampling (on different lists/arrays) needs to be done a large number of times.
I know that for a Spark RDD we can use takeSample() to do it. Is there an equivalent for a Scala list/array?
Thank you very much.
Answer by Marius Soutier
An easy-to-understand version would look like this:
import scala.util.Random
Random.shuffle(list).take(n)
Random.shuffle(array.toList).take(n)
// Seeded version
val r = new Random(seed)
r.shuffle(...)
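Note that shuffling and then taking n elements samples without replacement, so the result can never be longer than the input. A minimal sketch of the seeded variant follows; the object name ShuffleSample and the function name sample are hypothetical, chosen only for illustration:

```scala
import scala.util.Random

object ShuffleSample {
  // Hypothetical helper: shuffle with a seeded generator, then keep the first n.
  // This samples WITHOUT replacement, so at most xs.size elements come back.
  def sample[T](xs: List[T], n: Int, seed: Long): List[T] =
    new Random(seed).shuffle(xs).take(n)
}
```

Calling ShuffleSample.sample twice with the same list and seed should give the same result, which can help when a test needs reproducible samples.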
Answer by Felix
For arrays:
import scala.util.Random
import scala.reflect.ClassTag
def takeSample[T: ClassTag](a: Array[T], n: Int, seed: Long): Array[T] = {
  val rnd = new Random(seed)
  Array.fill(n)(a(rnd.nextInt(a.size)))
}
Create a random number generator (rnd) from your seed. Then fill an array of length n, drawing for each slot a random index from 0 (inclusive) until the size of your input array (exclusive).
The last step is applying each random value to the indexing operator of your input array. Using it in the REPL could look as follows:
scala> val myArray = Array(1,3,5,7,8,9,10)
myArray: Array[Int] = Array(1, 3, 5, 7, 8, 9, 10)
scala> takeSample(myArray,20,System.currentTimeMillis)
res0: scala.collection.mutable.ArraySeq[Int] = ArraySeq(7, 8, 7, 3, 8, 3, 9, 1, 7, 10, 7, 10,
1, 1, 3, 1, 7, 1, 3, 7)
For lists, I would simply convert the list to Array and use the same function. I doubt you can get much more efficient for lists anyway.
It is important to note that the same function applied directly to a list would take O(n^2) time, because indexing into a List is linear in the index, whereas converting the list to an array first takes O(n) time.
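The conversion idea can be sketched as follows. This is only an illustration under stated assumptions: the object name ListSample is hypothetical, and a Vector is used instead of an Array (a swap made here so that no ClassTag is needed); an Array would work the same way.

```scala
import scala.util.Random

object ListSample {
  // Sample n elements WITH replacement from a List.
  // The list is copied once into an indexed structure so that each
  // lookup is cheap instead of walking the list from its head.
  def takeSample[T](xs: List[T], n: Int, rnd: Random): Vector[T] = {
    require(xs.nonEmpty || n == 0, "cannot sample from an empty list")
    val indexed = xs.toVector // one pass over the list
    Vector.fill(n)(indexed(rnd.nextInt(indexed.size)))
  }
}
```

Because the generator is passed in, the caller decides whether to seed it, for example ListSample.takeSample(myList, 20, new Random(seed)).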
Answer by KevinKatz
If you want to sample without replacement: zip with random numbers, sort (O(n*log(n))), discard the random numbers, and take the first elements.
import scala.util.Random
val l = Seq("a", "b", "c", "d", "e")
val ran = l.map(x => (Random.nextFloat(), x))
  .sortBy(_._1)
  .map(_._2)
  .take(3)
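The same zip, sort, discard, take steps could be wrapped in a reusable function. The sketch below is an assumption-laden variant, not the answer's own code: the names SortSample and sampleWithoutReplacement are hypothetical, and Long keys replace the Float keys of the snippet above to make key collisions less likely.

```scala
import scala.util.Random

object SortSample {
  // Pair each element with a random key, sort by the key, drop the keys,
  // and keep the first k elements. This samples WITHOUT replacement.
  // Long keys are an assumption of this sketch (the snippet above uses Float).
  def sampleWithoutReplacement[T](xs: Seq[T], k: Int, rnd: Random = new Random): Seq[T] =
    xs.map(x => (rnd.nextLong(), x))
      .sortBy(_._1)
      .map(_._2)
      .take(k)
}
```

Passing a seeded Random as the third argument should make the result reproducible.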
Answer by elm
Using a for comprehension, for a given array xs, as follows:
for (i <- 1 to sampleSize; r = (Math.random * xs.size).toInt) yield xs(r)
Note that the random generator here produces values within the unit interval, which are scaled to range over the size of the array and converted to Int for indexing into the array.
Note: For a purely functional random generator, consider for instance the State Monad approach from Functional Programming in Scala, discussed here.
Note: Consider also NICTA, another purely functional random value generator; its use is illustrated for instance here.
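To give a rough idea of what a purely functional generator looks like, here is a minimal sketch in the spirit of the book's approach: the generator is an immutable value, and every draw returns both a number and the next generator. It threads the state by hand rather than through a State Monad, and the names PureSample, SimpleRNG and sample are assumptions of this sketch. Reducing the Int with floorMod introduces a slight modulo bias, which is accepted here for brevity.

```scala
object PureSample {
  // Immutable linear congruential generator: each call returns a value
  // together with the next generator state; nothing is mutated.
  final case class SimpleRNG(seed: Long) {
    def nextInt: (Int, SimpleRNG) = {
      val newSeed = (seed * 0x5DEECE66DL + 0xBL) & 0xFFFFFFFFFFFFL
      ((newSeed >>> 16).toInt, SimpleRNG(newSeed))
    }
  }

  // Draw n elements with replacement, passing the generator state along
  // explicitly and returning the final state to the caller.
  def sample[T](xs: Vector[T], n: Int, rng: SimpleRNG): (Vector[T], SimpleRNG) = {
    require(xs.nonEmpty || n <= 0, "cannot sample from an empty collection")
    (0 until n).foldLeft((Vector.empty[T], rng)) { case ((acc, g), _) =>
      val (i, next) = g.nextInt
      // floorMod maps any Int (including negatives) into 0 until xs.size
      (acc :+ xs(java.lang.Math.floorMod(i, xs.size)), next)
    }
  }
}
```

Since the function has no hidden state, calling it twice with the same generator value returns the same sample.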
Answer by thomas pocreau
Using classical recursion.
import scala.util.Random
def takeSample[T](a: List[T], n: Int): List[T] = {
  n match {
    case n: Int if n <= 0 => List.empty[T]
    case n: Int => a(Random.nextInt(a.size)) :: takeSample(a, n - 1)
  }
}
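The recursion above is not tail-recursive (the cons happens after the recursive call returns), so a very large n may overflow the stack, and each a(...) lookup walks the list. A possible variant, sketched under the assumption that copying the list into a Vector once is acceptable, uses an accumulator and @tailrec; the names RecursiveSample and loop are hypothetical:

```scala
import scala.annotation.tailrec
import scala.util.Random

object RecursiveSample {
  // Sample n elements with replacement using a tail-recursive helper.
  def takeSample[T](a: List[T], n: Int, rnd: Random = new Random): List[T] = {
    val indexed = a.toVector // copy once so lookups do not walk the list

    @tailrec
    def loop(remaining: Int, acc: List[T]): List[T] =
      if (remaining <= 0) acc
      else loop(remaining - 1, indexed(rnd.nextInt(indexed.size)) :: acc)

    loop(n, Nil)
  }
}
```

The compiler checks the @tailrec annotation, so the helper runs in constant stack space even for large sample sizes.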
Answer by Darren Bishop
package your.pkg
import your.pkg.SeqHelpers.SampleOps
import scala.collection.generic.CanBuildFrom
import scala.collection.mutable
import scala.language.{higherKinds, implicitConversions}
import scala.util.Random
trait SeqHelpers {
  implicit def withSampleOps[E, CC[_] <: Seq[_]](cc: CC[E]): SampleOps[E, CC] = SampleOps(cc)
}
object SeqHelpers extends SeqHelpers {
  case class SampleOps[E, CC[_] <: Seq[_]](cc: CC[_]) {
    private def recurse(n: Int, builder: mutable.Builder[E, CC[E]]): CC[E] = n match {
      case 0 => builder.result
      case _ =>
        val element = cc(Random.nextInt(cc.size)).asInstanceOf[E]
        recurse(n - 1, builder += element)
    }
    def sample(n: Int)(implicit cbf: CanBuildFrom[CC[_], E, CC[E]]): CC[E] = {
      require(n >= 0, "Cannot take less than 0 samples")
      recurse(n, cbf.apply)
    }
  }
}
Either:
- Mix in SeqHelpers, for example, with a Scalatest spec
- Include import your.pkg.SeqHelpers._
Then the following should work:
Seq(1 to 100: _*) sample 10 foreach { println }
Edits to remove the cast are welcome.
Also if there is a way to create an empty instance of the collection for the accumulator, without knowing the concrete type ahead of time, please comment. That said, the builder is probably more efficient.
Answer by ruhsuzbaykus
I did not test for performance, but the following code is a simple and elegant way to do the sampling, and I believe it can help many who come here just to get sampling code. Just change the "range" according to the size of your end sample. If pseudo-randomness is not enough for your needs, you can use take(1) in the inner list and increase the range.
Random.shuffle((1 to 100).toList.flatMap(x => (Random.shuffle(yourList))))

