How to randomly sample from a Scala list or array?

Question

Asked by Carter

I want to randomly sample from a Scala list or array (not an RDD). The sample size can be much larger than the length of the list or array. How can I do this efficiently? The sample size can be very large, and the sampling (on different lists/arrays) needs to be done a large number of times.

I know for a Spark RDD we can use takeSample() to do it, is there an equivalent for Scala list/array?

Thank you very much.

Answer 1

Answered by Marius Soutier

An easy-to-understand version would look like this:

import scala.util.Random

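// Shuffle, then keep the first n elements. This samples without replacement,
// so at most list.size elements come back even if n is larger.
Random.shuffle(list).take(n)
Random.shuffle(array.toList).take(n)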

// Seeded version
val r = new Random(seed)
r.shuffle(...)

Answer 2

Answered by Felix

For arrays:

import scala.util.Random
import scala.reflect.ClassTag

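def takeSample[T: ClassTag](a: Array[T], n: Int, seed: Long): Array[T] = {
  val rnd = new Random(seed)
  // Draw n elements with replacement: each slot gets a uniformly chosen element of a.
  Array.fill(n)(a(rnd.nextInt(a.length)))
}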

Create a random number generator (rnd) from your seed. Then fill a new array of length n, drawing for each slot a random index from 0 (inclusive) to the size of your array (exclusive).

The last step is applying each random index to the indexing operator of your input array. Using it in the REPL could look as follows:

scala> val myArray = Array(1,3,5,7,8,9,10)
myArray: Array[Int] = Array(1, 3, 5, 7, 8, 9, 10)

scala> takeSample(myArray,20,System.currentTimeMillis)
res0: scala.collection.mutable.ArraySeq[Int] = ArraySeq(7, 8, 7, 3, 8, 3, 9, 1, 7, 10, 7, 10, 1, 1, 3, 1, 7, 1, 3, 7)

For lists, I would simply convert the list to an Array and use the same function. I doubt you can get much more efficient for lists anyway.

It is important to note that the same function applied directly to a list would take O(n^2) time, because indexing into a List is linear rather than constant time, whereas converting the list to an array first takes O(n) time.
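The convert-once idea can be sketched for any Seq as follows. This is only an illustration under stated assumptions: the name sampleWithReplacement and its rnd parameter are hypothetical and do not come from the answer, and Vector is used in place of Array so that no ClassTag is needed.

```scala
import scala.util.Random

// Hypothetical helper (not from the answer above): sample n elements with
// replacement from any Seq, paying the conversion cost only once.
def sampleWithReplacement[T](xs: Seq[T], n: Int, rnd: Random = new Random()): Vector[T] = {
  require(xs.nonEmpty || n <= 0, "cannot sample from an empty collection")
  val indexed = xs.toVector // one linear copy; indexing is effectively constant time afterwards
  Vector.fill(math.max(n, 0))(indexed(rnd.nextInt(indexed.length)))
}
```

Passing a seeded generator, for example sampleWithReplacement(List(1, 3, 5), 20, new Random(42L)), makes the sample reproducible.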

Answer 3

Answered by KevinKatz

If you want to sample without replacement: zip with randoms, sort (O(n*log(n))), discard the randoms, and take the first few elements.

import scala.util.Random
val l = Seq("a", "b", "c", "d", "e")
val ran = l.map(x => (Random.nextFloat(), x))
  .sortBy(_._1)
  .map(_._2)
  .take(3)
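When only a few elements are needed from a long sequence, the sort can be avoided. The sketch below swaps in a different technique, a partial Fisher-Yates shuffle, which is not what the answer uses. The name sampleWithoutReplacement and the rnd parameter are hypothetical.

```scala
import scala.reflect.ClassTag
import scala.util.Random

// Hypothetical helper: partial Fisher-Yates shuffle. Only the first k
// positions are shuffled, so the work is one copy of xs plus k swaps.
def sampleWithoutReplacement[T: ClassTag](xs: Seq[T], n: Int, rnd: Random = new Random()): Array[T] = {
  val buf = xs.toArray                          // working copy; the input is left untouched
  val k = math.min(math.max(n, 0), buf.length)  // cannot draw more than xs.size distinct positions
  var i = 0
  while (i < k) {
    val j = i + rnd.nextInt(buf.length - i)     // uniform pick from the not-yet-chosen suffix
    val tmp = buf(i); buf(i) = buf(j); buf(j) = tmp
    i += 1
  }
  buf.take(k)
}
```

As with the zip-and-sort version, the result never contains the same position twice, so asking for more elements than the input holds simply returns a permutation of the whole input.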

Answer 4

Answered by elm

Using a for comprehension, for a given array xs, as follows:

for (i <- 1 to sampleSize; r = (Math.random() * xs.size).toInt) yield xs(r)

Note the random generator here produces values within the unit interval, which are scaled to range over the size of the array and converted to Int for indexing into the array.

Note: For a purely functional random generator, consider for instance the State Monad approach from Functional Programming in Scala, discussed here.

Note: Consider also NICTA, another purely functional random value generator; its use is illustrated for instance here.
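To give a flavour of the purely functional style mentioned in the notes, here is a minimal sketch that threads the generator state explicitly rather than using a State monad. Everything in it is an assumption for illustration: SimpleRNG and pureSample are hypothetical names, the generator is a plain linear congruential step, and reducing the raw value with floorMod introduces a slight modulo bias.

```scala
import scala.annotation.tailrec

// Hypothetical minimal pure generator: a linear congruential step using the
// same multiplier and increment as java.util.Random. Nothing is mutated;
// each call returns the value together with the next generator.
final case class SimpleRNG(seed: Long) {
  def nextInt: (Int, SimpleRNG) = {
    val newSeed = (seed * 0x5DEECE66DL + 0xBL) & 0xFFFFFFFFFFFFL
    ((newSeed >>> 16).toInt, SimpleRNG(newSeed))
  }
}

// Sample n elements with replacement, returning the sample and the generator
// to use for the next call. xs is assumed to be non-empty.
def pureSample[T](xs: IndexedSeq[T], n: Int, rng: SimpleRNG): (Vector[T], SimpleRNG) = {
  @tailrec
  def loop(remaining: Int, r: SimpleRNG, acc: Vector[T]): (Vector[T], SimpleRNG) =
    if (remaining <= 0) (acc, r)
    else {
      val (raw, next) = r.nextInt
      val idx = Math.floorMod(raw, xs.length) // maps any Int into [0, xs.length)
      loop(remaining - 1, next, acc :+ xs(idx))
    }
  loop(n, rng, Vector.empty)
}
```

Because the generator is an ordinary value, calling pureSample twice with the same SimpleRNG yields the same sample, which makes the sampling easy to test.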

Answer 5

Answered by thomas pocreau

Using classical recursion.

import scala.util.Random

def takeSample[T](a: List[T], n: Int): List[T] =
  n match {
    case _ if n <= 0 => List.empty[T]
    // Not tail-recursive: each sample adds a stack frame, so a very large n can overflow the stack.
    case _ => a(Random.nextInt(a.size)) :: takeSample(a, n - 1)
  }
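Since the question mentions very large sample sizes, a tail-recursive variant may be safer. The following is a sketch rather than the answerer's code: takeSampleTailRec and its rnd parameter are hypothetical names, and the list is copied into a Vector once so that each lookup no longer walks the list.

```scala
import scala.annotation.tailrec
import scala.util.Random

// Hypothetical variant of the recursive takeSample: still sampling with
// replacement, but the recursion runs in constant stack space.
// Like the original, it throws if a is empty and n > 0.
def takeSampleTailRec[T](a: List[T], n: Int, rnd: Random = new Random()): List[T] = {
  val indexed = a.toVector // one linear copy instead of a linear lookup per sample
  @tailrec
  def loop(remaining: Int, acc: List[T]): List[T] =
    if (remaining <= 0) acc
    else loop(remaining - 1, indexed(rnd.nextInt(indexed.length)) :: acc)
  loop(n, Nil)
}
```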

Answer 6

Answered by Darren Bishop

package your.pkg

import your.pkg.SeqHelpers.SampleOps

import scala.collection.generic.CanBuildFrom
import scala.collection.mutable
import scala.language.{higherKinds, implicitConversions}
import scala.util.Random

trait SeqHelpers {

  implicit def withSampleOps[E, CC[_] <: Seq[_]](cc: CC[E]): SampleOps[E, CC] = SampleOps(cc)
}

object SeqHelpers extends SeqHelpers {

  case class SampleOps[E, CC[_] <: Seq[_]](cc: CC[_]) {

    private def recurse(n: Int, builder: mutable.Builder[E, CC[E]]): CC[E] = n match {
      case 0 => builder.result
      case _ =>
        val element = cc(Random.nextInt(cc.size)).asInstanceOf[E]
        recurse(n - 1, builder += element)
    }

    def sample(n: Int)(implicit cbf: CanBuildFrom[CC[_], E, CC[E]]): CC[E] = {
      require(n >= 0, "Cannot take less than 0 samples")
      recurse(n, cbf.apply)
    }
  }
}

Either:

Mix in SeqHelpers, for example, with a ScalaTest spec
Include import your.pkg.SeqHelpers._

Then the following should work:

Seq(1 to 100: _*) sample 10 foreach { println }

Edits to remove the cast are welcome.

Also if there is a way to create an empty instance of the collection for the accumulator, without knowing the concrete type ahead of time, please comment. That said, the builder is probably more efficient.

Answer 7

Answered by ruhsuzbaykus

I did not test for performance, but the following code is a simple and elegant way to do the sampling, and I believe it can help many who come here just to get sampling code. Just change the range according to the size of your final sample. If pseudo-randomness is not enough for your needs, you can use take(1) on the inner list and increase the range.

Random.shuffle((1 to 100).toList.flatMap(x => (Random.shuffle(yourList))))
