Java Spark DataFrame - Select n random rows

Note: This page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same license and attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/39344769/


Spark DataFrame - Select n random rows

java | apache-spark | dataframe

Asked by lte__

I have a dataframe with multiple thousands of records, and I'd like to randomly select 1000 rows into another dataframe for demoing. How can I do this in Java?


Thank you!


Accepted answer by T. Gawęda

You can try the sample() method. Unfortunately, it takes a fraction rather than a row count, so you can write a function like this:


import org.apache.spark.sql.Dataset

def getRandom(dataset: Dataset[_], n: Int) = {
  val count = dataset.count()
  // Cap the fraction at 1.0 when n exceeds the number of rows
  val howManyTake = if (count > n) n else count
  dataset.sample(withReplacement = false, 1.0 * howManyTake / count).limit(n)
}

Explanation: we must pass a fraction of the data to sample(). If we have 2000 rows and you want 100 of them, the fraction is 100/2000 = 0.05. If you ask for more rows than the DataFrame contains, the fraction must be 1.0. The limit() call is there so that rounding inside sample() can't leave you with more rows than you specified.

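Since the question is tagged java, here is a minimal Java sketch of the same approach (the method name and the assumption that you work with a Dataset<Row> mirror the Scala version above and are not from the original answer):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Sample roughly n rows by fraction, then trim to exactly n with limit()
public static Dataset<Row> getRandom(Dataset<Row> dataset, int n) {
    long count = dataset.count();
    long howManyTake = count > n ? n : count;
    return dataset.sample(false, 1.0 * howManyTake / count).limit(n);
}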

Edit: I see the takeSample method in another answer. But remember:


  1. It's a method of RDD, not Dataset, so you must go through the underlying RDD, e.g. dataset.rdd.takeSample(false, 1000, System.currentTimeMillis()), and then build a DataFrame back from the sampled rows (see the Java sketch after this list). takeSample will collect all sampled values.
  2. Remember that if you want to get very many rows you will run into OutOfMemoryError, as takeSample collects its results on the driver. Use it carefully.
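A minimal Java sketch of that route, assuming a SparkSession named spark and a Dataset<Row> named dataset (both names are placeholders, not from the original answer):

import java.util.List;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// takeSample returns a local list on the driver, not a distributed Dataset
List<Row> sampled = dataset.javaRDD().takeSample(false, 1000, System.currentTimeMillis());
// Rebuild a DataFrame from the collected rows, reusing the original schema
Dataset<Row> demo = spark.createDataFrame(sampled, dataset.schema());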

Answer by apatry

You can shuffle the rows and then take the top ones:


import org.apache.spark.sql.functions.rand

dataset.orderBy(rand()).limit(n)
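The same idea in Java (a sketch; dataset is assumed to be a Dataset<Row> and n the number of rows you want):

import static org.apache.spark.sql.functions.rand;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Shuffle the whole dataset randomly, then keep the first n rows
Dataset<Row> demo = dataset.orderBy(rand()).limit(n);

Note that orderBy(rand()) sorts the entire dataset, so it can be expensive on very large inputs.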