Java Spark DataFrame - Select n random rows

Note: This page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same license and attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/39344769/


Spark DataFrame - Select n random rows

java | apache-spark | dataframe

Asked by lte__

I have a dataframe with multiple thousands of records, and I'd like to randomly select 1000 rows into another dataframe for demoing. How can I do this in Java?


Thank you!


Accepted answer by T. Gawęda

You can try the sample() method. Unfortunately, it takes a fraction rather than a row count, so you can write a function like this:


import org.apache.spark.sql.Dataset

def getRandom(dataset: Dataset[_], n: Int) = {
  val count = dataset.count()
  // Cap the fraction at 1.0 when n exceeds the number of rows
  val howManyTake = if (count > n) n else count
  dataset.sample(withReplacement = false, 1.0 * howManyTake / count).limit(n)
}

Explanation: we must pass a fraction of the data to sample(). If we have 2000 rows and you want 100 of them, the fraction is 100/2000 = 0.05. If you ask for more rows than the DataFrame contains, the fraction must be 1.0. The limit() call is there so that rounding inside sample() can't leave you with more rows than you specified.

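Since the question is tagged java, here is a minimal Java sketch of the same approach (the method name and the assumption that you work with a Dataset<Row> mirror the Scala version above and are not from the original answer):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Sample roughly n rows by fraction, then trim to exactly n with limit()
public static Dataset<Row> getRandom(Dataset<Row> dataset, int n) {
    long count = dataset.count();
    long howManyTake = count > n ? n : count;
    return dataset.sample(false, 1.0 * howManyTake / count).limit(n);
}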

Edit: I see the takeSample method in another answer. But remember:


  1. It's a method of RDD, not Dataset, so you must go through the underlying RDD, e.g. dataset.rdd.takeSample(false, 1000, System.currentTimeMillis()), and then build a DataFrame back from the sampled rows (see the Java sketch after this list). takeSample will collect all sampled values.
  2. Remember that if you want to get very many rows you will run into OutOfMemoryError, as takeSample collects its results on the driver. Use it carefully.
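A minimal Java sketch of that route, assuming a SparkSession named spark and a Dataset<Row> named dataset (both names are placeholders, not from the original answer):

import java.util.List;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// takeSample returns a local list on the driver, not a distributed Dataset
List<Row> sampled = dataset.javaRDD().takeSample(false, 1000, System.currentTimeMillis());
// Rebuild a DataFrame from the collected rows, reusing the original schema
Dataset<Row> demo = spark.createDataFrame(sampled, dataset.schema());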

Answer by apatry

You can shuffle the rows and then take the top ones:


import org.apache.spark.sql.functions.rand

dataset.orderBy(rand()).limit(n)
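The same idea in Java (a sketch; dataset is assumed to be a Dataset<Row> and n the number of rows you want):

import static org.apache.spark.sql.functions.rand;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Shuffle the whole dataset randomly, then keep the first n rows
Dataset<Row> demo = dataset.orderBy(rand()).limit(n);

Note that orderBy(rand()) sorts the entire dataset, so it can be expensive on very large inputs.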