Python: How to take a random row from a PySpark DataFrame?
Disclaimer: This page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use or share it, but you must do so under the same license and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/34003314/
How to take a random row from a PySpark DataFrame?
Asked by DanT
How can I get a random row from a PySpark DataFrame? I only see the method sample(), which takes a fraction as a parameter. Setting this fraction to 1/numberOfRows leads to random results, where sometimes I won't get any row.
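A minimal sketch of the issue, assuming a SparkSession named spark (the DataFrame name and size here are made up for illustration): because sample() keeps each row independently with the given probability, a fraction of 1/numberOfRows can yield zero, one, or several rows on different runs.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumed local session for this sketch
demo = spark.range(100)                     # hypothetical 100-row DataFrame

# Each of the 100 rows is kept independently with probability 1/100,
# so the number of sampled rows varies between runs and is often 0.
print(demo.sample(False, 1.0 / demo.count()).count())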
On an RDD there is a method takeSample() that takes as a parameter the number of elements you want the sample to contain. I understand that this might be slow, as you have to count each partition, but is there a way to get something like this on a DataFrame?
Accepted answer by zero323
You can simply call takeSample on an RDD:
df = sqlContext.createDataFrame(
    [(1, "a"), (2, "b"), (3, "c"), (4, "d")], ("k", "v"))
df.rdd.takeSample(False, 1, seed=0)
## [Row(k=3, v='c')]
If you don't want to collect, you can simply take a higher fraction and limit:
df.sample(False, 0.1, seed=0).limit(1)
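A small follow-up sketch, reusing the df built in the answer above: limit(1) only caps the result at one row, so you still need an action such as collect() to bring it back to the driver, and with a low fraction the result can come back empty.
# Returns a list with at most one Row; it may be empty if the
# sample happens to select no rows for this fraction and seed.
df.sample(False, 0.1, seed=0).limit(1).collect()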