Scala: Getting the value of a DataFrame column in Spark
Original source: http://stackoverflow.com/questions/46348617/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Getting the value of a DataFrame column in Spark
Asked by Ayan Biswas
I am trying to retrieve the value of a DataFrame column and store it in a variable. I tried this:
val name = df.select("name")
val name1 = name.collect()
But neither of these returns the value of the column "name".
Spark version: 2.2.0, Scala version: 2.11.11
Answered by Avishek Bhattacharya
There are a couple of things here. If you want to see all the data, collect is the way to go. However, if your data is too large, it will cause the driver to fail.
So the alternative is to check a few items from the DataFrame. What I generally do is
df.limit(10).select("name").as[String].collect()
This returns an array of up to 10 elements. But the output doesn't look good.
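Two details are easy to miss with the one-liner above: as[String] needs an implicit Encoder[String] in scope, which normally comes from import spark.implicits._, and printing the resulting Array directly shows only a JVM object reference, which may be why the output "doesn't look good". A minimal sketch, assuming a local SparkSession and made-up sample rows; the object CollectNames and the helper firstNames are hypothetical names, not part of the original answer:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object CollectNames {
  // Hypothetical helper: returns up to n values of the "name" column as plain strings.
  def firstNames(df: DataFrame, n: Int): Array[String] = {
    val spark = df.sparkSession
    import spark.implicits._ // supplies the Encoder[String] that as[String] needs
    df.limit(n).select("name").as[String].collect()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("collect-names").getOrCreate()
    import spark.implicits._

    // Made-up data standing in for the asker's DataFrame.
    val df = Seq(("Alice", 30), ("Bob", 25)).toDF("name", "age")

    val names = firstNames(df, 10)
    // mkString gives readable output instead of the array's object reference.
    println(names.mkString(", "))

    spark.stop()
  }
}
```

The values end up in an ordinary Array[String] on the driver, so they can be stored in a variable and used like any other Scala collection.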
So, the second alternative is
df.select("name").show(10)
This prints the first 10 elements. If the column values are long, show truncates them and puts "..." in place of the rest of the value, which is annoying.
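The truncation can be switched off: show has an overload that takes a second Boolean argument, truncate. A minimal sketch, assuming a local SparkSession and made-up data; the object name ShowFullValues is hypothetical:

```scala
import org.apache.spark.sql.SparkSession

object ShowFullValues {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("show-full-values").getOrCreate()
    import spark.implicits._

    // Made-up data: the first value is longer than show's default 20-character limit.
    val df = Seq("a name that is definitely longer than twenty characters", "Bob").toDF("name")

    df.select("name").show(10)        // default: long values are cut short and end in "..."
    df.select("name").show(10, false) // second argument is truncate; false prints full values

    spark.stop()
  }
}
```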
Hence there is a third option
df.select("name").take(10).foreach(println)
This takes 10 elements and prints them.
In all of these cases you won't get a fair sample of the data, because the first 10 rows are picked. To truly pick rows at random from the DataFrame you can use
df.select("name").sample(true, .2).show(10)
or
df.select("name").sample(true, .2).take(10).foreach(println)
You can check the sample function on DataFrame for the details.
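When using sample, keep in mind that in the Scala API the first argument is withReplacement and the second is fraction, and that fraction is an expected proportion rather than an exact row count. A minimal sketch, assuming a local SparkSession and made-up data; it samples without replacement and passes a fixed seed for repeatable results, which is a choice made here and not part of the answer. The object SampleNames and helper sampleNames are hypothetical names:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object SampleNames {
  // Hypothetical helper: draws a random sample of the "name" column and returns at most n values.
  def sampleNames(df: DataFrame, fraction: Double, seed: Long, n: Int): Array[String] = {
    val spark = df.sparkSession
    import spark.implicits._ // Encoder[String] for as[String]
    df.select("name")
      .sample(false, fraction, seed) // (withReplacement, fraction, seed)
      .as[String]
      .take(n)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("sample-names").getOrCreate()
    import spark.implicits._

    // Made-up data: 1000 synthetic names.
    val df = (1 to 1000).map(i => s"name-$i").toDF("name")

    sampleNames(df, 0.2, 42L, 10).foreach(println)

    spark.stop()
  }
}
```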
Answered by T. Gawęda
The first will do :)
val name = df.select("name") returns another DataFrame. You can call, for example, name.show() to display the contents of that DataFrame. You can also call collect (or collectAsMap, which exists on pair RDDs rather than on DataFrames) to materialize the results on the driver, but be aware that the amount of data should not be too large for the driver.
You can also do:
val names = df.select("name").as[String].collect()
This will return an array of the names in this DataFrame.
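For comparison, the untyped route from the question also works once the values are pulled out of the rows: collect() on a DataFrame returns Array[Row], not the column values themselves, so each Row has to be unpacked. A minimal sketch, assuming the "name" column holds non-null strings; the object RowValues and its helpers are hypothetical names:

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}

object RowValues {
  // collect() yields Array[Row]; getString(0) reads the single selected column of each row.
  def namesFromRows(df: DataFrame): Array[String] = {
    val rows: Array[Row] = df.select("name").collect()
    rows.map(_.getString(0))
  }

  // Only one value: head() fetches the first row (it throws if the DataFrame is empty).
  def firstName(df: DataFrame): String =
    df.select("name").head().getAs[String]("name")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("row-values").getOrCreate()
    import spark.implicits._

    // Made-up data standing in for the asker's DataFrame.
    val df = Seq(("Alice", 30), ("Bob", 25)).toDF("name", "age")

    println(namesFromRows(df).mkString(", "))
    println(firstName(df))

    spark.stop()
  }
}
```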

