Python: get a value out of a DataFrame

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same license and attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/38058950/

Date: 2020-08-19 20:15:49 · Source: igfitidea

get value out of dataframe

Tags: python, pyspark, type-conversion, apache-spark-sql

Asked by M.Rez

In Scala I can do get(#) or getAs[Type](#) to get values out of a dataframe. How should I do it in pyspark?


I have a two-column DataFrame: item (string) and salesNum (integer). I do a groupBy and mean to get the mean of those numbers, like this:


saleDF.groupBy("salesNum").mean().collect()


and it works. Now I have the mean in a one-row dataframe.


How can I get that value out of the dataframe to get the mean as a float number?


Answered by David

collect() returns your results as a Python list. To get the value out of the list, you just need to take the first element, like this:


saleDF.groupBy("salesNum").mean().collect()[0]

Answered by Francesco Boi

To be precise, collect returns a list whose elements are of type pyspark.sql.types.Row.


In your case, to extract the actual value you should do:


saleDF.groupBy("salesNum").mean().collect()[0]["avg(yourColumnName)"]

where yourColumnName is the name of the column you are taking the mean of (pyspark, when applying mean, renames the resulting column this way by default).


As an example, I ran the following code. Look at the types and outputs of each step.


>>> columns = ['id', 'dogs', 'cats', 'nation']
>>> vals = [
...      (2, 0, 1, 'italy'),
...      (1, 2, 0, 'italy'),
...      (3, 4, 0, 'france')
... ]
>>> df = sqlContext.createDataFrame(vals, columns)
>>> df.groupBy("nation").mean("dogs").collect()
[Row(nation=u'france', avg(dogs)=4.0), Row(nation=u'italy', avg(dogs)=1.0)]
>>> df.groupBy("nation").mean("dogs").collect()[0]
Row(nation=u'france', avg(dogs)=4.0)
>>> df.groupBy("nation").mean("dogs").collect()[0]["avg(dogs)"]
4.0
>>> type(df.groupBy("nation").mean("dogs").collect())
<type 'list'>
>>> type(df.groupBy("nation").mean("dogs").collect()[0])
<class 'pyspark.sql.types.Row'>
>>> type(df.groupBy("nation").mean("dogs").collect()[0]["avg(dogs)"])
<type 'float'>