Python: get value out of a dataframe
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license, link to the original, and attribute it to the original authors (not me) on StackOverflow.
Original question: http://stackoverflow.com/questions/38058950/
get value out of dataframe
Asked by M.Rez
In Scala I can do get(#) or getAs[Type](#) to get values out of a dataframe. How should I do it in pyspark?
I have a DataFrame with two columns: item (string) and salesNum (integer). I do a groupby and mean to get the mean of those numbers, like this:
saleDF.groupBy("salesNum").mean().collect()
and it works. Now I have the mean in a dataframe with one value.
How can I get that value out of the dataframe as a float?
Answered by David
collect() returns your results as a Python list. To get the value out of the list, you just need to take the first element, like this:
saleDF.groupBy("salesNum").mean().collect()[0]
Answered by Francesco Boi
To be precise, collect returns a list whose elements are of type pyspark.sql.types.Row.
In your case, to extract the actual value you should do:
saleDF.groupBy("salesNum").mean().collect()[0]["avg(yourColumnName)"]
where yourColumnName is the name of the column you are taking the mean of (pyspark, when applying mean, renames the resulting column this way by default).
As an example, I ran the following code. Look at the types and outputs of each step.
>>> columns = ['id', 'dogs', 'cats', 'nation']
>>> vals = [
... (2, 0, 1, 'italy'),
... (1, 2, 0, 'italy'),
... (3, 4, 0, 'france')
... ]
>>> df = sqlContext.createDataFrame(vals, columns)
>>> df.groupBy("nation").mean("dogs").collect()
[Row(nation=u'france', avg(dogs)=4.0), Row(nation=u'italy', avg(dogs)=1.0)]
>>> df.groupBy("nation").mean("dogs").collect()[0]
Row(nation=u'france', avg(dogs)=4.0)
>>> df.groupBy("nation").mean("dogs").collect()[0]["avg(dogs)"]
4.0
>>> type(df.groupBy("nation").mean("dogs").collect())
<type 'list'>
>>> type(df.groupBy("nation").mean("dogs").collect()[0])
<class 'pyspark.sql.types.Row'>
>>> type(df.groupBy("nation").mean("dogs").collect()[0]["avg(dogs)"])
<type 'float'>