PySpark - Sum a column in dataframe and return results as int
Source: http://stackoverflow.com/questions/47812526/ (StackOverflow). Content is provided under the CC BY-SA 4.0 license: you are free to use and share it, but you must attribute it to the original authors.
Asked by Bryce Ramgovind
I have a pyspark dataframe with a column of numbers. I need to sum that column and then have the result return as an int in a python variable.
df = spark.createDataFrame([("A", 20), ("B", 30), ("D", 80)],["Letter", "Number"])
I do the following to sum the column.
df.groupBy().sum()
But I get a dataframe back.
+-----------+
|sum(Number)|
+-----------+
| 130|
+-----------+
I would like 130 returned as an int stored in a variable, to be used elsewhere in the program.
result = 130
Answered by Olivier Darrouzet
I think the simplest way:
df.groupBy().sum().collect()
will return a list. In your example:
In [9]: df.groupBy().sum().collect()[0][0]
Out[9]: 130
Answered by Aron Asztalos
The simplest way, really:
df.groupBy().sum().collect()
But it is a very slow operation. Avoid groupByKey; you should use the RDD and reduceByKey instead:
df.rdd.map(lambda x: (1,x[1])).reduceByKey(lambda x,y: x + y).collect()[0][1]
I tried it on a bigger dataset and measured the processing time:
RDD and ReduceByKey : 2.23 s
GroupByKey: 30.5 s
Answered by Ali AzG
This is another way you can do this, using agg and collect:
sum_number = df.agg({"Number":"sum"}).collect()[0]
result = sum_number["sum(Number)"]
Answered by LaSul
If you want a specific column:
import pyspark.sql.functions as F
df.agg(F.sum("my_column")).collect()[0][0]
Answered by seasee my
Sometimes, when you read a CSV file into a PySpark DataFrame, a numeric column may come through as a string type (e.g. '23'). In that case, you should use pyspark.sql.functions.sum to get the sum, not sum():
import pyspark.sql.functions as F
df.groupBy().agg(F.sum('Number')).show()
Answered by ags29
The following should work:
df.groupBy().sum().rdd.map(lambda x: x[0]).collect()