Python PySpark - Sum a column in a dataframe and return the result as an int

Question

Asked by Bryce Ramgovind

I have a PySpark dataframe with a column of numbers. I need to sum that column and have the result returned as an int in a Python variable.

df = spark.createDataFrame([("A", 20), ("B", 30), ("D", 80)],["Letter", "Number"])

I do the following to sum the column.

df.groupBy().sum()

But I get a dataframe back.

+-----------+
|sum(Number)|
+-----------+
|        130|
+-----------+

I would like 130 returned as an int stored in a variable, to be used elsewhere in the program.

result = 130

Answer 1

Answered by Olivier Darrouzet

I think this is the simplest way:

df.groupBy().sum().collect()

This will return a list. In your example:

In [9]: df.groupBy().sum().collect()[0][0]
Out[9]: 130

Answer 2

Answered by Aron Asztalos

The simplest way really is:

df.groupBy().sum().collect()

But it is a very slow operation. Avoid groupByKey; use the RDD API with reduceByKey instead:

df.rdd.map(lambda x: (1,x[1])).reduceByKey(lambda x,y: x + y).collect()[0][1]

I tried this on a bigger dataset and measured the processing time:

RDD and reduceByKey: 2.23 s

GroupByKey: 30.5 s

Answer 3

Answered by Ali AzG

Here is another way to do this, using agg and collect:

sum_number = df.agg({"Number":"sum"}).collect()[0]

result = sum_number["sum(Number)"]

Answer 4

Answered by LaSul

If you want a specific column:

import pyspark.sql.functions as F

df.agg(F.sum("my_column")).collect()[0][0]

Answer 5

Answered by seasee my

Sometimes, when you read a CSV file into a PySpark DataFrame, a numeric column ends up as a string type (for example '23'). In that case, use pyspark.sql.functions.sum rather than sum() to get the result:

import pyspark.sql.functions as F
df.groupBy().agg(F.sum('Number')).show()

Answer 6

Answered by ags29

The following should work:

df.groupBy().sum().rdd.map(lambda x: x[0]).collect()
