
Warning: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverflow. Original question: http://stackoverflow.com/questions/39504950/

Date: 2020-08-19 22:21:21  Source: igfitidea

python, pyspark : get sum of a pyspark dataframe column values

python, pyspark, pyspark-sql

Asked by Satya

Say I have a dataframe like this:


name age city
abc   20  A
def   30  B

I want to add a summary row at the end of the dataframe, so the result will look like:


name age city
abc   20  A
def   30  B
All   50  All

The string 'All' I can easily put in, but how do I get sum(df['age'])? It fails with "Column object is not iterable".


data = spark.createDataFrame([("abc", 20, "A"), ("def", 30, "B")], ["name", "age", "city"])
data.printSchema()
# root
#  |-- name: string (nullable = true)
#  |-- age: long (nullable = true)
#  |-- city: string (nullable = true)
res = data.union(spark.createDataFrame([('All', sum(data['age']), 'All')], data.columns))  # TypeError: Column is not iterable
# Also tried data['age'].sum() and got an error. Using a literal [('All', 50, 'All')] works fine.

I usually work with Pandas dataframes and am new to Spark, so my understanding of Spark dataframes may not be that mature yet.


Please suggest how to get the sum over a dataframe column in PySpark, and whether there is a better way to add/append a row to the end of a dataframe. Thanks.


Answer by swenzel

Spark SQL has a dedicated module for column functions, pyspark.sql.functions.
It works like this:


from pyspark.sql import functions as F
data = spark.createDataFrame([("abc", 20, "A"), ("def", 30, "B")],["name", "age", "city"])

res = data.unionAll(  # in Spark 2.0+, unionAll is a deprecated alias of union
    data.select([
        F.lit('All').alias('name'),   # create a column named 'name' filled with 'All'
        F.sum(data.age).alias('age'), # get the sum of 'age'
        F.lit('All').alias('city')    # create a column named 'city' filled with 'All'
    ]))
res.show()

Prints:


+----+---+----+
|name|age|city|
+----+---+----+
| abc| 20|   A|
| def| 30|   B|
| All| 50| All|
+----+---+----+

Answer by GwydionFR

A dataframe is immutable; you need to create a new one. To get the sum of the age column, you can use this function: data.rdd.map(lambda x: float(x["age"])).reduce(lambda x, y: x + y)


The way you add a row is fine, but why would you do such a thing? Your dataframe will be hard to manipulate, and you won't be able to use aggregation functions unless you drop the last row.
