Python: Column alias after groupBy in pyspark

Notice: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/33516490/

Time: 2020-08-19 13:29:54  Source: igfitidea

Column alias after groupBy in pyspark

python scala apache-spark pyspark apache-spark-sql

Asked by mhn

I need the data frame produced by the line below to use the alias "maxDiff" for the max('diff') column after groupBy. However, the line below neither changes the column name nor throws an error.

 grpdf = joined_df.groupBy(temp1.datestamp).max('diff').alias("maxDiff")

Accepted answer by Nhor

This is because you are aliasing the whole DataFrame object, not a Column. Here's an example of how to alias only the Column:

import pyspark.sql.functions as func

grpdf = joined_df \
    .groupBy(temp1.datestamp) \
    .max('diff') \
    .select(func.col("max(diff)").alias("maxDiff"))

Answer by zero323

You can use agg instead of calling the max method:

from pyspark.sql.functions import max

joined_df.groupBy(temp1.datestamp).agg(max("diff").alias("maxDiff"))

Similarly in Scala:

import org.apache.spark.sql.functions.max

joined_df.groupBy($"datestamp").agg(max("diff").alias("maxDiff"))

or

joined_df.groupBy($"datestamp").agg(max("diff").as("maxDiff"))

Answer by vk1011

In addition to the answers already here, the following approaches are also convenient if you know the name of the aggregated column, and they don't require importing anything from pyspark.sql.functions:

1

grouped_df = joined_df.groupBy(temp1.datestamp) \
                      .max('diff') \
                      .selectExpr('max(diff) AS maxDiff')

See the docs for info on .selectExpr()

2

grouped_df = joined_df.groupBy(temp1.datestamp) \
                      .max('diff') \
                      .withColumnRenamed('max(diff)', 'maxDiff')

See the docs for info on .withColumnRenamed()

This answer goes into more detail: https://stackoverflow.com/a/34077809
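As an illustrative sketch (not part of the original answer), the .withColumnRenamed() approach can be wrapped in a small helper when several aggregated columns need new names at once. The helper names default_agg_name and rename_columns are hypothetical, and the sketch assumes Spark names the aggregated column in the 'max(diff)' style shown above. The helper only relies on the DataFrame's withColumnRenamed(existing, new) method.

```python
def default_agg_name(func_name, column):
    """Build the name an aggregated column is assumed to get, e.g. 'max(diff)'."""
    return f"{func_name}({column})"


def rename_columns(df, renames):
    """Rename columns one pair at a time and return the resulting DataFrame.

    `df` is expected to be a pyspark DataFrame; `renames` maps each existing
    column name to its new name.
    """
    for old, new in renames.items():
        df = df.withColumnRenamed(old, new)
    return df


# Hypothetical usage with the data frame from the question:
# grouped_df = rename_columns(
#     joined_df.groupBy(temp1.datestamp).max('diff'),
#     {default_agg_name('max', 'diff'): 'maxDiff'},
# )
```

Because each call to withColumnRenamed returns a new DataFrame, the loop rebinds df rather than modifying the input in place.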

Answer by Nilay Bhardwaj

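You can select the aggregated columns and alias each one (call .show() on the result to display it):

from pyspark.sql.functions import col
grouped_df = grpdf.select(col("max(diff)").alias("maxdiff"), col("sum(DIFF)").alias("sumdiff"))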