Original question: http://stackoverflow.com/questions/33516490/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Column alias after groupBy in pyspark
Asked by mhn
I need the resulting data frame from the line below to have the alias "maxDiff" for the max('diff') column after groupBy. However, the line below neither makes any change nor throws an error.
grpdf = joined_df.groupBy(temp1.datestamp).max('diff').alias("maxDiff")
Accepted answer by Nhor
This is because you are aliasing the whole DataFrame object, not the Column. Here's an example of how to alias the Column only:
import pyspark.sql.functions as func
grpdf = joined_df \
    .groupBy(temp1.datestamp) \
    .max('diff') \
    .select(func.col("max(diff)").alias("maxDiff"))
Answer by zero323
You can use agg instead of calling the max method:
from pyspark.sql.functions import max
joined_df.groupBy(temp1.datestamp).agg(max("diff").alias("maxDiff"))
Similarly in Scala:
import org.apache.spark.sql.functions.max
joined_df.groupBy($"datestamp").agg(max("diff").alias("maxDiff"))
or
joined_df.groupBy($"datestamp").agg(max("diff").as("maxDiff"))
Answer by vk1011
In addition to the answers already here, the following are also convenient if you know the name of the aggregated column, and they don't require importing from pyspark.sql.functions:
1
# Backticks make Spark SQL read max(diff) as a column name, not a call to max()
grouped_df = joined_df.groupBy(temp1.datestamp) \
    .max('diff') \
    .selectExpr('`max(diff)` AS maxDiff')
See the docs for info on .selectExpr()
2
grouped_df = joined_df.groupBy(temp1.datestamp) \
    .max('diff') \
    .withColumnRenamed('max(diff)', 'maxDiff')
See the docs for info on .withColumnRenamed()
This answer here goes into more detail: https://stackoverflow.com/a/34077809
Answer by Nilay Bhardwaj
You can use:
from pyspark.sql.functions import col
grpdf.select(col("max(diff)").alias("maxdiff"), col("sum(DIFF)").alias("sumdiff")).show()