Python: column alias after groupBy in pyspark

Question

Asked by mhn

I need the data frame produced by the line below to use the alias "maxDiff" for the max('diff') column after groupBy. However, the line below neither makes any change nor throws an error.

 grpdf = joined_df.groupBy(temp1.datestamp).max('diff').alias("maxDiff")

Answer 1

Accepted answer by Nhor

This is because you are aliasing the whole DataFrame object, not a Column. Here's an example of how to alias only the Column:

import pyspark.sql.functions as func

grpdf = joined_df \
    .groupBy(temp1.datestamp) \
    .max('diff') \
    .select(func.col("max(diff)").alias("maxDiff"))

Answer 2

Answered by zero323

You can use agg instead of calling the max method:

from pyspark.sql.functions import max

joined_df.groupBy(temp1.datestamp).agg(max("diff").alias("maxDiff"))

Similarly in Scala:

import org.apache.spark.sql.functions.max

joined_df.groupBy($"datestamp").agg(max("diff").alias("maxDiff"))

or:

joined_df.groupBy($"datestamp").agg(max("diff").as("maxDiff"))

Answer 3

Answered by vk1011

In addition to the answers already here, the following are also convenient if you know the name of the aggregated column, and they don't require importing anything from pyspark.sql.functions:

1

grouped_df = joined_df.groupBy(temp1.datestamp) \
                      .max('diff') \
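                      .selectExpr('`max(diff)` AS maxDiff')

See the docs for info on .selectExpr(). The backticks make Spark read max(diff) as the name of the existing column; without them the string is parsed as a call to max on a column named diff, which no longer exists after the aggregation.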

2

grouped_df = joined_df.groupBy(temp1.datestamp) \
                      .max('diff') \
                      .withColumnRenamed('max(diff)', 'maxDiff')

See the docs for info on .withColumnRenamed()

This answer here goes into more detail: https://stackoverflow.com/a/34077809

Answer 4

Answered by Nilay Bhardwaj

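You can also alias the columns inside a select, assuming grpdf already contains the aggregated columns max(diff) and sum(DIFF). In Python, use .alias() (the `as` form is Scala syntax), and call .show() on the result separately, since .show() returns None:

from pyspark.sql.functions import col
grouped_df = grpdf.select(col("max(diff)").alias("maxdiff"), col("sum(DIFF)").alias("sumdiff"))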
