
Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must attribute the original authors (not the translator). Original: http://stackoverflow.com/questions/38377894/


how to get max(date) from given set of data grouped by some fields using pyspark?

Tags: sql, apache-spark, pyspark, apache-spark-sql, pyspark-sql

Asked by cool716

I have the data in the dataframe as below:

datetime               | userId | memberId | value
2016-04-06 16:36:...   | 1234   | 111      | 1
2016-04-06 17:35:...   | 1234   | 222      | 5
2016-04-06 17:50:...   | 1234   | 111      | 8
2016-04-06 18:36:...   | 1234   | 222      | 9
2016-04-05 16:36:...   | 4567   | 111      | 1
2016-04-06 17:35:...   | 4567   | 222      | 5
2016-04-06 18:50:...   | 4567   | 111      | 8
2016-04-06 19:36:...   | 4567   | 222      | 9

I need to find max(datetime) grouped by userId, memberId. When I tried the following:

df2 = df.groupBy('userId','memberId').max('datetime')

I'm getting this error:

org.apache.spark.sql.AnalysisException: "datetime" is not a numeric
column. Aggregation function can only be applied on a numeric column.;

The desired output is as follows:

userId | memberId | datetime
1234   |  111     | 2016-04-06 17:50:...
1234   |  222     | 2016-04-06 18:36:...
4567   |  111     | 2016-04-06 18:50:...
4567   |  222     | 2016-04-06 19:36:...

Can someone please show me how to get the max date from the given data using PySpark DataFrames?

Answered by zero323

For non-numeric but orderable types you can use agg with max directly:

from pyspark.sql.functions import col, max as max_

df = sc.parallelize([
    ("2016-04-06 16:36", 1234, 111, 1),
    ("2016-04-06 17:35", 1234, 111, 5),
]).toDF(["datetime", "userId", "memberId", "value"])

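# Cast the string column to a timestamp, then take the latest datetime per (userId, memberId) group.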
(df.withColumn("datetime", col("datetime").cast("timestamp"))
    .groupBy("userId", "memberId")
    .agg(max_("datetime")))

## +------+--------+--------------------+
## |userId|memberId|       max(datetime)|
## +------+--------+--------------------+
## |  1234|     111|2016-04-06 17:35:...|
## +------+--------+--------------------+
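If you also want the aggregated column to be named datetime, as in the desired output, rather than max(datetime), one option is to attach an alias to the aggregate. A minimal sketch, assuming df is the DataFrame built above:

from pyspark.sql.functions import col, max as max_

result = (df
    .withColumn("datetime", col("datetime").cast("timestamp"))
    .groupBy("userId", "memberId")
    # alias renames the output column from max(datetime) to datetime
    .agg(max_("datetime").alias("datetime")))

result.show()

The aggregation itself is unchanged; alias only renames the result column.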