Spark DataFrame groupBy and sort in the descending order (pyspark)

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same license and attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/34514545/

python apache-spark dataframe pyspark apache-spark-sql

Asked by rclakmal

I'm using PySpark (Python 2.7.9 / Spark 1.3.1) and have a DataFrame GroupObject which I need to filter and sort in descending order. I'm trying to achieve it via this piece of code:

group_by_dataframe.count().filter("`count` >= 10").sort('count', ascending=False)

But it throws the following error.

sort() got an unexpected keyword argument 'ascending'
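
For context, here is a minimal, hypothetical sketch of how such a pipeline might be built. The DataFrame df, the word column, and the sample rows are made up for illustration, and it uses the SparkSession entry point from Spark 2.x (Spark 1.3 itself used SQLContext):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# hypothetical input: one row per word occurrence
df = spark.createDataFrame([("a",), ("a",), ("b",)], ["word"])
# groupBy returns a GroupedData object; count() turns it back into a
# DataFrame with the grouping column plus a `count` column
group_by_dataframe = df.groupBy("word")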

Accepted answer by zero323

In PySpark 1.3 the sort method doesn't take an ascending parameter. You can use the desc method instead:

from pyspark.sql.functions import col

(group_by_dataframe
    .count()
    .filter("`count` >= 10")
    .sort(col("count").desc()))

or the desc function:

from pyspark.sql.functions import desc

(group_by_dataframe
    .count()
    .filter("`count` >= 10")
    .sort(desc("count")))

Both methods can be used with Spark >= 1.3 (including Spark 2.x).
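
As a quick check, a self-contained sketch combining both forms (the sample data and the key column are made up):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, desc

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a",)] * 12 + [("b",)] * 3, ["key"])
counts = df.groupBy("key").count().filter("`count` >= 10")
counts.sort(col("count").desc()).show()  # Column.desc() method
counts.sort(desc("count")).show()        # desc() function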

Answered by Henrique Florêncio

Use orderBy:

group_by_dataframe.count().filter("`count` >= 10").orderBy('count', ascending=False)
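
orderBy (and sort) also accept a list of columns together with a matching list of ascending flags, which is convenient for multi-column sorts. A small sketch with illustrative column names:

# sort by count descending, then break ties by name ascending
df.orderBy(["count", "name"], ascending=[False, True])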

http://spark.apache.org/docs/2.0.0/api/python/pyspark.sql.html

Answered by Narendra Maru

You can also use groupBy and orderBy as follows:

from pyspark.sql.functions import desc

dataFrameWay = df.groupBy("firstName").count().withColumnRenamed("count", "distinct_name").sort(desc("distinct_name"))

Note that once the count column is renamed, the sort has to reference the new name; the original snippet imported nothing and sorted on "count", which no longer exists at that point.

Answered by gdoron is supporting Monica

By far the most convenient way is to use this:

df.orderBy(df.column_name.desc())

Doesn't require special imports.
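
One caveat: attribute access fails when the column name clashes with an existing DataFrame attribute, and the count column from this very question is such a case, since df.count is a method. Bracket indexing avoids the clash:

df.orderBy(df["count"].desc())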

Answered by Prabhath Kota

In pyspark 2.4.4

1) group_by_dataframe.count().filter("`count` >= 10").orderBy('count', ascending=False)

2) from pyspark.sql.functions import desc
   group_by_dataframe.count().filter("`count` >= 10").orderBy('count').sort(desc('count'))

No need to import anything for 1), and 1) is short and easy to read, so I prefer 1) over 2). (In 2), the initial orderBy('count') is immediately superseded by sort(desc('count')), so it does no useful work.)