
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/42966590/

Date: 2020-10-22 09:09:19  Source: igfitidea

How do we rank dataframe?

scala, apache-spark, apache-spark-sql

Asked by user3293666

I have a sample dataframe as below:

Input:

accountNumber   assetValue  
A100            1000         
A100            500          
B100            600          
B100            200          

Output:

accountNumber   assetValue  Rank
A100            1000         1
A100            500          2
B100            600          1
B100            200          2

Now my question is: how do we add this rank column to the dataframe, partitioned by account number? I am not expecting a huge volume of rows, so I am open to ideas even if I need to do it outside of the dataframe.

I am using Spark version 1.5 with SQLContext, hence I cannot use window functions.
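Since the question explicitly allows doing the ranking outside of the dataframe for a small number of rows, here is a minimal sketch using plain Scala collections (no Spark APIs at all). It assumes the (accountNumber, assetValue) pairs have already been collected to the driver, which the asker says is acceptable for their data volume:

```scala
// Rank rows per account without window functions, using plain Scala collections.
// Assumes the (accountNumber, assetValue) pairs fit comfortably in driver memory.
val rows = Seq(("A100", 1000), ("A100", 500), ("B100", 600), ("B100", 200))

val ranked: Seq[(String, Int, Int)] = rows
  .groupBy(_._1)                       // partition by accountNumber
  .toSeq
  .flatMap { case (account, group) =>
    group
      .sortBy(-_._2)                   // order by assetValue descending
      .zipWithIndex                    // 0-based position within the group
      .map { case ((_, value), i) => (account, value, i + 1) }
  }
  .sortBy(r => (r._1, r._3))           // present output by account, then rank

ranked.foreach(println)
// (A100,1000,1)
// (A100,500,2)
// (B100,600,1)
// (B100,200,2)
```

The result could then be turned back into a dataframe with `toDF("accountNumber", "assetValue", "rank")` if needed.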

Answered by Psidom

You can use the row_number function with a Window expression, in which you can specify the partition and order columns:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number
// In spark-shell the implicits below are already in scope; in an application,
// import spark.implicits._ (or sqlContext.implicits._ on 1.x) for toDF and $.

val df = Seq(("A100", 1000), ("A100", 500), ("B100", 600), ("B100", 200)).toDF("accountNumber", "assetValue")

df.withColumn("rank", row_number().over(
  Window.partitionBy($"accountNumber").orderBy($"assetValue".desc)
)).show

+-------------+----------+----+
|accountNumber|assetValue|rank|
+-------------+----------+----+
|         A100|      1000|   1|
|         A100|       500|   2|
|         B100|       600|   1|
|         B100|       200|   2|
+-------------+----------+----+

Answered by Nayan Sharma

Raw SQL:


val df = sc.parallelize(Seq(
  ("A100", 1000), ("A100", 500), ("B100", 600), ("B100", 200)
)).toDF("accountNumber", "assetValue")

df.registerTempTable("df")
sqlContext.sql("SELECT accountNumber, assetValue, RANK() OVER (PARTITION BY accountNumber ORDER BY assetValue DESC) AS rank FROM df").show


+-------------+----------+----+
|accountNumber|assetValue|rank|
+-------------+----------+----+
|         A100|      1000|   1|
|         A100|       500|   2|
|         B100|       600|   1|
|         B100|       200|   2|
+-------------+----------+----+
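One caveat with this answer: RANK() and row_number() only agree when there are no ties. RANK() gives tied rows the same rank and then skips ahead (1, 2, 2, 4), while row_number() always numbers rows sequentially, breaking ties arbitrarily (1, 2, 3, 4). A small plain-Scala illustration of the two numbering schemes on a hypothetical value list with a tie:

```scala
// Tie handling: row_number vs rank, illustrated on a descending-sorted value list.
val values = Seq(900, 700, 700, 500)

// row_number: strictly sequential, ties broken arbitrarily
val rowNumbers = values.zipWithIndex.map { case (v, i) => (v, i + 1) }

// rank: tied values share a rank; the next distinct value skips ahead
val ranks = values.map(v => (v, values.indexOf(v) + 1))

println(rowNumbers)  // List((900,1), (700,2), (700,3), (500,4))
println(ranks)       // List((900,1), (700,2), (700,2), (500,4))
```

For the sample data in the question the two produce identical output, since no account has duplicate asset values; pick RANK(), DENSE_RANK(), or ROW_NUMBER() in the SQL according to the tie semantics you want.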