Scala Spark remove duplicate rows from DataFrame

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same CC BY-SA license and attribute it to the original authors (not me) on StackOverflow. Original question: http://stackoverflow.com/questions/35498162/

Spark remove duplicate rows from DataFrame

scala, apache-spark, dataframe, apache-spark-sql

Asked by void

Assume that I have a DataFrame like:

val json = sc.parallelize(Seq("""{"a":1, "b":2, "c":22, "d":34}""","""{"a":3, "b":9, "c":22, "d":12}""","""{"a":1, "b":4, "c":23, "d":12}"""))
val df = sqlContext.read.json(json)

I want to remove duplicate rows for column "a" based on the value of column "b"; i.e., if there are duplicate rows for column "a", I want to keep the one with the larger value for "b". For the above example, after processing, I need only

{"a":3, "b":9, "c":22, "d":12}

and

{"a":1, "b":4, "c":23, "d":12}

Spark's DataFrame dropDuplicates API doesn't seem to support this. With the RDD approach, I can do a map().reduceByKey(), but what DataFrame-specific operation is there to do this?

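For reference, a minimal sketch of the map().reduceByKey() approach mentioned above might look like the following (assuming the df defined earlier; Spark's JSON reader infers the numeric columns as longs, and the keying/reduce logic here is illustrative, not taken from the original post):

// Key each row by column "a", then keep the row with the larger "b" per key.
val dedupedRdd = df.rdd
  .map(row => (row.getAs[Long]("a"), row))
  .reduceByKey((r1, r2) => if (r1.getAs[Long]("b") >= r2.getAs[Long]("b")) r1 else r2)
  .values

dedupedRdd.collect().foreach(println)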

Appreciate some help, thanks.

Answered by Pankaj Arora

You can use a window function in Spark SQL to achieve this.

df.registerTempTable("x")
sqlContext.sql("SELECT a, b, c, d FROM (SELECT *, ROW_NUMBER() OVER (PARTITION BY a ORDER BY b DESC) rn FROM x) y WHERE rn = 1").collect

This will achieve what you need. Read more about window function support here: https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html

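For comparison, a roughly equivalent sketch using the DataFrame Window API instead of raw SQL (assuming Spark 1.6+, where row_number is exposed in org.apache.spark.sql.functions; column names as in the question):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, desc, row_number}

// Partition by "a", order by "b" descending, and keep only the top-ranked row per partition.
val w = Window.partitionBy("a").orderBy(desc("b"))
val result = df
  .withColumn("rn", row_number().over(w))
  .where(col("rn") === 1)
  .drop("rn")

result.show()

Like the SQL version, this keeps exactly one row per value of "a", choosing the row with the largest "b".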