Scala Spark remove duplicate rows from DataFrame

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same CC BY-SA license and attribute it to the original authors (not me) on StackOverflow. Original question: http://stackoverflow.com/questions/35498162/

Spark remove duplicate rows from DataFrame

scala, apache-spark, dataframe, apache-spark-sql

Asked by void

Assume that I have a DataFrame like:

val json = sc.parallelize(Seq("""{"a":1, "b":2, "c":22, "d":34}""","""{"a":3, "b":9, "c":22, "d":12}""","""{"a":1, "b":4, "c":23, "d":12}"""))
val df = sqlContext.read.json(json)

I want to remove duplicate rows for column "a" based on the value of column "b"; i.e., if there are duplicate rows for column "a", I want to keep the one with the larger value for "b". For the above example, after processing, I need only

{"a":3, "b":9, "c":22, "d":12}

and

{"a":1, "b":4, "c":23, "d":12}

Spark's DataFrame dropDuplicates API doesn't seem to support this. With the RDD approach, I can do a map().reduceByKey(), but what DataFrame-specific operation is there to do this?

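For reference, a minimal sketch of the map().reduceByKey() approach mentioned above might look like the following (assuming the df defined earlier; Spark's JSON reader infers the numeric columns as longs, and the keying/reduce logic here is illustrative, not taken from the original post):

// Key each row by column "a", then keep the row with the larger "b" per key.
val dedupedRdd = df.rdd
  .map(row => (row.getAs[Long]("a"), row))
  .reduceByKey((r1, r2) => if (r1.getAs[Long]("b") >= r2.getAs[Long]("b")) r1 else r2)
  .values

dedupedRdd.collect().foreach(println)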

Appreciate some help, thanks.

Answered by Pankaj Arora

You can use a window function in Spark SQL to achieve this.

df.registerTempTable("x")
sqlContext.sql("SELECT a, b, c, d FROM (SELECT *, ROW_NUMBER() OVER (PARTITION BY a ORDER BY b DESC) rn FROM x) y WHERE rn = 1").collect

This will achieve what you need. Read more about window function support here: https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html

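For comparison, a roughly equivalent sketch using the DataFrame Window API instead of raw SQL (assuming Spark 1.6+, where row_number is exposed in org.apache.spark.sql.functions; column names as in the question):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, desc, row_number}

// Partition by "a", order by "b" descending, and keep only the top-ranked row per partition.
val w = Window.partitionBy("a").orderBy(desc("b"))
val result = df
  .withColumn("rn", row_number().over(w))
  .where(col("rn") === 1)
  .drop("rn")

result.show()

Like the SQL version, this keeps exactly one row per value of "a", choosing the row with the largest "b".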