scala 在 Spark Dataframe 中,如何在两个数据框中获取重复记录和不同记录?

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/40026335/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-10-22 08:44:09  来源:igfitidea点击:

In Spark Dataframe how to get duplicate records and distinct records in two dataframes?

scalaapache-spark

提问by Shekhar

I am working on a problem in which I am loading data from a hive table into spark dataframe and now I want all the unique accts in 1 dataframe and all duplicates in another. for example if I have acct id 1,1,2,3,4. I want to get 2,3,4 in one dataframe and 1,1 in another. How can I do this?

我正在解决一个问题,我将 hive 表中的数据加载到 spark 数据帧中,现在我想要 1 个数据帧中的所有唯一 accts 和另一个数据帧中的所有重复项。例如,如果我的帐户 ID 为 1、1、2、3、4。我想在一个数据帧中获得 2,3,4,在另一个数据帧中获得 1,1。我怎样才能做到这一点?

采纳答案by KiranM

val acctDF = List(("1", "Acc1"), ("1", "Acc1"), ("1", "Acc1"), ("2", "Acc2"), ("2", "Acc2"), ("3", "Acc3")).toDF("AcctId", "Details")
scala> acctDF.show()
+------+-------+
|AcctId|Details|
+------+-------+
|     1|   Acc1|
|     1|   Acc1|
|     1|   Acc1|
|     2|   Acc2|
|     2|   Acc2|
|     3|   Acc3|
+------+-------+

val countsDF = acctDF.map(rec => (rec(0), 1)).reduceByKey(_+_).map(rec=> (rec._1.toString, rec._2)).toDF("AcctId", "AcctCount")

val accJoinedDF = acctDF.join(countsDF, acctDF("AcctId")===countsDF("AcctId"), "left_outer").select(acctDF("AcctId"), acctDF("Details"), countsDF("AcctCount"))

scala> accJoinedDF.show()
+------+-------+---------+   
|AcctId|Details|AcctCount|
+------+-------+---------+
|     1|   Acc1|        3|
|     1|   Acc1|        3|
|     1|   Acc1|        3|
|     2|   Acc2|        2|
|     2|   Acc2|        2|
|     3|   Acc3|        1|
+------+-------+---------+


val distAcctDF = accJoinedDF.filter($"AcctCount"===1)
scala> distAcctDF.show()
+------+-------+---------+   
|AcctId|Details|AcctCount|
+------+-------+---------+
|     3|   Acc3|        1|
+------+-------+---------+

val duplAcctDF = accJoinedDF.filter($"AcctCount">1)
scala> duplAcctDF.show()
+------+-------+---------+                 
|AcctId|Details|AcctCount|
+------+-------+---------+
|     1|   Acc1|        3|
|     1|   Acc1|        3|
|     1|   Acc1|        3|
|     2|   Acc2|        2|
|     2|   Acc2|        2|
+------+-------+---------+

(OR scala> duplAcctDF.distinct.show() )

回答by valearner

Depending on the version of spark you have, you could use window functions in datasets/sql like below:

根据您拥有的 spark 版本,您可以在 datasets/sql 中使用窗口函数,如下所示:

Dataset<Row> New = df.withColumn("Duplicate", count("*").over( Window.partitionBy("id") ) );

Dataset<Row> Dups = New.filter(col("Duplicate").gt(1));

Dataset<Row> Uniques = New.filter(col("Duplicate").equalTo(1));

the above is written in java. should be similar in scala and read this on how to do in python. https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html

上面是用java写的。在 scala 中应该是类似的,并阅读有关如何在 python 中执行的内容。 https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html

回答by Akshat Chaturvedi

df.groupBy($"field1",$"field2"...).count.filter($"count">1).show()

df.groupBy($"field1",$"field2"...).count.filter($"count">1).show()