
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA terms and attribute it to the original authors (not me), linking the original: http://stackoverflow.com/questions/27395420/

Date: 2020-10-22 06:46:45  Source: igfitidea

Concatenating datasets of different RDDs in Apache spark using scala

Tags: scala, apache-spark, apache-spark-sql, distributed-computing, rdd

Asked by Atom

Is there a way to concatenate the datasets of two different RDDs in Spark?


The requirement is: I create two intermediate RDDs using Scala which have the same column names, and I need to combine the results of both RDDs and cache the result for access from the UI. How do I combine the datasets here?


The RDDs are of type spark.sql.SchemaRDD.


Answered by maasg

I think you are looking for RDD.union.


val rddPart1 = ???
val rddPart2 = ???
val rddAll = rddPart1.union(rddPart2)

Example (on Spark-shell)


val rdd1 = sc.parallelize(Seq((1, "Aug", 30),(1, "Sep", 31),(2, "Aug", 15),(2, "Sep", 10)))
val rdd2 = sc.parallelize(Seq((1, "Oct", 10),(1, "Nov", 12),(2, "Oct", 5),(2, "Nov", 15)))
rdd1.union(rdd2).collect

res0: Array[(Int, String, Int)] = Array((1,Aug,30), (1,Sep,31), (2,Aug,15), (2,Sep,10), (1,Oct,10), (1,Nov,12), (2,Oct,5), (2,Nov,15))
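Since the question is about SchemaRDDs and caching the combined result for a UI, here is a minimal sketch of how the two pieces fit together in the Spark 1.x SchemaRDD era that the question refers to. The table names t1 and t2 and the column names are hypothetical placeholders, not from the original question:

```scala
// Hedged sketch, assuming a Spark 1.x sqlContext and two hypothetical
// registered tables t1 and t2 with identical column layouts.
val part1 = sqlContext.sql("SELECT id, month, value FROM t1")
val part2 = sqlContext.sql("SELECT id, month, value FROM t2")

// unionAll concatenates the rows of both SchemaRDDs (schemas must match).
val combined = part1.unionAll(part2)

// Cache the result so repeated UI queries don't recompute it,
// and expose it as a temp table for further SQL access.
combined.cache()
combined.registerTempTable("combined")
```

In later Spark versions the same idea applies to DataFrames, where unionAll was eventually deprecated in favor of union.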

Answered by Josep Curto Díaz

I had the same problem. To combine by row instead of by column, use unionAll:


val rddPart1= ???
val rddPart2= ???
val rddAll = rddPart1.unionAll(rddPart2)
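One caveat worth knowing: despite the SQL-sounding name, neither union nor unionAll deduplicates rows; both simply concatenate the partitions. If the two parts may overlap and you want set-like semantics, a distinct pass is the usual follow-up. A short sketch with placeholder inputs:

```scala
// Sketch: both parts may contain the same row (1, "Aug", 30).
val p1 = sc.parallelize(Seq((1, "Aug", 30), (1, "Sep", 31)))
val p2 = sc.parallelize(Seq((1, "Aug", 30), (2, "Oct", 5)))

// union keeps duplicates: 4 rows, including (1, "Aug", 30) twice.
val all = p1.union(p2)

// distinct removes them, at the cost of a shuffle: 3 rows remain.
val deduped = all.distinct()
```

Note that distinct triggers a shuffle across the cluster, so apply it only when duplicates actually matter for your result.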

I found it after reading the method summary for DataFrame. More information at: https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/DataFrame.html
