Join two RDDs in Spark

Disclaimer: this page is an English rendering of a popular StackOverflow question and answer, provided under the CC BY-SA 4.0 license. If you use or share it you must do so under the same license and attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/33321704/

scala, apache-spark

Asked by sri hari kali charan Tummala

I have two RDDs: one has just one column, the other has two. To join the two RDDs on their keys I have added a dummy value of 0. Is there a more efficient way of doing this with join?

val lines = sc.textFile("ml-100k/u.data")
val movienamesfile = sc.textFile("ml-100k/u.item")

// u.data is tab-separated: user id, movie id, rating, timestamp
val moviesid = lines.map(x => x.split("\t")).map(x => (x(1), 0))
val test = moviesid.map(x => x._1)
// u.item is pipe-separated: movie id, movie title, ...; | must be escaped in the regex
val movienames = movienamesfile.map(x => x.split("\\|")).map(x => (x(0), x(1)))
val shit = movienames.join(moviesid).distinct()

Edit:

Let me recast this question in SQL. Say, for example, I have table1 (movieid) and table2 (movieid, moviename). In SQL we would write something like:

select moviename, movieid, count(1)
from table2 inner join table1 on table1.movieid = table2.movieid
group by ....

Here in SQL, table1 has only one column whereas table2 has two, and the join still works. In the same way, can Spark join on keys from both RDDs?

Answered by zero323

The join operation is defined only on pairwise RDDs, which are quite different from a relation / table in SQL. Each element of a pairwise RDD is a Tuple2 where the first element is the key and the second is the value. Both can contain complex objects as long as the key provides a meaningful hashCode.

If you want to think about this in a SQL-ish way, you can consider the key as everything that goes into the ON clause, while the value contains the selected columns.

SELECT table1.value, table2.value
FROM table1 JOIN table2 ON table1.key = table2.key
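
For comparison, here is a minimal sketch of the same join as a pairwise-RDD operation (the table names and sample values are made up for illustration):

val table1 = sc.parallelize(Seq((1, "a"), (2, "b")))  // (key, value)
val table2 = sc.parallelize(Seq((1, "x"), (3, "y")))

// Inner join on the key; each result is (key, (value1, value2))
table1.join(table2).collect()  // Array((1, (a, x)))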

While these approaches look similar at first glance, and you can express one using the other, there is one fundamental difference. When you look at a SQL table and ignore constraints, all columns belong to the same class of objects, while the key and the value in a pairwise RDD each have a clear meaning.

Going back to your problem: to use join you need both a key and a value. Arguably much cleaner than using 0 as a placeholder would be to use the null singleton, but there is really no way around it.

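For example, a minimal sketch of the null-placeholder variant, reusing the names from the question:

val moviesid = lines
  .map(_.split("\t"))
  .map(x => (x(1), null))  // dummy value: null instead of 0

// After the join the placeholder can simply be dropped
movienames.join(moviesid).map { case (id, (name, _)) => (id, name) }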

For small data you can use a filter in a way similar to a broadcast join:

val moviesidBD = sc.broadcast(
  lines.map(_.split("\t")).map(_(1)).collect.toSet)  // movie id is the second column of u.data

movienames.filter { case (id, _) => moviesidBD.value contains id }
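
Since the broadcast set is shipped to each executor once and the filter runs locally, this avoids the shuffle that a regular join would trigger.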

But if you really want SQL-ish joins then you should simply use Spark SQL.

// toDF on an RDD requires the SQLContext implicits in scope
import sqlContext.implicits._

val movieIdsDf = lines
  .map(x => x.split("\t"))
  .map(a => Tuple1(a(1)))  // movie id is the second column
  .toDF("id")

val movienamesDf = movienames.toDF("id", "name")

// Add an optional join type as a third argument if needed
movienamesDf.join(movieIdsDf, movieIdsDf("id") <=> movienamesDf("id"))
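
To reproduce the aggregate from the SQL in the question, one possible follow-up (a sketch using the DataFrames just defined):

movienamesDf
  .join(movieIdsDf, movieIdsDf("id") <=> movienamesDf("id"))
  .groupBy(movienamesDf("name"), movienamesDf("id"))
  .count()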