value join is not a member of org.apache.spark.rdd.RDD

Note: this page is a translation of a popular StackOverflow Q&A, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same CC BY-SA terms and attribute it to the original authors (not me): StackOverflow. Original: http://stackoverflow.com/questions/29265616/

Tags: scala, apache-spark

Asked by sds

I get this error:

value join is not a member of 
    org.apache.spark.rdd.RDD[(Long, (Int, (Long, String, Array[_0])))
        forSome { type _0 <: (String, Double) }]

The only suggestion I found is import org.apache.spark.SparkContext._, which I am already doing.

What am I doing wrong?

EDIT: changing the code to eliminate the forSome (i.e., so that the object has the type org.apache.spark.rdd.RDD[(Long, (Int, (Long, String, Array[(String, Double)])))]) solved the problem. Is this a bug in Spark?

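For illustration, a sketch of one way to eliminate the forSome without touching the upstream code (r stands for the offending RDD; the cast is safe at runtime because the RDD's type parameter is erased, and every _0 is a (String, Double)):

val concrete = r.asInstanceOf[org.apache.spark.rdd.RDD[(Long, (Int, (Long, String, Array[(String, Double)])))]]
concrete.join(concrete) // compiles: K = Long, V = (Int, (Long, String, Array[(String, Double)]))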

Answered by Daniel Darabos

join is a member of org.apache.spark.rdd.PairRDDFunctions. So why does the implicit class not trigger?

scala> val s = Seq[(Long, (Int, (Long, String, Array[_0]))) forSome { type _0 <: (String, Double) }]()
scala> val r = sc.parallelize(s)
scala> r.join(r) // Gives your error message.
scala> val p = new org.apache.spark.rdd.PairRDDFunctions(r)
<console>:25: error: no type parameters for constructor PairRDDFunctions: (self: org.apache.spark.rdd.RDD[(K, V)])(implicit kt: scala.reflect.ClassTag[K], implicit vt: scala.reflect.ClassTag[V], implicit ord: Ordering[K])org.apache.spark.rdd.PairRDDFunctions[K,V] exist so that it can be applied to arguments (org.apache.spark.rdd.RDD[(Long, (Int, (Long, String, Array[_0]))) forSome { type _0 <: (String, Double) }])
 --- because ---
argument expression's type is not compatible with formal parameter type;
 found   : org.apache.spark.rdd.RDD[(Long, (Int, (Long, String, Array[_0]))) forSome { type _0 <: (String, Double) }]
 required: org.apache.spark.rdd.RDD[(?K, ?V)]
Note: (Long, (Int, (Long, String, Array[_0]))) forSome { type _0 <: (String, Double) } >: (?K, ?V), but class RDD is invariant in type T.
You may wish to define T as -T instead. (SLS 4.5)
       val p = new org.apache.spark.rdd.PairRDDFunctions(r)
               ^
<console>:25: error: type mismatch;
 found   : org.apache.spark.rdd.RDD[(Long, (Int, (Long, String, Array[_0]))) forSome { type _0 <: (String, Double) }]
 required: org.apache.spark.rdd.RDD[(K, V)]
       val p = new org.apache.spark.rdd.PairRDDFunctions(r)

I'm sure that error message is clear to everyone else, but just for my own slow self let's try to make sense of it. PairRDDFunctions has two type parameters, K and V. Your forSome covers the whole pair, so it cannot be split into separate K and V types. There are no K and V such that RDD[(K, V)] would equal your RDD type.

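For reference, the implicit conversion the compiler is looking for has roughly this shape in the Spark 1.x sources (it matches the constructor signature in the error above); K and V must be inferable as two separate types, which the pair-wide forSome prevents:

implicit def rddToPairRDDFunctions[K, V](rdd: RDD[(K, V)])
    (implicit kt: ClassTag[K], vt: ClassTag[V], ord: Ordering[K] = null): PairRDDFunctions[K, V]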

However, you could have the forSome apply only to the value, instead of the whole pair. Join works then, because this type can be separated into K and V.

scala> val s2 = Seq[(Long, (Int, (Long, String, Array[_0])) forSome { type _0 <: (String, Double) })]()
scala> val r2 = sc.parallelize(s2)
scala> r2.join(r2)
res0: org.apache.spark.rdd.RDD[(Long, ((Int, (Long, String, Array[_0])) forSome { type _0 <: (String, Double) }, (Int, (Long, String, Array[_0])) forSome { type _0 <: (String, Double) }))] = MapPartitionsRDD[5] at join at <console>:26

Answered by V Jaiswal

Consider two Spark RDDs that are to be joined together.

Say rdd1.first is of the form (Int, Int, Float) = (1,957,299.98), while rdd2.first is something like (Int, Int) = (25876,1), and the join is supposed to take place on the first field of both RDDs.

scala> rdd1.join(rdd2) --- results in an error:
**: error: value join is not a member of org.apache.spark.rdd.RDD[(Int, Int, Float)]

REASON

Both RDDs must be in the form of key-value pairs.

Here rdd1, being in the form of (1,957,299.98), does not obey this rule, while rdd2, which is in the form of (25876,1), does.

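A minimal sketch of the rule, with made-up data:

scala> val pairs = sc.parallelize(Seq((25876, 1)))           // RDD[(Int, Int)] -- a pair, so join is available
scala> val triples = sc.parallelize(Seq((1, 957, 299.98f)))  // RDD[(Int, Int, Float)] -- not a pair
scala> pairs.join(pairs)    // compiles
scala> triples.join(pairs)  // error: value join is not a member of org.apache.spark.rdd.RDD[(Int, Int, Float)]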

RESOLUTION

Convert the output of the first RDD from (1,957,299.98) to a key-value pair of the form (1,(957,299.98)) before joining it with rdd2, as shown below:

scala> val rdd1KV = rdd1.map(x => (x.split(",")(0).toInt, (x.split(",")(1).toInt, x.split(",")(2).toFloat))) -- modified RDD, keyed by the first field

scala> rdd1KV.join(rdd2).first -- join successful :)
res**: (Int, ((Int, Float), Int)) = (1,((957,299.98),1))
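
Putting it together as a self-contained sketch (the sample line and the matching key are made up for illustration; rdd1 is assumed to hold raw comma-separated strings):

val rdd1 = sc.parallelize(Seq("1,957,299.98")) // raw CSV-style lines
val rdd2 = sc.parallelize(Seq((1, 1)))         // already an RDD[(Int, Int)]
val rdd1KV = rdd1.map { line =>
  val f = line.split(",")
  (f(0).toInt, (f(1).toInt, f(2).toFloat))     // key by the first field
}
rdd1KV.join(rdd2).collect()                    // Array((1,((957,299.98),1)))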

By the way, join is a member of org.apache.spark.rdd.PairRDDFunctions, so make sure the corresponding import is in place in Eclipse or whichever IDE you run your code from, as sketched below.

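On older Spark releases (before roughly 1.3, when the conversion lived in the SparkContext companion object) that import is:

import org.apache.spark.SparkContext._ // brings rddToPairRDDFunctions into scope

On newer releases the implicit is defined on the RDD companion object and is found automatically, so no import should be needed.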

Article also on my blog:

https://tips-to-code.blogspot.com/2018/08/apache-spark-error-resolution-value.html
