Note: this page reproduces a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/30655914/


map RDD to PairRDD in Scala

java scala apache-spark rdd

Asked by Edamame

I am trying to map an RDD to a pair RDD in Scala so that I can use reduceByKey later. Here is what I did:

userRecords is of type org.apache.spark.rdd.RDD[UserElement]

I tried to create a pair RDD from userRecords as follows:

val userPairs: PairRDDFunctions[String, UserElement] = userRecords.map { t =>
  val nameKey: String = t.getName()
  (nameKey, t)
}

However, I got the error:

type mismatch; found : org.apache.spark.rdd.RDD[(String, com.mypackage.UserElement)] required: org.apache.spark.rdd.PairRDDFunctions[String,com.mypackage.UserElement]

What am I missing here? Thanks a lot!

Accepted answer by marios

I think you are just missing the import of org.apache.spark.SparkContext._. It brings into scope the implicit conversions needed to treat the result as a PairRDD.

The example below should work (assuming you have initialized a SparkContext under sc):

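import org.apache.spark.SparkContext._       // brings the RDD implicit conversions into scope
import org.apache.spark.rdd.PairRDDFunctions // the type named in the annotation below

val f = sc.parallelize(Array(1, 2, 3, 4, 5))
// The annotation makes the compiler apply the implicit RDD[(String, Int)] => PairRDDFunctions conversion
val g: PairRDDFunctions[String, Int] = f.map(x => (x.toString, x))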
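
To see why an import can make the annotation compile, here is a minimal Spark-free sketch of the same mechanism. MiniRDD, MiniPairFunctions and MiniImplicits are hypothetical stand-ins invented for illustration, not Spark classes; the only point is that an implicit conversion defined in a separate object applies only once that object's members are imported.

```scala
import scala.language.implicitConversions

// Hypothetical stand-in for RDD: a thin wrapper around a Seq.
class MiniRDD[T](val items: Seq[T]) {
  def map[U](f: T => U): MiniRDD[U] = new MiniRDD(items.map(f))
}

// Hypothetical stand-in for PairRDDFunctions: extra methods for key/value data.
class MiniPairFunctions[K, V](self: MiniRDD[(K, V)]) {
  def reduceByKey(f: (V, V) => V): Map[K, V] =
    self.items.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).reduce(f) }
}

// The conversion lives in a separate object, so it only applies after an import.
object MiniImplicits {
  implicit def toPairFunctions[K, V](rdd: MiniRDD[(K, V)]): MiniPairFunctions[K, V] =
    new MiniPairFunctions(rdd)
}

object ImplicitDemo {
  import MiniImplicits._ // without this import, neither `wrapped` nor `sums` compiles

  val pairs: MiniRDD[(String, Int)] =
    new MiniRDD(Seq(1, 2, 3, 2)).map(x => (x.toString, x))

  // Annotating with the wrapper type triggers the conversion...
  val wrapped: MiniPairFunctions[String, Int] = pairs
  // ...and so does calling a method that only the wrapper defines.
  val sums: Map[String, Int] = pairs.reduceByKey(_ + _)

  def main(args: Array[String]): Unit = println(sums)
}
```

Removing the import line reproduces a "type mismatch" of the same shape as the one in the question, because the compiler then has no way to turn a MiniRDD of tuples into a MiniPairFunctions.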

Answer by Justin Pihony

You don't need to do that, as it is done via implicits (specifically rddToPairRDDFunctions). Any RDD whose elements are of type Tuple2[K, V] can automatically be used as a PairRDDFunctions. If you REALLY want to, you can do explicitly what the implicit does and wrap the RDD in a PairRDDFunctions:

val pair = new PairRDDFunctions(rdd)
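
For the asker's actual goal (calling reduceByKey later), a sketch along the following lines may be all that is needed: annotate the value as an RDD of tuples, or leave the annotation off, and call reduceByKey directly. The UserElement case class, its score field, the bestPerName helper and the local[*] master are assumptions made up for this example; it also assumes spark-core is on the classpath.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // needed on old Spark versions; harmless on newer ones
import org.apache.spark.rdd.RDD

// Hypothetical stand-in for the asker's com.mypackage.UserElement.
case class UserElement(name: String, score: Int) {
  def getName(): String = name
}

object PairRddSketch {
  // Keeps the highest-scoring record for each user name.
  def bestPerName(userRecords: RDD[UserElement]): RDD[(String, UserElement)] = {
    // Annotate with the type the compiler infers: an RDD of (key, value) tuples.
    val userPairs: RDD[(String, UserElement)] =
      userRecords.map(t => (t.getName(), t))
    // reduceByKey is defined on PairRDDFunctions and reached through the implicit conversion.
    userPairs.reduceByKey((a, b) => if (a.score >= b.score) a else b)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("pair-rdd-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val userRecords = sc.parallelize(
        Seq(UserElement("ann", 1), UserElement("bob", 5), UserElement("ann", 7)))
      bestPerName(userRecords).collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

As far as I know, from Spark 1.3 onward these conversions live in the RDD companion object, so the SparkContext._ import is only required on earlier versions.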

Answer by Srini

You can also use the keyBy method; you only need to provide a function that extracts the key.

In your example, you can simply write userRecords.keyBy(t => t.getName())
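
A rough sketch of how keyBy could feed into reduceByKey is shown below. It counts records per name; the UserElement case class, the countPerName helper and the local[*] master are hypothetical and exist only to make the example self-contained (spark-core is assumed to be on the classpath).

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // needed on old Spark versions; harmless on newer ones
import org.apache.spark.rdd.RDD

// Hypothetical stand-in for the asker's com.mypackage.UserElement.
case class UserElement(name: String) {
  def getName(): String = name
}

object KeyBySketch {
  // Counts how many records share each name.
  def countPerName(userRecords: RDD[UserElement]): RDD[(String, Int)] = {
    // keyBy pairs every element with the key computed from it: RDD[(String, UserElement)].
    val userPairs: RDD[(String, UserElement)] = userRecords.keyBy(t => t.getName())
    // Replace each record with 1, then sum the ones per key.
    userPairs.mapValues(_ => 1).reduceByKey(_ + _)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("keyby-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val userRecords = sc.parallelize(
        Seq(UserElement("ann"), UserElement("bob"), UserElement("ann")))
      countPerName(userRecords).collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

keyBy(f) produces the same pairs as map(t => (f(t), t)), so choosing between the two is mostly a matter of readability.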