
Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use it, you must follow the same CC BY-SA license, note the original address and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/36696326/

Date: 2020-10-22 08:11:07  Source: igfitidea

map vs mapValues in Spark

Tags: scala, apache-spark

Asked by jtitusj

I'm currently learning Spark and developing custom machine learning algorithms. My question is: what is the difference between .map() and .mapValues(), and in what cases do I clearly have to use one instead of the other?


Answered by Tzach Zohar

mapValues is only applicable for PairRDDs, meaning RDDs of the form RDD[(A, B)]. In that case, mapValues operates on the value only (the second part of the tuple), while map operates on the entire record (tuple of key and value).


In other words, given f: B => C and rdd: RDD[(A, B)], these two are identical (almost - see comment at the bottom):


val result: RDD[(A, C)] = rdd.map { case (k, v) => (k, f(v)) }

val result: RDD[(A, C)] = rdd.mapValues(f)

The latter is simply shorter and clearer, so when you just want to transform the values and keep the keys as-is, it's recommended to use mapValues.


On the other hand, if you want to transform the keys too (e.g. you want to apply f: (A, B) => C), you simply can't use mapValues because it would only pass the values to your function.


The last difference concerns partitioning: if you applied any custom partitioning to your RDD (e.g. using partitionBy), using map would "forget" that partitioner (the result will revert to default partitioning) as the keys might have changed; mapValues, however, preserves any partitioner set on the RDD.

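A small sketch of this partitioning difference, assuming an existing SparkContext named sc (e.g. inside spark-shell); the variable names here are illustrative, not from the original answer:

```scala
import org.apache.spark.HashPartitioner

// Build a pair RDD with an explicit partitioner.
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3)))
              .partitionBy(new HashPartitioner(4))

// mapValues cannot change keys, so Spark keeps the HashPartitioner.
val kept = pairs.mapValues(_ + 1)
// kept.partitioner is still Some(HashPartitioner)

// map could have changed keys, so Spark drops the partitioner,
// even though this particular function leaves the keys untouched.
val dropped = pairs.map { case (k, v) => (k, v + 1) }
// dropped.partitioner is None
```

A subsequent key-based operation such as reduceByKey on kept can reuse the existing partitioning, whereas on dropped it would trigger a full shuffle.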

Answered by Ram Ghadiyaram

maptakes a function that transforms each element of a collection:

map采用一个函数来转换集合的每个元素:

 map(f: T => U)
RDD[T] => RDD[U]

When T is a tuple, we may want to act only on the values, not the keys. mapValues takes a function that maps the values in the input to the values in the output: mapValues(f: V => W), where RDD[(K, V)] => RDD[(K, W)]

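For instance (a sketch, assuming an existing SparkContext named sc; the names are illustrative), mapping Int marks to String labels changes only the value type:

```scala
// marks: RDD[(String, Int)]
val marks = sc.parallelize(Seq(("maths", 50), ("english", 65)))

// mapValues(f: Int => String) gives RDD[(String, String)];
// the keys (and the key type K = String) are untouched.
val labels = marks.mapValues(m => if (m >= 60) "pass" else "fail")
```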

Tip: use mapValues where possible, so you can avoid a reshuffle when the data is partitioned by key.


Answered by vaquar khan

When we use map() with a Pair RDD, we get access to both the key and the value. At times we are only interested in accessing the value (and not the key). In those cases, we can use mapValues() instead of map().


Example of mapValues


val inputrdd = sc.parallelize(Seq(("maths", 50), ("maths", 60), ("english", 65)))
val mapped = inputrdd.mapValues(mark => (mark, 1));

// sum the marks and the counts per key
val reduced = mapped.reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2))

reduced.collect

Array[(String, (Int, Int))] = Array((english,(65,1)), (maths,(110,2)))


val average = reduced.map { x =>
                           val temp = x._2
                           val total = temp._1
                           val count = temp._2
                           (x._1, total / count)
                           }

average.collect()

res1: Array[(String, Int)] = Array((english,65), (maths,55))


Answered by Kumar Nishikant

val inputrdd = sc.parallelize(Seq(("india", 250), ("england", 260), ("england", 180)))

(1)


map():-

val mapresult= inputrdd.map{b=> (b,1)}
mapresult.collect

Result: Array(((india,250),1), ((england,260),1), ((england,180),1))

(2)


mapValues():-

val mapValuesResult= inputrdd.mapValues(b => (b, 1));
mapValuesResult.collect

Result:

Array((india,(250,1)), (england,(260,1)), (england,(180,1)))