map vs mapValues in Spark (Scala)
Disclaimer: this page is an English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/36696326/
map vs mapValues in Spark
Asked by jtitusj
I'm currently learning Spark and developing custom machine learning algorithms. My question is: what is the difference between .map() and .mapValues(), and in what cases do I clearly have to use one instead of the other?
Answered by Tzach Zohar
mapValues is only applicable for PairRDDs, meaning RDDs of the form RDD[(A, B)]. In that case, mapValues operates on the value only (the second part of the tuple), while map operates on the entire record (tuple of key and value).
In other words, given f: B => C and rdd: RDD[(A, B)], these two are identical (almost - see the comment at the bottom):
val result: RDD[(A, C)] = rdd.map { case (k, v) => (k, f(v)) }
val result: RDD[(A, C)] = rdd.mapValues(f)
The latter is simply shorter and clearer, so when you just want to transform the values and keep the keys as-is, it's recommended to use mapValues.
On the other hand, if you want to transform the keys too (e.g. you want to apply f: (A, B) => C), you simply can't use mapValues because it would only pass the values to your function.
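To make the equivalence concrete without needing a cluster, the same semantics can be sketched on a plain Scala Map (an analogue, not Spark itself; note that on Scala 2.13, Map.mapValues returns a lazy view, hence the .toMap):

```scala
// Plain-Scala analogue of the two equivalent Spark expressions above
val pairs = Map("a" -> 1, "b" -> 2, "c" -> 3)
val f: Int => Int = _ * 10

// map sees the whole (key, value) tuple, so the key must be rebuilt by hand
val viaMap = pairs.map { case (k, v) => (k, f(v)) }

// mapValues sees only the value; .toMap forces the lazy view on Scala 2.13
val viaMapValues = pairs.mapValues(f).toMap

assert(viaMap == viaMapValues)
```

Both expressions produce Map(a -> 10, b -> 20, c -> 30); the mapValues version is simply shorter because it never touches the keys.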
The last difference concerns partitioning: if you applied any custom partitioning to your RDD (e.g. using partitionBy), using map would "forget" that partitioner (the result will revert to default partitioning) as the keys might have changed; mapValues, however, preserves any partitioner set on the RDD.
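A minimal sketch of that partitioning behaviour, assuming Spark is on the classpath and a live SparkContext named sc exists (not runnable standalone):

```scala
import org.apache.spark.HashPartitioner

// assumes an existing SparkContext `sc`
val rdd = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3)))
  .partitionBy(new HashPartitioner(4))

rdd.partitioner       // Some(HashPartitioner)

// map might change the keys, so Spark drops the partitioner
val mapped = rdd.map { case (k, v) => (k, v + 1) }
mapped.partitioner    // None

// mapValues cannot change the keys, so the partitioner is kept
val mapVals = rdd.mapValues(_ + 1)
mapVals.partitioner   // Some(HashPartitioner)
```

Losing the partitioner matters because a later key-based operation (e.g. reduceByKey or join) on the map result may trigger an extra shuffle that the mapValues result would avoid.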
Answered by Ram Ghadiyaram
map takes a function that transforms each element of a collection:
map(f: T => U)
RDD[T] => RDD[U]
When T is a tuple, we may want to act only on the values, not the keys.
mapValues takes a function that maps the values in the input to the values in the output: mapValues(f: V => W), where RDD[(K, V)] => RDD[(K, W)]
Tip: use mapValues when the data is partitioned by key, so you can avoid a reshuffle.
Answered by vaquar khan
When we use map() with a pair RDD, we get access to both the key and the value. Sometimes we are only interested in accessing the value (not the key). In those cases, we can use mapValues() instead of map().
Example of mapValues
val inputrdd = sc.parallelize(Seq(("maths", 50), ("maths", 60), ("english", 65)))
// pair each mark with a count of 1 so marks and counts can be summed per key
val mapped = inputrdd.mapValues(mark => (mark, 1))
val reduced = mapped.reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2))
reduced.collect
Array[(String, (Int, Int))] = Array((english,(65,1)), (maths,(110,2)))
val average = reduced.map { x =>
val temp = x._2
val total = temp._1
val count = temp._2
(x._1, total / count)
}
average.collect()
res1: Array[(String, Int)] = Array((english,65), (maths,55))
Answered by Kumar Nishikant
val inputrdd = sc.parallelize(Seq(("india", 250), ("england", 260), ("england", 180)))
(1)
map():
val mapresult= inputrdd.map{b=> (b,1)}
mapresult.collect
Result: Array(((india,250),1), ((england,260),1), ((england,180),1))
(2)
mapValues():
val mapValuesResult = inputrdd.mapValues(b => (b, 1))
mapValuesResult.collect
Result:
Array((india,(250,1)), (england,(260,1)), (england,(180,1)))

