map vs mapValues in Spark (Scala)
Disclaimer: this page is an English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/36696326/
map vs mapValues in Spark
Asked by jtitusj
I'm currently learning Spark and developing custom machine learning algorithms. My question is: what is the difference between .map() and .mapValues(), and in what cases do I clearly have to use one instead of the other?
Answered by Tzach Zohar
mapValues is only applicable for PairRDDs, meaning RDDs of the form RDD[(A, B)]. In that case, mapValues operates on the value only (the second part of the tuple), while map operates on the entire record (tuple of key and value).
In other words, given f: B => C and rdd: RDD[(A, B)], these two are identical (almost - see the comment at the bottom):
val result: RDD[(A, C)] = rdd.map { case (k, v) => (k, f(v)) }
val result: RDD[(A, C)] = rdd.mapValues(f)
The latter is simply shorter and clearer, so when you just want to transform the values and keep the keys as-is, it's recommended to use mapValues.
On the other hand, if you want to transform the keys too (e.g. you want to apply f: (A, B) => C), you simply can't use mapValues because it would only pass the values to your function.
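To make the equivalence concrete without needing a cluster, the same semantics can be sketched on a plain Scala Map (an analogue, not Spark itself; note that on Scala 2.13, Map.mapValues returns a lazy view, hence the .toMap):

```scala
// Plain-Scala analogue of the two equivalent Spark expressions above
val pairs = Map("a" -> 1, "b" -> 2, "c" -> 3)
val f: Int => Int = _ * 10

// map sees the whole (key, value) tuple, so the key must be rebuilt by hand
val viaMap = pairs.map { case (k, v) => (k, f(v)) }

// mapValues sees only the value; .toMap forces the lazy view on Scala 2.13
val viaMapValues = pairs.mapValues(f).toMap

assert(viaMap == viaMapValues)
```

Both expressions produce Map(a -> 10, b -> 20, c -> 30); the mapValues version is simply shorter because it never touches the keys.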
The last difference concerns partitioning: if you applied any custom partitioning to your RDD (e.g. using partitionBy), using map would "forget" that partitioner (the result will revert to default partitioning) as the keys might have changed; mapValues, however, preserves any partitioner set on the RDD.
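A minimal sketch of that partitioning behaviour, assuming Spark is on the classpath and a live SparkContext named sc exists (not runnable standalone):

```scala
import org.apache.spark.HashPartitioner

// assumes an existing SparkContext `sc`
val rdd = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3)))
  .partitionBy(new HashPartitioner(4))

rdd.partitioner       // Some(HashPartitioner)

// map might change the keys, so Spark drops the partitioner
val mapped = rdd.map { case (k, v) => (k, v + 1) }
mapped.partitioner    // None

// mapValues cannot change the keys, so the partitioner is kept
val mapVals = rdd.mapValues(_ + 1)
mapVals.partitioner   // Some(HashPartitioner)
```

Losing the partitioner matters because a later key-based operation (e.g. reduceByKey or join) on the map result may trigger an extra shuffle that the mapValues result would avoid.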
Answered by Ram Ghadiyaram
map takes a function that transforms each element of a collection:
map(f: T => U)
RDD[T] => RDD[U]
When T is a tuple, we may want to act only on the values, not the keys.
mapValues takes a function that maps the values in the input to the values in the output: mapValues(f: V => W), where RDD[(K, V)] => RDD[(K, W)]
Tip: use mapValues when the data is partitioned by key, so you can avoid a reshuffle.
Answered by vaquar khan
When we use map() with a pair RDD, we get access to both the key and the value. Sometimes we are only interested in accessing the value (not the key). In those cases, we can use mapValues() instead of map().
Example of mapValues
val inputrdd = sc.parallelize(Seq(("maths", 50), ("maths", 60), ("english", 65)))
// pair each mark with a count of 1 so marks and counts can be summed per key
val mapped = inputrdd.mapValues(mark => (mark, 1))
val reduced = mapped.reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2))
reduced.collect
Array[(String, (Int, Int))] = Array((english,(65,1)), (maths,(110,2)))
val average = reduced.map { x =>
val temp = x._2
val total = temp._1
val count = temp._2
(x._1, total / count)
}
average.collect()
res1: Array[(String, Int)] = Array((english,65), (maths,55))
Answered by Kumar Nishikant
val inputrdd = sc.parallelize(Seq(("india", 250), ("england", 260), ("england", 180)))
(1)
map():
val mapresult= inputrdd.map{b=> (b,1)}
mapresult.collect
Result: Array(((india,250),1), ((england,260),1), ((england,180),1))
(2)
mapValues():
val mapValuesResult = inputrdd.mapValues(b => (b, 1))
mapValuesResult.collect
Result:
Array((india,(250,1)), (england,(260,1)), (england,(180,1)))

