How to flatten a collection with Spark/Scala?

Notice: this page reproduces a popular Stack Overflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must comply with the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Stack Overflow original: http://stackoverflow.com/questions/23138352/

Time: 2020-10-22 06:12:07  Source: igfitidea

How to flatten a collection with Spark/Scala?

Tags: scala, apache-spark

Asked by blue-sky

In Scala I can flatten a collection using:

val array = Array(List("1,2,3").iterator, List("1,4,5").iterator)
//> array  : Array[Iterator[String]] = Array(non-empty iterator, non-empty iterator)

array.toList.flatten
//> res0: List[String] = List(1,2,3, 1,4,5)

But how can I do something similar in Spark?

Reading the API doc at http://spark.apache.org/docs/0.7.3/api/core/index.html#spark.RDD, there does not seem to be a method that provides this functionality.

Answered by samthebest

Use flatMap with identity from Predef; this is more readable than using x => x, e.g.

myRdd.flatMap(identity)
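
To see why this works, it may help to compare the three spellings on an ordinary Scala collection, where no Spark dependency is needed. The sketch below is only an illustration: the object name FlattenSketch and the sample data are assumptions, not part of the original answer.

```scala
object FlattenSketch {
  // A nested local collection standing in for the elements of an RDD.
  val nested: List[List[String]] = List(List("a"), List("b"), List("c", "d"))

  // flatten is available on local collections.
  val viaFlatten: List[String] = nested.flatten

  // flatMap with Predef.identity maps each inner list to itself,
  // then concatenates the results.
  val viaIdentity: List[String] = nested.flatMap(identity)

  // The explicit lambda is equivalent to identity.
  val viaLambda: List[String] = nested.flatMap(y => y)

  def main(args: Array[String]): Unit = {
    println(viaFlatten)
    println(viaIdentity)
    println(viaLambda)
  }
}
```

All three values hold the same elements in the same order, so flatMap(identity) can stand in for flatten wherever only flatMap is available.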

Answered by Josh Rosen

Try flatMap with an identity map function (y => y):

scala> val x = sc.parallelize(List(List("a"), List("b"), List("c", "d")))
x: org.apache.spark.rdd.RDD[List[String]] = ParallelCollectionRDD[1] at parallelize at <console>:12

scala> x.collect()
res0: Array[List[String]] = Array(List(a), List(b), List(c, d))

scala> x.flatMap(y => y)
res3: org.apache.spark.rdd.RDD[String] = FlatMappedRDD[3] at flatMap at <console>:15

scala> x.flatMap(y => y).collect()
res4: Array[String] = Array(a, b, c, d)
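
The session above relies on the sc value that the Spark shell predefines. Outside the shell, roughly the same steps could look like the following minimal sketch. It assumes spark-core is on the classpath and runs in local mode; the object name FlattenRddExample, the run helper, the app name, and the local[2] master are illustrative choices, not something the original answers specify.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object FlattenRddExample {
  // Builds a local SparkContext, flattens an RDD of lists, and returns the result.
  def run(): Array[String] = {
    // Local-mode settings chosen only for this illustration.
    val conf = new SparkConf().setAppName("flatten-example").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // RDD[List[String]], the same shape as x in the shell session.
      val nested = sc.parallelize(List(List("a"), List("b"), List("c", "d")))
      // flatMap with identity yields an RDD[String].
      val flat = nested.flatMap(identity)
      flat.collect()
    } finally {
      // Always release the context, even if the job fails.
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit =
    println(run().mkString(", "))
}
```

Note that this sketch uses lists rather than the iterators from the question, since RDD elements generally need to be serializable.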