Original URL: http://stackoverflow.com/questions/40892080/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverFlow
How to use mapPartitions in Spark Scala?
Asked by Spar
I have DocsRDD : RDD[(String, String)]
val DocsRDD = sc.wholeTextFiles("myDirectory/*" , 2)
DocsRDD:
Doc1.txt , bla bla bla .....\n bla bla bla \n bla ... bla
Doc2.txt , bla bla bla .....bla \n bla bla \n bla ... bla
Doc3.txt , bla bla bla .....\n bla bla bla \n bla ... bla
Doc4.txt , bla bla \n .....\n bla bla bla bla \n ... bla
Is there an efficient, elegant way to extract n-grams from these with mapPartitions? So far I have tried everything; I have read everything I could find about mapPartitions at least 5 times over, but I still cannot understand how to use it! It seems way too difficult to manipulate. In short I want:
val NGramsRDD = DocsRDD.map(x => (x._1 , x._2.sliding(n) ) )
but efficiently with mapPartitions. My basic misunderstanding of mapPartitions is:
OneDocRDD : RDD[String]
val OneDocRDD = sc.textFile("myDoc1.txt" , 2)
.mapPartitions(s1 : Iterator[String] => s2 : Iterator[String])
I cannot understand this! Since when is s1 an Iterator[String]? s1 is a String after sc.textFile.
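(For reference, a minimal sketch of what that signature means in valid Scala: sc.textFile yields an RDD[String] of individual lines, and mapPartitions calls your function once per partition, handing it all of that partition's lines bundled into one Iterator[String]; the upper-casing below is just an arbitrary stand-in transformation.)
val OneDocRDD = sc.textFile("myDoc1.txt", 2)
// The function runs once per partition, not once per line:
// s1 iterates over every line that fell into this partition,
// which is why its type is Iterator[String].
val upperCased = OneDocRDD.mapPartitions { s1 =>
  s1.map(_.toUpperCase)  // must itself return an Iterator
}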
Alright, my second question is: will mapPartitions give me a performance improvement over map in this situation?
Last but not least: can f() be:
f(Iterator[String]) : Iterator[Something else?]
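(A minimal sketch suggesting the answer is yes: the function passed to mapPartitions may return an Iterator of any element type; the (String, Int) pairs below are just an arbitrary illustration.)
// f : Iterator[String] => Iterator[(String, Int)]: the output element
// type is free to differ from the input element type.
val lineLengths = OneDocRDD.mapPartitions { lines =>
  lines.map(line => (line, line.length))
}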
Answered by Pascal Soucy
I'm not sure that .mapPartitions will help (at least, not in the given example), but using .mapPartitions would look like:
val DocsRDD = sc.wholeTextFiles("myDirectory/*", 2)  // RDD[(fileName, content)]
val NGramsRDD = DocsRDD.mapPartitions(iter => {
  // here you can initialize objects that you would need,
  // created once per partition and not for each x in the map
  iter.map(x => (x._1, x._2.sliding(n)))
})
Normally you want to use .mapPartitions to create/initialize an object that you don't want to (example: too big) or can't serialize and ship to the worker nodes. Without .mapPartitions you would need to create it in the .map, but that would not be efficient since the object would be created for each x.
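(To make that concrete, a minimal sketch assuming a hypothetical NGramExtractor class that is expensive to build and not serializable; inside .mapPartitions it is constructed once per partition on the worker, instead of once per element as it would be inside .map.)
// Hypothetical expensive, non-serializable helper (illustration only).
class NGramExtractor(n: Int) {
  def extract(text: String): Iterator[String] = text.sliding(n)
}
val nGramsPerDoc = DocsRDD.mapPartitions { iter =>
  val extractor = new NGramExtractor(3)  // built once per partition
  iter.map { case (fileName, content) => (fileName, extractor.extract(content)) }
}
// The naive .map equivalent would build a new extractor for every document:
// DocsRDD.map { case (f, c) => (f, new NGramExtractor(3).extract(c)) }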

