How to convert Scala RDD to Map

Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/26351382/

Tags: scala, apache-spark

Asked by Soumitra

I have an RDD (an array of String), org.apache.spark.rdd.RDD[String] = MappedRDD[18], and I want to convert it to a map with unique ids. I did 'val vertexMAp = vertices.zipWithUniqueId', but this gave me another RDD of type 'org.apache.spark.rdd.RDD[(String, Long)]', whereas I want a 'Map[String, Long]'. How can I convert my 'org.apache.spark.rdd.RDD[(String, Long)]' to 'Map[String, Long]'?

Thanks

Answered by maasg

There's a built-in collectAsMap function in PairRDDFunctions that will deliver you a map of the pair values in the RDD.

val vertexMAp = vertices.zipWithUniqueId.collectAsMap

It's important to remember that an RDD is a distributed data structure. You can visualize it as 'pieces' of your data spread over the cluster. When you collect, you force all those pieces to go to the driver, and for that to work, they need to fit in the driver's memory.

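To make this concrete, here is a minimal sketch (the sample data and the SparkContext sc are assumptions for illustration; note that collectAsMap returns a scala.collection.Map):

val vertices = sc.parallelize(Seq("vertexX", "vertexY", "vertexZ"))
// zipWithUniqueId assigns ids per partition, so the exact numbers depend on partitioning.
val vertexMap: scala.collection.Map[String, Long] = vertices.zipWithUniqueId.collectAsMap
// e.g. Map(vertexX -> 0, vertexY -> 1, vertexZ -> 2)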
From the comments, it looks like in your case you need to deal with a large dataset. Making a Map out of it is not going to work, as it won't fit in the driver's memory; you'll get OOM exceptions if you try.

You probably need to keep the dataset as an RDD. If you are creating a Map in order to look up elements, you could use lookup on a PairRDD instead, like this:

import org.apache.spark.SparkContext._  // import implicit conversions to support PairRDDFunctions

val vertexMap = vertices.zipWithUniqueId
val vertexYId = vertexMap.lookup("vertexY")
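
Note that lookup returns a Seq[Long] holding every id paired with "vertexY" (an empty Seq if the key is absent). If the pair RDD has a known partitioner, lookup only searches the partition the key maps to; otherwise it scans all partitions, but in either case only the matching values come back to the driver.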

Answered by Eugene Zhulenev

Collect to the "local" machine and then convert the Array[(String, Long)] to a Map:

import org.apache.spark.rdd.RDD

val rdd: RDD[String] = ???

val map: Map[String, Long] = rdd.zipWithUniqueId().collect().toMap
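
For a fully self-contained version, here is a minimal runnable sketch (the local SparkSession, app name, and sample data are assumptions for illustration, not part of the original answer):

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object RddToMapExample {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only.
    val spark = SparkSession.builder().appName("rdd-to-map").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    val rdd: RDD[String] = sc.parallelize(Seq("a", "b", "c"))

    // collect() pulls every partition to the driver, so this is only safe for small data.
    val map: Map[String, Long] = rdd.zipWithUniqueId().collect().toMap
    println(map) // e.g. Map(a -> 0, b -> 1, c -> 2)

    spark.stop()
  }
}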

Answered by javadba

You do not need to convert it. The implicits for PairRDDFunctions detect a two-tuple-based RDD and apply the PairRDDFunctions methods automatically.

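A minimal sketch of what this means in practice (vertices: RDD[String] is assumed from the question; the value names are illustrative):

import org.apache.spark.rdd.RDD

// pairs is an RDD of two-tuples, so the PairRDDFunctions methods
// (collectAsMap, lookup, reduceByKey, ...) are available through the
// implicit conversion. In Spark 1.3+ it is picked up automatically;
// earlier versions need: import org.apache.spark.SparkContext._
val pairs: RDD[(String, Long)] = vertices.zipWithUniqueId
val asMap = pairs.collectAsMap        // scala.collection.Map[String, Long]
val ids   = pairs.lookup("vertexY")   // Seq[Long]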