How to convert Scala RDD to Map

Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/26351382/

Tags: scala, apache-spark

Asked by Soumitra

I have an RDD (an array of String), org.apache.spark.rdd.RDD[String] = MappedRDD[18], and I want to convert it to a map with unique ids. I did 'val vertexMAp = vertices.zipWithUniqueId', but this gave me another RDD of type 'org.apache.spark.rdd.RDD[(String, Long)]', whereas I want a 'Map[String, Long]'. How can I convert my 'org.apache.spark.rdd.RDD[(String, Long)]' to 'Map[String, Long]'?

Thanks

Answered by maasg

There's a built-in collectAsMap function in PairRDDFunctions that will deliver you a map of the pair values in the RDD.

val vertexMAp = vertices.zipWithUniqueId.collectAsMap

It's important to remember that an RDD is a distributed data structure. You can visualize it as 'pieces' of your data spread over the cluster. When you collect, you force all those pieces to go to the driver, and for that to work, they need to fit in the driver's memory.

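To make this concrete, here is a minimal sketch (the sample data and the SparkContext sc are assumptions for illustration; note that collectAsMap returns a scala.collection.Map):

val vertices = sc.parallelize(Seq("vertexX", "vertexY", "vertexZ"))
// zipWithUniqueId assigns ids per partition, so the exact numbers depend on partitioning.
val vertexMap: scala.collection.Map[String, Long] = vertices.zipWithUniqueId.collectAsMap
// e.g. Map(vertexX -> 0, vertexY -> 1, vertexZ -> 2)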
From the comments, it looks like in your case you need to deal with a large dataset. Making a Map out of it is not going to work, as it won't fit in the driver's memory; you'll get OOM exceptions if you try.

You probably need to keep the dataset as an RDD. If you are creating a Map in order to look up elements, you could use lookup on a PairRDD instead, like this:

import org.apache.spark.SparkContext._  // import implicit conversions to support PairRDDFunctions

val vertexMap = vertices.zipWithUniqueId
val vertexYId = vertexMap.lookup("vertexY")
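
Note that lookup returns a Seq[Long] holding every id paired with "vertexY" (an empty Seq if the key is absent). If the pair RDD has a known partitioner, lookup only searches the partition the key maps to; otherwise it scans all partitions, but in either case only the matching values come back to the driver.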

Answered by Eugene Zhulenev

Collect to the "local" machine and then convert the Array[(String, Long)] to a Map:

import org.apache.spark.rdd.RDD

val rdd: RDD[String] = ???

val map: Map[String, Long] = rdd.zipWithUniqueId().collect().toMap
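
For a fully self-contained version, here is a minimal runnable sketch (the local SparkSession, app name, and sample data are assumptions for illustration, not part of the original answer):

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object RddToMapExample {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only.
    val spark = SparkSession.builder().appName("rdd-to-map").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    val rdd: RDD[String] = sc.parallelize(Seq("a", "b", "c"))

    // collect() pulls every partition to the driver, so this is only safe for small data.
    val map: Map[String, Long] = rdd.zipWithUniqueId().collect().toMap
    println(map) // e.g. Map(a -> 0, b -> 1, c -> 2)

    spark.stop()
  }
}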

Answered by javadba

You do not need to convert it. The implicits for PairRDDFunctions detect a two-tuple-based RDD and apply the PairRDDFunctions methods automatically.

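A minimal sketch of what this means in practice (vertices: RDD[String] is assumed from the question; the value names are illustrative):

import org.apache.spark.rdd.RDD

// pairs is an RDD of two-tuples, so the PairRDDFunctions methods
// (collectAsMap, lookup, reduceByKey, ...) are available through the
// implicit conversion. In Spark 1.3+ it is picked up automatically;
// earlier versions need: import org.apache.spark.SparkContext._
val pairs: RDD[(String, Long)] = vertices.zipWithUniqueId
val asMap = pairs.collectAsMap        // scala.collection.Map[String, Long]
val ids   = pairs.lookup("vertexY")   // Seq[Long]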