scala - How to partition RDD by key in Spark?
Disclaimer: this page is a translation of a popular StackOverFlow question, provided under the CC BY-SA 4.0 license. If you use it, you must likewise follow the CC BY-SA license, note the original URL and author information, and attribute it to the original authors (not me): StackOverFlow
Original URL: http://stackoverflow.com/questions/32544307/
How to partition RDD by key in Spark?
Asked by BAR
Given that the HashPartitioner docs say:
[HashPartitioner] implements hash-based partitioning using Java's Object.hashCode.
Say I want to partition DeviceData by its kind.
case class DeviceData(kind: String, time: Long, data: String)
Would it be correct to partition an RDD[DeviceData] by overriding the deviceData.hashCode() method and using only the hashcode of kind?
But given that HashPartitioner takes a number-of-partitions parameter, I am confused as to whether I need to know the number of kinds in advance, and what happens if there are more kinds than partitions?
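For reference, HashPartitioner's assignment rule is essentially a non-negative modulo of the key's hashCode, so the number of kinds need not be known in advance; if there are more kinds than partitions, several kinds simply share a partition. A minimal Spark-free sketch of that rule (the helper name mirrors Spark's internal Utils.nonNegativeMod; the device kinds are invented for illustration):

```scala
// Spark's HashPartitioner maps a key to a partition roughly like this:
//   partition = nonNegativeMod(key.hashCode, numPartitions)
def nonNegativeMod(x: Int, mod: Int): Int = {
  val rawMod = x % mod
  rawMod + (if (rawMod < 0) mod else 0)
}

val numPartitions = 2
val kinds = Seq("thermometer", "barometer", "hygrometer", "anemometer")

// With 4 kinds and 2 partitions, at least two kinds must share a partition.
val assignment = kinds.map(k => k -> nonNegativeMod(k.hashCode, numPartitions))
assignment.foreach { case (k, p) => println(s"$k -> partition $p") }
```

The modulo is forced non-negative because Java's `%` can return a negative result for a negative hashCode, which would be an invalid partition index.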
Is it correct that if I write partitioned data to disk it will stay partitioned when read?
My goal is to call
deviceDataRdd.foreachPartition(d: Iterator[DeviceData] => ...)
And have only DeviceData's of the same kind value in the iterator.
Accepted answer by Justin Pihony
How about just doing a groupByKey using kind? Or another PairRDDFunctions method.
You make it seem to me that you don't really care about the partitioning, just that you get all of a specific kind in one processing flow?
The pair functions allow this:
rdd.keyBy(_.kind).partitionBy(new HashPartitioner(PARTITIONS))
.foreachPartition(...)
However, you can probably be a little safer with something more like:
rdd.keyBy(_.kind).reduceByKey(....)
or mapValues or a number of the other pair functions that guarantee you get the pieces as a whole.
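A Spark-free sketch of the guarantee those pair functions give you, using a plain Scala collection as a stand-in for the RDD (the DeviceData values are made up for illustration):

```scala
case class DeviceData(kind: String, time: Long, data: String)

val devices = Seq(
  DeviceData("thermometer", 1L, "21.5"),
  DeviceData("barometer",   2L, "1013"),
  DeviceData("thermometer", 3L, "22.0")
)

// keyBy(_.kind) followed by groupByKey collapses, on a local collection,
// to a groupBy: every record with the same kind ends up in one group,
// regardless of how the data was physically partitioned.
val grouped: Map[String, Seq[DeviceData]] = devices.groupBy(_.kind)

grouped.foreach { case (kind, ds) =>
  println(s"$kind: ${ds.size} reading(s)")
}
```

This is why the pair functions are the safer route: the grouping is a semantic guarantee of the operation, not a side effect of a particular partitioning.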
Answered by zero323
Would it be correct to partition an RDD[DeviceData] by overwriting the deviceData.hashCode() method and use only the hashcode of kind?
It wouldn't be. If you take a look at the Java Object.hashCode documentation, you'll find the following information about the general contract of hashCode:
If two objects are equal according to the equals(Object) method, then calling the hashCode method on each of the two objects must produce the same integer result.
So unless a notion of equality based purely on the kind of a device fits your use case, and I seriously doubt it does, tinkering with hashCode to get the desired partitioning is a bad idea. In the general case you should implement your own partitioner, but here it is not required.
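If exact one-kind-per-partition behaviour were ever needed, the custom-partitioner route would look roughly like this. This is only a sketch, not what the answer recommends: in real Spark code the class would extend org.apache.spark.Partitioner (which declares the same two members reproduced here as a local stand-in), and the kind list is a made-up assumption:

```scala
// Minimal local stand-in for org.apache.spark.Partitioner (same two members).
abstract class Partitioner extends Serializable {
  def numPartitions: Int
  def getPartition(key: Any): Int
}

// Assigns each known kind its own partition; unknown kinds fall back to partition 0.
// Note this requires knowing the kinds up front, which is exactly the drawback
// the question raises about partitioning by kind.
class KindPartitioner(kinds: Seq[String]) extends Partitioner {
  private val index = kinds.zipWithIndex.toMap
  override def numPartitions: Int = kinds.size
  override def getPartition(key: Any): Int =
    index.getOrElse(key.asInstanceOf[String], 0)
}

val p = new KindPartitioner(Seq("thermometer", "barometer"))
println(p.getPartition("barometer"))  // prints 1
```

A custom partitioner changes only where records physically live; it does not, by itself, give the per-group iteration guarantee that the pair functions provide.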
Since, excluding specialized scenarios in SQL and GraphX, partitionBy is valid only on a PairRDD, it makes sense to create an RDD[(String, DeviceData)] and use a plain HashPartitioner:
deviceDataRdd.map(dev => (dev.kind, dev)).partitionBy(new HashPartitioner(n))
Just keep in mind that in a situation where kind has low cardinality or a highly skewed distribution, using it for partitioning may not be an optimal solution.
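To see why low cardinality or skew hurts, here is a Spark-free sketch counting how many records the hash rule would send to each partition when one kind dominates (the data distribution is invented for illustration, and nonNegativeMod mirrors Spark's internal helper):

```scala
def nonNegativeMod(x: Int, mod: Int): Int = {
  val rawMod = x % mod
  rawMod + (if (rawMod < 0) mod else 0)
}

val numPartitions = 4
// 97 of 100 records share one kind: whichever partition that kind hashes to
// receives almost all of the data, no matter how many partitions exist.
val records = Seq.fill(97)("thermometer") ++
  Seq("barometer", "hygrometer", "anemometer")

val sizes: Map[Int, Int] = records
  .groupBy(k => nonNegativeMod(k.hashCode, numPartitions))
  .map { case (partition, ks) => partition -> ks.size }

println(sizes)
```

One task ends up processing nearly everything while the others sit idle, which is the practical cost of partitioning on a skewed key.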

