scala 如何在 Spark 中打印特定 RDD 分区的元素?
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/30077425/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
How to print elements of particular RDD partition in Spark?
提问by Arnav
How to print the elements of a particular partition, say 5th, alone?
如何单独打印特定分区的元素,例如第 5 个?
val distData = sc.parallelize(1 to 50, 10)
采纳答案by Fabio Fantoni
Using Spark/Scala:
使用 Spark/Scala:
val data = 1 to 50
val distData = sc.parallelize(data,10)
distData.mapPartitionsWithIndex( (index: Int, it: Iterator[Int]) =>it.toList.map(x => if (index ==5) {println(x)}).iterator).collect
produces:
产生:
26
27
28
29
30
回答by urug
you could possible use a counter against foreachPartition() API to achieve it.
您可以使用针对 foreachPartition() API 的计数器来实现它。
Here is a Java program that prints content of each partition JavaSparkContext context = new JavaSparkContext(conf);
这是一个打印每个分区内容的 Java 程序 JavaSparkContext context = new JavaSparkContext(conf);
JavaRDD<Integer> myArray = context.parallelize(Arrays.asList(1,2,3,4,5,6,7,8,9));
JavaRDD<Integer> partitionedArray = myArray.repartition(2);
System.out.println("partitioned array size is " + partitionedArray.count());
partitionedArray.foreachPartition(new VoidFunction<Iterator<Integer>>() {
public void call(Iterator<Integer> arg0) throws Exception {
while(arg0.hasNext()) {
System.out.println(arg0.next());
}
}
});
回答by Dichen
Assume you do this just for test purpose, then use glom(). See Spark documentation: https://spark.apache.org/docs/1.6.0/api/python/pyspark.html#pyspark.RDD.glom
假设您这样做只是为了测试目的,然后使用 glom()。请参阅 Spark 文档:https: //spark.apache.org/docs/1.6.0/api/python/pyspark.html#pyspark.RDD.glom
>>> rdd = sc.parallelize([1, 2, 3, 4], 2)
>>> rdd.glom().collect()
[[1, 2], [3, 4]]
>>> rdd.glom().collect()[1]
[3, 4]
Edit: Example in Scala:
编辑:Scala 中的示例:
scala> val distData = sc.parallelize(1 to 50, 10)
scala> distData.glom().collect()(4)
res2: Array[Int] = Array(21, 22, 23, 24, 25)

