Scala: how to print the contents of an RDD?
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must do so under the same license and attribute it to the original authors (not me): StackOverflow
Original: http://stackoverflow.com/questions/23173488/
How to print the contents of an RDD?
Asked by blue-sky
I'm attempting to print the contents of a collection to the Spark console.
I have a type:
linesWithSessionId: org.apache.spark.rdd.RDD[String] = FilteredRDD[3]
And I use the command:
scala> linesWithSessionId.map(line => println(line))
But this is printed instead:
res1: org.apache.spark.rdd.RDD[Unit] = MappedRDD[4] at map at <console>:19
How can I write the RDD to console or save it to disk so I can view its contents?
Answered by Oussama
If you want to view the contents of an RDD, one way is to use collect():
myRDD.collect().foreach(println)
That's not a good idea, though, when the RDD has billions of lines. Use take() to grab just a few elements to print out:
myRDD.take(n).foreach(println)
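If the first n elements are not representative, takeSample (a related RDD action) returns a random sample instead. A quick sketch, reusing the hypothetical myRDD from above:

// takeSample ships a random sample of the RDD to the driver as a local array.
myRDD.takeSample(withReplacement = false, num = 10).foreach(println)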
Answered by fedragon
The map function is a transformation, which means that Spark will not actually evaluate your RDD until you run an action on it.
To print it, you can use foreach (which is an action):
linesWithSessionId.foreach(println)
To write it to disk you can use one of the saveAs... functions (still actions) from the RDD API.
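Putting the two ideas together, here is a minimal sketch; the output path below is only illustrative:

// map is a lazy transformation: nothing runs yet.
val upper = linesWithSessionId.map(_.toUpperCase)

// foreach and saveAsTextFile are actions: they trigger the actual computation.
upper.foreach(println)                           // prints on the executors
upper.saveAsTextFile("/tmp/lines-with-session")  // writes one part-file per partition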
Answered by Noah
If you're running this on a cluster then println won't print back to your context. You need to bring the RDD data to your session. To do this you can force it to a local array and then print it out:
linesWithSessionId.toArray().foreach(line => println(line))
Answered by Wesam
You can convert your RDD to a DataFrame and then show() it.
// For implicit conversion from RDD to DataFrame
import spark.implicits._

val fruits = sc.parallelize(Seq(("apple", 1), ("banana", 2), ("orange", 17)))

// Convert to a DataFrame, then show it
fruits.toDF().show()
This will show the first 20 rows of your data, so the size of your data should not be an issue.
+------+---+
| _1| _2|
+------+---+
| apple| 1|
|banana| 2|
|orange| 17|
+------+---+
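Note that the spark in the import above is a SparkSession, which already exists in spark-shell. In a standalone application you would create it yourself, roughly like this (the app name is just a placeholder):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("print-rdd-demo")   // placeholder name
  .master("local[*]")
  .getOrCreate()
val sc = spark.sparkContext

import spark.implicits._  // enables rdd.toDF()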
Answered by Harvey
c.take(10)
and newer versions of Spark will show the table nicely.
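For a plain RDD, take returns a local Array on the driver, so printing works the usual way; in this sketch, c stands for whatever RDD you are inspecting:

// take(10) ships at most 10 elements to the driver, so println runs locally.
c.take(10).foreach(println)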
Answered by Karan Gupta
There are probably many architectural differences between myRDD.foreach(println) and myRDD.collect().foreach(println) (not only 'collect', but also other actions). One of the differences I saw is that when doing myRDD.foreach(println), the output will be in a random order. For example: if my RDD comes from a text file where each line has a number, the output will be in a different order. But when I did myRDD.collect().foreach(println), the order remains the same as in the text file.
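A small sketch of the difference; the numbered RDD here is made up for illustration:

// Four partitions, so foreach runs in parallel on the executors.
val nums = sc.parallelize(1 to 12, numSlices = 4)

// Output order depends on which task prints first (and on a cluster the
// output goes to the executors' stdout, not the driver's).
nums.foreach(println)

// collect() brings everything back to the driver first,
// so the original order is preserved when printing.
nums.collect().foreach(println)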
Answered by Niranjan Molkeri
In Python:
linesWithSessionIdCollect = linesWithSessionId.collect()
linesWithSessionIdCollect
This will print out all the contents of the RDD.
Answered by noego
Instead of typing this out each time, you can:
[1] Create a generic print method inside Spark Shell.
def p(rdd: org.apache.spark.rdd.RDD[_]) = rdd.foreach(println)
[2] Or even better, using implicits, you can add the function to the RDD class to print its contents.
implicit class Printer(rdd: org.apache.spark.rdd.RDD[_]) {
def print = rdd.foreach(println)
}
Example usage:
val rdd = sc.parallelize(List(1,2,3,4)).map(_*2)
p(rdd) // 1
rdd.print // 2
Output:
2
6
4
8
Important
This only makes sense if you are working in local mode with a small data set. Otherwise, you will either not be able to see the results on the client or run out of memory because the result set is too big.
Answered by Thomas Decaux
You can also save it as a file: rdd.saveAsTextFile("alicia.txt")
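Keep in mind that saveAsTextFile creates a directory of part-files (one per partition), not a single file. A small sketch if you want a single part-file, which only makes sense for small data:

// Coalesce to one partition first so only a single part-file is written.
// "alicia.txt" will still be a directory containing part-00000.
rdd.coalesce(1).saveAsTextFile("alicia.txt")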
Answered by ForeverLearner
In Java syntax:
rdd.collect().forEach(line -> System.out.println(line));

