Scala: how to print the contents of an RDD?

Disclaimer: this page is a translation of a popular StackOverflow Q&A, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/23173488/

How to print the contents of an RDD?

Tags: scala, apache-spark

Asked by blue-sky

I'm attempting to print the contents of a collection to the Spark console.

I have a type:

linesWithSessionId: org.apache.spark.rdd.RDD[String] = FilteredRDD[3]

And I use the command:

scala> linesWithSessionId.map(line => println(line))

But this is printed:

res1: org.apache.spark.rdd.RDD[Unit] = MappedRDD[4] at map at <console>:19

How can I write the RDD to the console, or save it to disk so I can view its contents?

Answer by Oussama

If you want to view the contents of an RDD, one way is to use collect():

myRDD.collect().foreach(println)

That's not a good idea, though, when the RDD has billions of lines. Use take() to print just a few elements:

myRDD.take(n).foreach(println)
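
If the first n elements aren't representative, takeSample (also an action) pulls a random subset back to the driver instead; a small sketch, with the sample size chosen arbitrarily:

// Print 5 randomly chosen elements, sampled without replacement
myRDD.takeSample(withReplacement = false, num = 5).foreach(println)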

Answer by fedragon

The map function is a transformation, which means that Spark will not actually evaluate your RDD until you run an action on it.

map函数是一个转换,这意味着在您对其运行操作之前,Spark 不会实际评估您的 RDD 。

To print it, you can use foreach (which is an action):

linesWithSessionId.foreach(println)

To write it to disk, you can use one of the saveAs... functions (also actions) from the RDD API.
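
For example, saveAsTextFile writes each partition out as a text file under a path you choose; a minimal sketch (the output path here is made up):

// Creates a directory of part-NNNNN files, one per partition
linesWithSessionId.saveAsTextFile("/tmp/linesWithSessionId")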

Answer by Noah

If you're running this on a cluster, then println won't print back to your context. You need to bring the RDD data to your session. To do this you can force it into a local array and then print it out:

linesWithSessionId.toArray().foreach(line => println(line))
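
Note that toArray() was deprecated in Spark 1.0 in favor of collect(); on current versions the equivalent is:

// collect() ships all elements back to the driver, so println runs locally
linesWithSessionId.collect().foreach(println)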

Answer by Wesam

You can convert your RDD to a DataFrame and then show() it.

// For implicit conversion from RDD to DataFrame
import spark.implicits._

val fruits = sc.parallelize(Seq(("apple", 1), ("banana", 2), ("orange", 17)))

// convert to DF then show it
fruits.toDF().show()

This will show the first 20 rows of your data, so the size of your data should not be an issue.

+------+---+                                                                    
|    _1| _2|
+------+---+
| apple|  1|
|banana|  2|
|orange| 17|
+------+---+
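
If the default _1/_2 column names are unhelpful, toDF also accepts explicit names; a small sketch (the names here are just illustrative):

// Name the columns instead of using the default _1, _2
fruits.toDF("fruit", "count").show()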

Answer by Harvey

c.take(10)

Newer Spark versions will display the result nicely.

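If c is actually a Dataset or DataFrame rather than a plain RDD, show renders the rows as an ASCII table directly; a sketch, assuming c is a DataFrame:

// Assuming c is a Dataset/DataFrame: prints the first 10 rows as a table
c.show(10)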

Answer by Karan Gupta

There are probably many architectural differences between myRDD.foreach(println) and myRDD.collect().foreach(println) (not only collect, but other actions as well). One of the differences I saw is that with myRDD.foreach(println), the output comes back in a random order. For example: if my RDD comes from a text file where each line has a number, the output will be in a different order. But with myRDD.collect().foreach(println), the order matches the text file.

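A quick illustration of that difference; a sketch you can paste into spark-shell:

val nums = sc.parallelize(1 to 10, 4)  // 4 partitions
nums.foreach(println)                  // runs on the executors; order varies between runs
nums.collect().foreach(println)        // prints 1..10 on the driver, in the original order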

Answer by Niranjan Molkeri

In Python:

   # collect() brings the whole RDD back to the driver as a Python list
   linesWithSessionIdCollect = linesWithSessionId.collect()
   print(linesWithSessionIdCollect)

This will print out all the contents of the RDD.

Answer by noego

Instead of typing this out each time, you can:

[1] Create a generic print method inside the Spark shell.

def p(rdd: org.apache.spark.rdd.RDD[_]) = rdd.foreach(println)

[2] Or even better, using implicits, you can add the function to the RDD class to print its contents.

implicit class Printer(rdd: org.apache.spark.rdd.RDD[_]) {
    def print = rdd.foreach(println)
}

Example usage:

val rdd = sc.parallelize(List(1,2,3,4)).map(_*2)

p(rdd) // 1
rdd.print // 2

Output:

2
6
4
8

Important

This only makes sense if you are working in local mode with a small dataset. Otherwise, you will either not be able to see the results on the client, or you will run out of memory because of the large result.

Answer by Thomas Decaux

You can also save it as a file: rdd.saveAsTextFile("alicia.txt")

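Note that saveAsTextFile creates a directory named alicia.txt containing one part file per partition, rather than a single text file; a sketch of writing the RDD and reading it back:

rdd.saveAsTextFile("alicia.txt")            // writes a directory of part-NNNNN files
sc.textFile("alicia.txt").foreach(println)  // reads the part files back as an RDD[String]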

Answer by ForeverLearner

In Java syntax:

rdd.collect().forEach(line -> System.out.println(line));