scala - Is there a better way to display an entire Spark SQL DataFrame?
Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same CC BY-SA terms and attribute it to the original authors (not me) on StackOverflow.
Original URL: http://stackoverflow.com/questions/30264373/
Is there a better way to display an entire Spark SQL DataFrame?
Asked by Yuri Brovman
I would like to display the entire Apache Spark SQL DataFrame with the Scala API. I can use the show() method:
myDataFrame.show(Int.MaxValue)
Is there a better way to display an entire DataFrame than using Int.MaxValue?
Answered by Grega Kešpret
It is generally not advisable to display an entire DataFrame to stdout, because that means you need to pull the entire DataFrame (all of its values) to the driver (unless the DataFrame is already local, which you can check with df.isLocal).
Unless you know ahead of time that your dataset is sufficiently small for the driver JVM process to have enough memory to hold all of its values, it is not safe to do this. That's why the DataFrame API's show() displays only the first 20 rows by default.
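A minimal sketch of the points above, assuming df is the DataFrame in question:
// show() with no arguments prints only the first 20 rows and truncates long cell values
df.show()
// df.isLocal tells you whether the data already lives in the driver process,
// so printing everything does not require pulling rows across the cluster
if (df.isLocal) {
  df.show(Int.MaxValue)
}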
You could use df.collect, which returns Array[T], and then iterate over each row and print it:
df.collect.foreach(println)
but you lose all the formatting implemented in df.showString(numRows: Int) (which show() uses internally).
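If you still want at least a header, a rough sketch (assuming the whole result fits in driver memory) is to print the column names yourself before the collected rows:
println(df.columns.mkString("\t"))                        // column names as a tab-separated header
df.collect().foreach(row => println(row.mkString("\t")))  // one tab-separated line per row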
So no, I guess there is no better way.
Answered by AkshayK
One way is to use the count() function to get the total number of records and then call show(rdd.count()).
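A hedged sketch of that idea (count() returns a Long, so it has to be narrowed to an Int, which only works when the row count fits in an Int):
val numRows = df.count().toInt   // total number of records, narrowed for show()
df.show(numRows)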
Answered by Suresh G
Try with:
df.show(35, false)
It will display 35 rows with full column values, since the second argument (false) turns off the default truncation of long values.
Answered by ayan guha
As others have suggested, printing out the entire DataFrame is a bad idea. However, you can use df.rdd.foreachPartition(f) to print it out partition by partition without flooding the driver JVM (as you would by using collect).
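A minimal sketch, assuming df is the DataFrame in question; note that the println runs on the executors, so in cluster mode the output ends up in the executor logs rather than on the driver console:
df.rdd.foreachPartition { partition =>
  partition.foreach(println)   // print every Row of this partition where it lives
}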
Answered by Justin Pihony
Nothing is more succinct than that, but if you want to avoid Int.MaxValue, then you could use a collect and process it, or foreach. But for a tabular format without much manual code, show is the best you can do.
Answered by Rajeev Rathor
In Java I have tried it in two ways. This is working perfectly for me:
1.
data.show(SomeNo);   // SomeNo: however many rows you want to display
2.
// requires imports: org.apache.spark.sql.Row and org.apache.spark.api.java.function.ForeachFunction
data.foreach(new ForeachFunction<Row>() {
    @Override
    public void call(Row arg0) throws Exception {
        System.out.println(arg0);   // print each Row (in cluster mode this runs on the executors)
    }
});
Answered by keypoint
I've tried show() and it seems to work sometimes. But sometimes it doesn't, so just give it a try:
println(df.show())

