What are the ways to check if DataFrames are empty other than doing a count check in Spark using Java?

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use or share it, but you must do so under the same CC BY-SA license and attribute it to the original authors (not me), citing the original StackOverflow URL: http://stackoverflow.com/questions/44123889/



java, apache-spark, dataframe, nullpointerexception, spark-dataframe

Asked by user5626966

if (df.count() == 0) {
    System.out.println("df is an empty dataframe");
}

The above is a way to check if a DataFrame is empty or not without getting a null pointer exception.


Is there a better way to do this in Spark? I am worried that if the DataFrame df has millions of records, the statement above will take a long time to execute.


Answered by Devendra Lattu

Taking the count can be slow. Instead, you can just check whether the head is empty:


df.head(1).isEmpty

Add exception handling to this, as it will throw java.util.NoSuchElementException if df is empty.

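Since the question is specifically about the Java API, a rough Java equivalent of this check might look like the sketch below. This is only an illustration under assumptions (Spark 2.x, where the Java-side DataFrame is a Dataset<Row>); it uses takeAsList(1), which returns a java.util.List with at most one row and so sidesteps the exception for the empty case. The helper class and method names are made up for the example.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Illustrative helper, not part of the original answer: test emptiness by
// fetching at most one row instead of counting every record.
public final class DataFrameChecks {
    public static boolean isEmpty(Dataset<Row> df) {
        // takeAsList(1) returns a java.util.List<Row> with zero or one element,
        // so no NoSuchElementException handling is needed for the empty case.
        return df.takeAsList(1).isEmpty();
    }
}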

Update: Check out How to check if spark dataframe is empty


Answered by Sivaprasanna Sethuraman

I recently came across one such scenario. The following are some of the ways to check whether a DataFrame is empty.


  • df.count() == 0
  • df.head().isEmpty
  • df.rdd.isEmpty
  • df.first().isEmpty

It is better to avoid count() since it is more expensive. However, there might be some situations where you are very certain that the DataFrame would have either a single row or no record at all (for example, executing a max() function in a Hive query). In such situations, it is okay to use count().
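For completeness, a minimal Java sketch of the cheaper checks from the list above might look like the following. This is an illustration rather than part of the original answer; the class and method names are made up, and note that Dataset.isEmpty() is only available from Spark 2.4 onwards.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Illustrative sketch, not from the original answer: emptiness checks that
// avoid a full count().
public final class EmptyChecks {

    // RDD.isEmpty() is backed by take(1), so it stops as soon as it finds a
    // single row instead of scanning the whole DataFrame.
    public static boolean isEmptyViaRdd(Dataset<Row> df) {
        return df.javaRDD().isEmpty();
    }

    // Spark 2.4+ exposes the same kind of short-circuit check directly on Dataset.
    public static boolean isEmptyViaDataset(Dataset<Row> df) {
        return df.isEmpty();
    }
}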
