Count the number of non-null values in a Spark DataFrame (Scala)

Disclaimer: this page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license, link to the original question, and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/41765739/
Asked by user299791
I have a data frame with some columns, and before doing analysis I'd like to understand how complete the data frame is. So I want to filter the data frame and count, for each column, the number of non-null values, possibly returning the result as a dataframe.
Basically, I am trying to achieve the same result as expressed in this question, but using Scala instead of Python...
Say you have:
import spark.implicits._
// build a frame with nullable Int columns; None becomes null
val df = Seq((Some(0), Some(4), Some(3)), (None, Some(3), Some(4)),
  (None, None, Some(5))).toDF("x", "y", "z")
How can you summarize the number of non-null values for each column and return a dataframe with the same number of columns and just a single row with the answer?
Answered by Psidom
One straightforward option is to use the .describe() function to get a summary of your data frame, where the count row contains the count of non-null values for each column:
df.describe().filter($"summary" === "count").show
+-------+---+---+---+
|summary|  x|  y|  z|
+-------+---+---+---+
|  count|  1|  2|  3|
+-------+---+---+---+
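
As a small generalization (my addition, not part of the original answer): if you want a plain one-row DataFrame of non-null counts without the summary column, you can build one count aggregate per column, since count(col) skips nulls:

import org.apache.spark.sql.functions.{col, count}

// one count per column; count(col) only counts non-null values
val counts = df.select(df.columns.map(c => count(col(c)).as(c)): _*)
counts.show()
// +---+---+---+
// |  x|  y|  z|
// +---+---+---+
// |  1|  2|  3|
// +---+---+---+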
Answered by Raphael Roth
Although I like Psidom's answer, often I'm more interested in the fraction of null values, because just the number of non-null values doesn't tell you much...
You can do something like:
import org.apache.spark.sql.functions.{sum, when, count}

df.agg(
  (sum(when($"x".isNotNull, 0).otherwise(1)) / count("*")).as("x: fraction null"),
  (sum(when($"y".isNotNull, 0).otherwise(1)) / count("*")).as("y: fraction null"),
  (sum(when($"z".isNotNull, 0).otherwise(1)) / count("*")).as("z: fraction null")
).show()
EDIT: sum(when($"x".isNotNull, 1).otherwise(0)) can also just be replaced by count($"x"), which only counts non-null values. As I find this not obvious, I tend to use the sum notation, which is clearer.
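
A sketch of the same idea written once over all columns instead of repeating each one by hand (my addition, not part of the original answer):

import org.apache.spark.sql.functions.{col, count, sum, when}

// build one "fraction null" expression per column, then aggregate them all at once
val exprs = df.columns.map(c =>
  (sum(when(col(c).isNotNull, 0).otherwise(1)) / count("*")).as(s"$c: fraction null"))
df.agg(exprs.head, exprs.tail: _*).show()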
Answered by Birchlabs
Here's how I did it in Scala 2.11, Spark 2.3.1:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
df.agg(
    count("x").divide(count(lit(1)))
        .as("x: percent non-null")
    // ...copy paste that for columns y and z
).head()
count("x") counts only non-null values of x, while count(lit(1)) runs on every row, so it counts all rows; that's why it serves as the denominator.
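
A quick check of that difference on the example frame (my addition):

import org.apache.spark.sql.functions.{count, lit}

// count($"x") skips nulls; count(lit(1)) counts every row
df.select(count($"x").as("non_null_x"), count(lit(1)).as("all_rows")).show()
// +----------+--------+
// |non_null_x|all_rows|
// +----------+--------+
// |         1|       3|
// +----------+--------+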
If you instead want to count the percent null in the population, find the complement of our count-based equation:
lit(1).minus(
    count("x").divide(count(lit(1)))
    )
    .as("x: percent null")
It's also worth knowing that you can cast null-ness to an integer and then sum it. But it's probably less performant:
// cast null-ness to an integer
sum(col("x").isNull.cast(IntegerType))
    .divide(count(lit(1)))
    .as("x: percent null")
Answered by KayV
Here is the simplest query:
df.filter($"x".isNotNull).count  // the original $"x" !== null comparison yields null for every row and matches nothing

