Count the number of non-null values in a Spark DataFrame (Scala)

Disclaimer: this page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license, link to the original question, and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/41765739/
Asked by user299791
I have a data frame with some columns, and before doing analysis I'd like to understand how complete the data frame is. So I want to filter the data frame and count, for each column, the number of non-null values, possibly returning the result as a dataframe.
Basically, I am trying to achieve the same result as expressed in this question, but using Scala instead of Python...
Say you have:
import spark.implicits._
// build a frame with nullable Int columns; None becomes null
val df = Seq((Some(0), Some(4), Some(3)), (None, Some(3), Some(4)),
  (None, None, Some(5))).toDF("x", "y", "z")
How can you summarize the number of non-null values for each column and return a dataframe with the same number of columns and just a single row with the answer?
Answered by Psidom
One straightforward option is to use the .describe() function to get a summary of your data frame, where the count row contains the count of non-null values for each column:
df.describe().filter($"summary" === "count").show
+-------+---+---+---+
|summary|  x|  y|  z|
+-------+---+---+---+
|  count|  1|  2|  3|
+-------+---+---+---+
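
As a small generalization (my addition, not part of the original answer): if you want a plain one-row DataFrame of non-null counts without the summary column, you can build one count aggregate per column, since count(col) skips nulls:

import org.apache.spark.sql.functions.{col, count}

// one count per column; count(col) only counts non-null values
val counts = df.select(df.columns.map(c => count(col(c)).as(c)): _*)
counts.show()
// +---+---+---+
// |  x|  y|  z|
// +---+---+---+
// |  1|  2|  3|
// +---+---+---+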
Answered by Raphael Roth
Although I like Psidom's answer, often I'm more interested in the fraction of null values, because just the number of non-null values doesn't tell you much...
You can do something like:
import org.apache.spark.sql.functions.{sum, when, count}

df.agg(
  (sum(when($"x".isNotNull, 0).otherwise(1)) / count("*")).as("x: fraction null"),
  (sum(when($"y".isNotNull, 0).otherwise(1)) / count("*")).as("y: fraction null"),
  (sum(when($"z".isNotNull, 0).otherwise(1)) / count("*")).as("z: fraction null")
).show()
EDIT: sum(when($"x".isNotNull, 1).otherwise(0)) can also just be replaced by count($"x"), which only counts non-null values. As I find this not obvious, I tend to use the sum notation, which is clearer.
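
A sketch of the same idea written once over all columns instead of repeating each one by hand (my addition, not part of the original answer):

import org.apache.spark.sql.functions.{col, count, sum, when}

// build one "fraction null" expression per column, then aggregate them all at once
val exprs = df.columns.map(c =>
  (sum(when(col(c).isNotNull, 0).otherwise(1)) / count("*")).as(s"$c: fraction null"))
df.agg(exprs.head, exprs.tail: _*).show()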
Answered by Birchlabs
Here's how I did it in Scala 2.11, Spark 2.3.1:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
df.agg(
    count("x").divide(count(lit(1)))
        .as("x: percent non-null")
    // ...copy paste that for columns y and z
).head()
count("x") counts only non-null values of x, while count(lit(1)) runs on every row, so it counts all rows; that's why it serves as the denominator.
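
A quick check of that difference on the example frame (my addition):

import org.apache.spark.sql.functions.{count, lit}

// count($"x") skips nulls; count(lit(1)) counts every row
df.select(count($"x").as("non_null_x"), count(lit(1)).as("all_rows")).show()
// +----------+--------+
// |non_null_x|all_rows|
// +----------+--------+
// |         1|       3|
// +----------+--------+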
If you instead want to count the percent null in the population, find the complement of our count-based equation:
lit(1).minus(
    count("x").divide(count(lit(1)))
    )
    .as("x: percent null")
It's also worth knowing that you can cast null-ness to an integer and then sum it. But it's probably less performant:
// cast null-ness to an integer
sum(col("x").isNull.cast(IntegerType))
    .divide(count(lit(1)))
    .as("x: percent null")
Answered by KayV
Here is the simplest query:
df.filter($"x".isNotNull).count  // the original $"x" !== null comparison yields null for every row and matches nothing

