Disclaimer: this page is a translation of a top StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/37189659/

How to handle null/empty values on a DataFrame in Spark/Scala

scala, apache-spark

Asked by Ectoras

I have a CSV file and I am processing its data.

I am working with DataFrames, and based on some conditions I calculate the average, min, max, mean, and sum of each column. The data in each column could be empty or null.

I have noticed that in some cases max() or sum() returned a null value instead of a number, or max() returned a number that is less than the output that min() returns.

I do not want to replace the null/empty values with other values. The only thing I have done is to use these two options when reading the CSV:

.option("nullValue", "null")
.option("treatEmptyValuesAsNulls", "true")

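For context, a minimal read sketch, assuming the Databricks spark-csv package on Spark 1.x; the header option and file path are placeholders:

val data = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")                  // assumption: the file has a header row
  .option("inferSchema", "true")             // infer numeric types instead of defaulting to strings
  .option("nullValue", "null")
  .option("treatEmptyValuesAsNulls", "true")
  .load("data.csv")                          // hypothetical path

Note that without inferSchema (or an explicit schema), spark-csv reads every column as a string; string comparison is lexicographic ("9" > "10"), which by itself can make max() look smaller than min() on numeric data.
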
Is there any way to handle this issue? Has anyone faced this problem before? Is it a problem of data types?

I run something like this:

data.agg(mean("col_name"), stddev("col_name"), count("col_name"),
         min("col_name"), max("col_name"))

Otherwise, I can conclude that it is a problem in my code.

Answered by Z. Simon

I have done some research on this question, and the results show that the mean, max, and min functions ignore null values. Below are the experiment code and results. Environment: Scala, Spark 1.6.1, Hadoop 2.6.0.

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
import sqlContext.implicits._   // for the $"column" syntax; sc and sqlContext come from the shell

// Sample data: the "value" column is nullable, and None is stored as a SQL NULL.
val row1 = Row("1", 2.4, "2016-12-21")
val row2 = Row("1", None, "2016-12-22")
val row3 = Row("2", None, "2016-12-23")
val row4 = Row("2", None, "2016-12-23")
val row5 = Row("3", 3.0, "2016-12-22")
val row6 = Row("3", 2.0, "2016-12-22")
val theRdd = sc.makeRDD(Array(row1, row2, row3, row4, row5, row6))

val schema = StructType(StructField("key", StringType, false) ::
                        StructField("value", DoubleType, true) ::   // nullable
                        StructField("d", StringType, false) :: Nil)
val df = sqlContext.createDataFrame(theRdd, schema)

df.show()

// Aggregates over the whole DataFrame skip the null rows.
df.agg(mean($"value"), max($"value"), min($"value")).show()

// Per-group aggregates: note key "2", whose values are all null.
df.groupBy("key").agg(mean($"value"), max($"value"), min($"value")).show()


Output:
+---+-----+----------+
|key|value|         d|
+---+-----+----------+
|  1|  2.4|2016-12-21|
|  1| null|2016-12-22|
|  2| null|2016-12-23|
|  2| null|2016-12-23|
|  3|  3.0|2016-12-22|
|  3|  2.0|2016-12-22|
+---+-----+----------+
+-----------------+----------+----------+
|       avg(value)|max(value)|min(value)|
+-----------------+----------+----------+
|2.466666666666667|       3.0|       2.0|
+-----------------+----------+----------+
+---+----------+----------+----------+
|key|avg(value)|max(value)|min(value)|
+---+----------+----------+----------+
|  1|       2.4|       2.4|       2.4|
|  2|      null|      null|      null|
|  3|       2.5|       3.0|       2.0|
+---+----------+----------+----------+

From the output you can see that the mean, max, and min functions on column 'value' for group key='1' return 2.4 instead of null, which shows that null values were ignored by these functions. However, if the column contains only null values (as for key='2'), then these functions return null.

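A quick way to confirm this, and to spot all-null groups, is to also aggregate with count, which likewise skips nulls when applied to a specific column. A sketch reusing the df built above:

df.groupBy("key")
  .agg(count($"value").alias("non_nulls"),
       mean($"value"), max($"value"), min($"value"))
  .show()
// key "2" would show non_nulls = 0 alongside null avg/max/min.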

Answered by javadba

Contrary to one of the comments, it is not true that nulls are ignored. Here is an approach:

// assumes import org.apache.spark.sql.functions._ and sqlContext.implicits._,
// and an integer-valued column
max(coalesce($"col_name", lit(Int.MinValue)))
min(coalesce($"col_name", lit(Int.MaxValue)))

This will still have an issue if there were only null values: you would need to convert Min/MaxValue back to null, or to whatever you want to use to represent "no valid/non-null entries" (see the sketch below).

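One way to map the sentinel back to null is a when/otherwise around the aggregate. A sketch under the same assumptions (hypothetical column col_name, org.apache.spark.sql.functions._ imported):

// Aggregate with the sentinel, then turn the sentinel back into null,
// so "no valid/non-null entries" reads as null again.
val m = max(coalesce($"col_name", lit(Int.MinValue)))
df.groupBy("key")
  .agg(when(m === Int.MinValue, lit(null)).otherwise(m).alias("max_or_null"))
  .show()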