Scala: filter out rows with NaN values in certain columns

Question

Asked by Olivier_s_j

I have a dataset in which some rows have NaN as an attribute value. The data is loaded into a DataFrame, and I would like to use only the rows in which every attribute has a value. I tried doing it via SQL:

val df_data = sqlContext.sql("SELECT * FROM raw_data WHERE attribute1 != NaN")

I tried several variants on this, but I can't seem to get it working.

Another option would be to convert it to an RDD and filter that, since filtering the DataFrame to check whether an attribute is NaN does not work.

Answer 1

Accepted answer by Wesley Miao

Here is some sample code that shows my way of doing it:

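// Brings toDF and the $"column" syntax into scope
import sqlContext.implicits._
val df = sc.parallelize(Seq((1, 0.5), (2, Double.NaN))).toDF("id", "value")
// Add an isNaN column computed from each row's value
val df2 = df.explode[Double, Boolean]("value", "isNaN")(d => Seq(d.isNaN))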

df will contain:

df.show

id value
1  0.5  
2  NaN

while filtering on df2 gives you what you want:

df2.filter($"isNaN" !== true).show

id value isNaN
1  0.5   false

Answer 2

Answered by David Griffin

I know you accepted the other answer, but you can do it without the explode (which should perform better than doubling your DataFrame size).

Prior to Spark 1.6, you could use a udf like this:

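import org.apache.spark.sql.functions.udf
// UDF that returns true when the Double it receives is NaN
def isNaNudf = udf[Boolean,Double](d => d.isNaN)
df.filter(isNaNudf($"value"))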

As of Spark 1.6, you can now use the built-in SQL function isnan() like this:

df.filter(isnan($"value"))
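As written, the filters above keep the rows whose value is NaN. To drop them, which is what the question asks for, the predicate needs to be negated. The following is a minimal sketch, assuming Spark 2.x or later with a local SparkSession (the thread itself uses the older sqlContext). The object name NaNFilters and the helpers dropNaNIn and dropIncomplete are hypothetical names chosen for illustration. It also shows na.drop(), which removes rows containing null or NaN in any column.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, isnan}

object NaNFilters {
  // Keep only the rows where the given column is not NaN (note the negation).
  def dropNaNIn(df: DataFrame, column: String): DataFrame =
    df.filter(!isnan(col(column)))

  // Drop every row that has a null or NaN value in any column.
  def dropIncomplete(df: DataFrame): DataFrame =
    df.na.drop()

  def main(args: Array[String]): Unit = {
    // Assumption: Spark 2.x+ running with a local master.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("NaNFilters")
      .getOrCreate()
    import spark.implicits._

    // Same sample data as in the accepted answer.
    val df = Seq((1, 0.5), (2, Double.NaN)).toDF("id", "value")

    dropNaNIn(df, "value").show()
    dropIncomplete(df).show()

    spark.stop()
  }
}
```

When every column has to be checked, na.drop() is probably the shorter option, since it does not require listing the columns one by one.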

Answer 3

Answered by hyokyun.park

This works:

where isNaN(tau_doc) = false

For example:

val df_data = sqlContext.sql("SELECT * FROM raw_data where isNaN(attribute1) = false")
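The question asks for rows in which all attributes have values, so the same SQL check can be repeated for each column. The following is a rough sketch, assuming Spark 2.x or later. The object SqlNaNFilter, the helper completeRows, and the column attribute2 are hypothetical, and the small raw_data view is only a stand-in for the asker's table.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object SqlNaNFilter {
  // Build a WHERE clause requiring every listed column to be non-NaN.
  def completeRows(spark: SparkSession, table: String, columns: Seq[String]): DataFrame = {
    require(columns.nonEmpty, "at least one column is needed")
    val predicate = columns.map(c => s"NOT isnan($c)").mkString(" AND ")
    spark.sql(s"SELECT * FROM $table WHERE $predicate")
  }

  def main(args: Array[String]): Unit = {
    // Assumption: Spark 2.x+ running with a local master.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("SqlNaNFilter")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical stand-in for the raw_data table from the question.
    Seq((0.5, 1.0), (Double.NaN, 2.0), (3.0, Double.NaN))
      .toDF("attribute1", "attribute2")
      .createOrReplaceTempView("raw_data")

    completeRows(spark, "raw_data", Seq("attribute1", "attribute2")).show()

    spark.stop()
  }
}
```

The column names are interpolated directly into the SQL string, so this sketch is only suitable for trusted, known column names.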
