Scala: filter out rows with NaN values in certain columns
Original source: http://stackoverflow.com/questions/30475739/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Filter out rows with NaN values in certain columns
Asked by Olivier_s_j
I have a dataset in which some rows have NaN as an attribute value. The data is loaded into a DataFrame, and I would like to use only the rows in which every attribute has a value. I tried doing it via SQL:
val df_data = sqlContext.sql("SELECT * FROM raw_data WHERE attribute1 != NaN")
I tried several variants of this, but I can't seem to get it working.
Another option would be to convert it to an RDD and filter that, since filtering the DataFrame to check whether an attribute isNaN does not work.
Accepted answer by Wesley Miao
Here is some sample code that shows my way of doing it:
import sqlContext.implicits._
val df = sc.parallelize(Seq((1, 0.5), (2, Double.NaN))).toDF("id", "value")
val df2 = df.explode[Double, Boolean]("value", "isNaN")(d => Seq(d.isNaN))
df will contain:
df.show
id value
1 0.5
2 NaN
Filtering df2 then gives you what you want:
df2.filter($"isNaN" !== true).show
id value isNaN
1 0.5 false
Answer by David Griffin
I know you accepted the other answer, but you can do it without the explode (which should perform better than doubling your DataFrame size).
Prior to Spark 1.6, you could use a udf like this:
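import org.apache.spark.sql.functions.udf
def isNaNudf = udf[Boolean,Double](d => d.isNaN)
// As written this keeps the NaN rows; use !isNaNudf($"value") to drop them
df.filter(isNaNudf($"value"))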
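The body of the udf above is just the ordinary `Double.isNaN` method. As a minimal sketch in plain Scala (no Spark involved; the `values` list and the variable names are made up for illustration), this shows why the check is written with `isNaN` rather than as a comparison against `Double.NaN`. Spark SQL's own comparison rules for NaN may differ from plain Scala, so this illustrates only the function body, not the SQL query from the question.

```scala
// Plain Scala: NaN is never equal to anything, including itself.
val values = Seq(0.5, Double.NaN, 2.0)

// `v != Double.NaN` is true for every Double, so this filter removes nothing.
val viaNotEquals = values.filter(v => v != Double.NaN)

// isNaN is the reliable test; negate it to keep only the real numbers.
val viaIsNaN = values.filter(v => !v.isNaN)

println(viaNotEquals.size)
println(viaIsNaN)
```

The same negated predicate is what you would wrap in a udf if the goal is to drop the NaN rows rather than select them.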
As of Spark 1.6, you can now use the built-in SQL function isnan() like this:
df.filter(isnan($"value"))
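Note that `filter(isnan(...))` on its own selects the NaN rows, so dropping them needs a negation. The sketch below puts that together in a self-contained form. It makes some assumptions that are not in the original answers: it uses a local `SparkSession` (Spark 2.x or later, whereas the thread uses the 1.x `sqlContext`), it rebuilds the two-row sample DataFrame from the accepted answer, and it also shows `na.drop` as an alternative I believe is equivalent for this case.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.isnan

// Assumption: a local session for illustration; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[1]").appName("drop-nan").getOrCreate()
import spark.implicits._

// Same sample data as in the accepted answer.
val df = Seq((1, 0.5), (2, Double.NaN)).toDF("id", "value")

// Negate isnan to keep only the rows whose value is a real number.
val viaIsNan = df.filter(!isnan($"value"))

// na.drop restricted to the named columns removes rows where they are null or NaN.
val viaNaDrop = df.na.drop(Seq("value"))

viaIsNan.show()
viaNaDrop.show()
```

Calling `df.na.drop()` with no arguments applies the same check to every column, which may be closer to the question's "all attributes have values" requirement.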
Answer by hyokyun.park
This works:
where isNaN(tau_doc) = false
For example:
val df_data = sqlContext.sql("SELECT * FROM raw_data where isNaN(attribute1) = false")

