scala - Removing Blank Strings from a Spark Dataframe
Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use it, you must likewise follow the CC BY-SA license, cite the original address and author information, and attribute it to the original author (not me): StackOverflow
原文地址: http://stackoverflow.com/questions/39951190/
Removing Blank Strings from a Spark Dataframe
Asked by mongolol
Attempting to remove rows in which a Spark dataframe column contains blank strings. Originally did val df2 = df1.na.drop() but it turns out many of these values are being encoded as "".
I'm stuck using Spark 1.3.1 and also cannot rely on DSL. (Importing spark.implicit_ isn't working.)
Answered by Kristian
Removing things from a dataframe requires filter().
val newDF = oldDF.filter("colName != ''")
or am I misunderstanding your question?
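If more than one column may contain blank strings, the same SQL-string filter can be chained once per column. A minimal sketch, with hypothetical column names name and city that are not from the original question:

```scala
// Chain one filter per column. The SQL-string form of filter works
// even on Spark 1.3.x, where the asker could not use the DSL/implicits.
val cleaned = oldDF
  .filter("name != ''")
  .filter("city != ''")
```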
Answered by Gaurav Khare
In case someone doesn't want to drop the records with blank strings, but just wants to convert the blank strings to some constant value:
val newdf = df.na.replace(df.columns,Map("" -> "0")) // to convert blank strings to zero
newdf.show()
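df.na.replace also accepts an explicit column list instead of df.columns, in case only some columns should be touched. A sketch, assuming a hypothetical amount column:

```scala
// Replace blank strings with "0" only in the "amount" column,
// leaving every other column untouched.
val newdf = df.na.replace(Seq("amount"), Map("" -> "0"))
newdf.show()
```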
Answered by Akshat Chaturvedi
df.filter(!($"col_name"===""))
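Note that an equality check against "" keeps rows whose value is whitespace-only (e.g. " "). One way to drop those as well (my own addition, not part of the original answer; it assumes Spark 2.x, where the =!= operator and the trim function are available, and that spark.implicits._ is imported) is to trim before comparing:

```scala
import org.apache.spark.sql.functions.trim

// Drop rows where col_name is empty OR contains only whitespace.
val cleaned = df.filter(trim($"col_name") =!= "")
```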
Answered by cody123
I am also new to Spark, so I don't know whether the code mentioned below is more complex than it needs to be, but it works.
Here we are creating a UDF which converts blank values to null.
sqlContext.udf().register("convertToNull", (String abc) -> (abc != null && abc.trim().length() > 0 ? abc : null), DataTypes.StringType);
After the above code you can use "convertToNull" (works on strings) in a select clause to make all blank fields null, and then use .na().drop().
crimeDataFrame.selectExpr("C0","convertToNull(C1)","C2","C3").na().drop()
Note: You can use the same approach in Scala. https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-sql-udfs.html
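A Scala equivalent of the Java UDF above might look like the following. This is a sketch of my own, not the original author's code; it assumes a sqlContext is in scope:

```scala
// Register a UDF that maps blank/whitespace-only strings to null,
// so that .na.drop() can remove those rows afterwards.
sqlContext.udf.register("convertToNull", (s: String) =>
  if (s != null && s.trim.nonEmpty) s else null
)

val cleaned = crimeDataFrame
  .selectExpr("C0", "convertToNull(C1)", "C2", "C3")
  .na.drop()
```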

