Python: Replace empty strings with None/null values in a DataFrame
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same CC BY-SA license, cite the original address, and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/33287886/
Replace empty strings with None/null values in DataFrame
Asked by dnlbrky
I have a Spark 1.5.0 DataFrame with a mix of null and empty strings in the same column. I want to convert all empty strings in all columns to null (None, in Python). The DataFrame may have hundreds of columns, so I'm trying to avoid hard-coded manipulations of each column.
See my attempt below, which results in an error.
from pyspark.sql import Row, SQLContext
sqlContext = SQLContext(sc)
## Create a test DataFrame
testDF = sqlContext.createDataFrame([Row(col1='foo', col2=1), Row(col1='', col2=2), Row(col1=None, col2='')])
testDF.show()
## +----+----+
## |col1|col2|
## +----+----+
## | foo| 1|
## | | 2|
## |null|null|
## +----+----+
## Try to replace an empty string with None/null
testDF.replace('', None).show()
## ValueError: value should be a float, int, long, string, list, or tuple
## A string value of null (obviously) doesn't work...
testDF.replace('', 'null').na.drop(subset='col1').show()
## +----+----+
## |col1|col2|
## +----+----+
## | foo| 1|
## |null| 2|
## +----+----+
Answered by zero323
It is as simple as this:
from pyspark.sql.functions import col, when
def blank_as_null(x):
    # Keep the value when it is a non-empty string; otherwise return null.
    return when(col(x) != "", col(x)).otherwise(None)
dfWithEmptyReplaced = testDF.withColumn("col1", blank_as_null("col1"))
dfWithEmptyReplaced.show()
## +----+----+
## |col1|col2|
## +----+----+
## | foo| 1|
## |null| 2|
## |null|null|
## +----+----+
dfWithEmptyReplaced.na.drop().show()
## +----+----+
## |col1|col2|
## +----+----+
## | foo| 1|
## +----+----+
If you want to fill multiple columns you can, for example, use reduce:
from functools import reduce  # required on Python 3

to_convert = set([...])  # Some set of columns
reduce(lambda df, x: df.withColumn(x, blank_as_null(x)), to_convert, testDF)
or use a comprehension:
exprs = [
    blank_as_null(x).alias(x) if x in to_convert else x
    for x in testDF.columns]
testDF.select(*exprs)
If you want to operate specifically on string fields, please check the answer by Robin Loxley.
Answered by soulmachine
My solution is much better than all the solutions I've seen so far; it can deal with as many fields as you want. See the little function below:
// Replace empty Strings with null values
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, length, lit, when}
import org.apache.spark.sql.types.StringType

private def setEmptyToNull(df: DataFrame): DataFrame = {
  val exprs = df.schema.map { f =>
    f.dataType match {
      case StringType => when(length(col(f.name)) === 0, lit(null: String).cast(StringType)).otherwise(col(f.name)).as(f.name)
      case _ => col(f.name)
    }
  }
  df.select(exprs: _*)
}
You can easily rewrite the function above in Python.
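For reference, here is one possible Python translation (a minimal sketch, untested; the name set_empty_to_null is just an illustration):
from pyspark.sql.functions import col, length, lit, when
from pyspark.sql.types import StringType

def set_empty_to_null(df):
    # Replace empty strings with null in every StringType column,
    # leaving all other columns untouched.
    exprs = [
        when(length(col(f.name)) == 0, lit(None).cast(StringType()))
            .otherwise(col(f.name)).alias(f.name)
        if isinstance(f.dataType, StringType)
        else col(f.name)
        for f in df.schema.fields
    ]
    return df.select(*exprs)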
I learned this trick from @liancheng.
Answered by Robin Loxley
Simply building on top of zero323's and soulmachine's answers, here is how to collect all StringType fields for conversion:
from pyspark.sql.types import StringType

# Collect the names of all StringType columns
string_fields = []
for f in test_df.schema.fields:
    if isinstance(f.dataType, StringType):
        string_fields.append(f.name)
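string_fields can then be plugged into zero323's comprehension, for example (a sketch, reusing blank_as_null from above):
exprs = [
    blank_as_null(f).alias(f) if f in string_fields else f
    for f in test_df.columns]
test_df.select(*exprs)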
Answered by bloodrootfc
UDFs are not terribly efficient. The correct way to do this using a built-in method is:
from pyspark.sql.functions import col, when

df = df.withColumn('myCol', when(col('myCol') == '', None).otherwise(col('myCol')))
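To cover every string column with the same built-in approach, one possible loop (a sketch that chains withColumn per column, assuming df is already defined):
from pyspark.sql.types import StringType

for f in df.schema.fields:
    if isinstance(f.dataType, StringType):
        df = df.withColumn(f.name, when(col(f.name) == '', None).otherwise(col(f.name)))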
Answered by lfk
This is a different version of soulmachine's solution, but I don't think you can translate this to Python as easily:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, length, lit, when}
import org.apache.spark.sql.types.DataTypes

def emptyStringsToNone(df: DataFrame): DataFrame = {
  df.schema.foldLeft(df)(
    (current, field) =>
      field.dataType match {
        case DataTypes.StringType =>
          current.withColumn(
            field.name,
            when(length(col(field.name)) === 0, lit(null: String)).otherwise(col(field.name))
          )
        case _ => current
      }
  )
}