Python: Replace empty strings with None/null values in a DataFrame
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same CC BY-SA license, cite the original address, and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/33287886/
Replace empty strings with None/null values in DataFrame
Asked by dnlbrky
I have a Spark 1.5.0 DataFrame with a mix of null and empty strings in the same column. I want to convert all empty strings in all columns to null (None, in Python). The DataFrame may have hundreds of columns, so I'm trying to avoid hard-coded manipulations of each column.
See my attempt below, which results in an error.
from pyspark.sql import Row, SQLContext
sqlContext = SQLContext(sc)
## Create a test DataFrame
testDF = sqlContext.createDataFrame([Row(col1='foo', col2=1), Row(col1='', col2=2), Row(col1=None, col2='')])
testDF.show()
## +----+----+
## |col1|col2|
## +----+----+
## | foo| 1|
## | | 2|
## |null|null|
## +----+----+
## Try to replace an empty string with None/null
testDF.replace('', None).show()
## ValueError: value should be a float, int, long, string, list, or tuple
## A string value of null (obviously) doesn't work...
testDF.replace('', 'null').na.drop(subset='col1').show()
## +----+----+
## |col1|col2|
## +----+----+
## | foo| 1|
## |null| 2|
## +----+----+
Answered by zero323
It is as simple as this:
from pyspark.sql.functions import col, when
def blank_as_null(x):
    # Keep the value when it is a non-empty string; otherwise return null.
    return when(col(x) != "", col(x)).otherwise(None)
dfWithEmptyReplaced = testDF.withColumn("col1", blank_as_null("col1"))
dfWithEmptyReplaced.show()
## +----+----+
## |col1|col2|
## +----+----+
## | foo| 1|
## |null| 2|
## |null|null|
## +----+----+
dfWithEmptyReplaced.na.drop().show()
## +----+----+
## |col1|col2|
## +----+----+
## | foo| 1|
## +----+----+
If you want to fill multiple columns you can, for example, use reduce:
from functools import reduce  # required on Python 3

to_convert = set([...])  # Some set of columns
reduce(lambda df, x: df.withColumn(x, blank_as_null(x)), to_convert, testDF)
or use a comprehension:
exprs = [
    blank_as_null(x).alias(x) if x in to_convert else x
    for x in testDF.columns]
testDF.select(*exprs)
If you want to operate specifically on string fields, please check the answer by Robin Loxley.
Answered by soulmachine
My solution is much better than all the solutions I've seen so far; it can deal with as many fields as you want. See the little function below:
// Replace empty Strings with null values
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, length, lit, when}
import org.apache.spark.sql.types.StringType

private def setEmptyToNull(df: DataFrame): DataFrame = {
  val exprs = df.schema.map { f =>
    f.dataType match {
      case StringType => when(length(col(f.name)) === 0, lit(null: String).cast(StringType)).otherwise(col(f.name)).as(f.name)
      case _ => col(f.name)
    }
  }
  df.select(exprs: _*)
}
You can easily rewrite the function above in Python.
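For reference, here is one possible Python translation (a minimal sketch, untested; the name set_empty_to_null is just an illustration):
from pyspark.sql.functions import col, length, lit, when
from pyspark.sql.types import StringType

def set_empty_to_null(df):
    # Replace empty strings with null in every StringType column,
    # leaving all other columns untouched.
    exprs = [
        when(length(col(f.name)) == 0, lit(None).cast(StringType()))
            .otherwise(col(f.name)).alias(f.name)
        if isinstance(f.dataType, StringType)
        else col(f.name)
        for f in df.schema.fields
    ]
    return df.select(*exprs)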
I learned this trick from @liancheng.
Answered by Robin Loxley
Simply building on top of zero323's and soulmachine's answers, here is how to collect all StringType fields for conversion:
from pyspark.sql.types import StringType

# Collect the names of all StringType columns
string_fields = []
for f in test_df.schema.fields:
    if isinstance(f.dataType, StringType):
        string_fields.append(f.name)
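string_fields can then be plugged into zero323's comprehension, for example (a sketch, reusing blank_as_null from above):
exprs = [
    blank_as_null(f).alias(f) if f in string_fields else f
    for f in test_df.columns]
test_df.select(*exprs)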
Answered by bloodrootfc
UDFs are not terribly efficient. The correct way to do this using a built-in method is:
from pyspark.sql.functions import col, when

df = df.withColumn('myCol', when(col('myCol') == '', None).otherwise(col('myCol')))
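To cover every string column with the same built-in approach, one possible loop (a sketch that chains withColumn per column, assuming df is already defined):
from pyspark.sql.types import StringType

for f in df.schema.fields:
    if isinstance(f.dataType, StringType):
        df = df.withColumn(f.name, when(col(f.name) == '', None).otherwise(col(f.name)))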
Answered by lfk
This is a different version of soulmachine's solution, but I don't think you can translate this to Python as easily:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, length, lit, when}
import org.apache.spark.sql.types.DataTypes

def emptyStringsToNone(df: DataFrame): DataFrame = {
  df.schema.foldLeft(df)(
    (current, field) =>
      field.dataType match {
        case DataTypes.StringType =>
          current.withColumn(
            field.name,
            when(length(col(field.name)) === 0, lit(null: String)).otherwise(col(field.name))
          )
        case _ => current
      }
  )
}