Add an empty column to a Spark DataFrame in Python

Note: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not this site). Original question: http://stackoverflow.com/questions/33038686/

Add an empty column to Spark DataFrame

python, apache-spark, dataframe, pyspark, apache-spark-sql

Asked by architectonic

As mentioned in many other locations on the web, adding a new column to an existing DataFrame is not straightforward. Unfortunately it is important to have this functionality (even though it is inefficient in a distributed environment), especially when trying to concatenate two DataFrames using unionAll.

What is the most elegant workaround for adding a null column to a DataFrame to facilitate a unionAll?

My version goes like this:

from pyspark.sql.types import StringType
from pyspark.sql.functions import UserDefinedFunction

# A UDF that ignores its input and always returns None
to_none = UserDefinedFunction(lambda x: None, StringType())
new_df = old_df.withColumn('new_column', to_none(old_df['any_col_from_old']))

Accepted answer by zero323

All you need here is a literal and cast:

from pyspark.sql.functions import lit
from pyspark.sql.types import StringType

new_df = old_df.withColumn('new_column', lit(None).cast(StringType()))

A full example:

from pyspark.sql import Row

row = Row("foo", "bar")
df = sc.parallelize([row(1, "2"), row(2, "3")]).toDF()
df.printSchema()

## root
##  |-- foo: long (nullable = true)
##  |-- bar: string (nullable = true)

new_df = df.withColumn('new_column', lit(None).cast(StringType()))
new_df.printSchema()

## root
##  |-- foo: long (nullable = true)
##  |-- bar: string (nullable = true)
##  |-- new_column: string (nullable = true)

new_df.show()

## +---+---+----------+
## |foo|bar|new_column|
## +---+---+----------+
## |  1|  2|      null|
## |  2|  3|      null|
## +---+---+----------+

A Scala equivalent can be found here: Create new Dataframe with empty/null field values

Answered by Shrikant Prabhu

I would cast lit(None) to NullType instead of StringType, so that if we ever have to filter out the non-null rows on that column, it can easily be done as follows:

from pyspark.sql import Row
from pyspark.sql.functions import col, lit
from pyspark.sql.types import NullType

row = Row("foo", "bar")
df = sc.parallelize([row(1, "2"), row(2, "3")]).toDF()

new_df = df.withColumn('new_column', lit(None).cast(NullType()))
new_df.printSchema()

new_df.filter(col("new_column").isNull()).show()
new_df.filter(col("new_column").isNotNull()).show()

Also be careful not to use lit("None") (with quotes) when casting to StringType: it stores the literal string "None" rather than a real null, so searching for records with the filter condition .isNull() on col("new_column") would return nothing.
