Python: add an empty column to a Spark DataFrame
Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same CC BY-SA license and attribute it to the original authors (not me): StackOverflow.
Original question: http://stackoverflow.com/questions/33038686/
Add an empty column to Spark DataFrame
Asked by architectonic
As mentioned in many other locations on the web, adding a new column to an existing DataFrame is not straightforward. Unfortunately it is important to have this functionality (even though it is inefficient in a distributed environment), especially when trying to concatenate two DataFrames using unionAll.
What is the most elegant workaround for adding a null column to a DataFrame to facilitate a unionAll?
My version goes like this:
from pyspark.sql.types import StringType
from pyspark.sql.functions import UserDefinedFunction

# UDF that ignores its input and always returns None, typed as a string
to_none = UserDefinedFunction(lambda x: None, StringType())
new_df = old_df.withColumn('new_column', to_none(old_df['any_col_from_old']))
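For context, here is a rough sketch (my addition, not part of the original question) of the unionAll situation mentioned above; df_a, df_b and their column names are made up for illustration:

# Hypothetical DataFrames: df_a has columns (id, name), df_b only has (id).
# unionAll matches columns by position, so both sides must expose the same schema;
# pad df_b with a placeholder null 'name' column first, using the UDF above.
df_b_padded = df_b.withColumn('name', to_none(df_b['id']))
combined = df_a.unionAll(df_b_padded.select(*df_a.columns))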
Accepted answer by zero323
All you need here is a literal and a cast:
from pyspark.sql.functions import lit
from pyspark.sql.types import StringType

new_df = old_df.withColumn('new_column', lit(None).cast(StringType()))
A full example:
from pyspark.sql import Row

row = Row("foo", "bar")
df = sc.parallelize([row(1, "2"), row(2, "3")]).toDF()
df.printSchema()
## root
## |-- foo: long (nullable = true)
## |-- bar: string (nullable = true)
new_df = df.withColumn('new_column', lit(None).cast(StringType()))
new_df.printSchema()
## root
## |-- foo: long (nullable = true)
## |-- bar: string (nullable = true)
## |-- new_column: string (nullable = true)
new_df.show()
## +---+---+----------+
## |foo|bar|new_column|
## +---+---+----------+
## | 1| 2| null|
## | 2| 3| null|
## +---+---+----------+
A Scala equivalent can be found here: Create new Dataframe with empty/null field values
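The same pattern works for other column types by changing the cast target; a small sketch (my addition), assuming the df from the example above:

from pyspark.sql.functions import lit
from pyspark.sql.types import DoubleType, IntegerType

# Null columns of other types: just change the type passed to cast()
typed_df = (df
            .withColumn('int_col', lit(None).cast(IntegerType()))
            .withColumn('double_col', lit(None).cast(DoubleType())))
typed_df.printSchema()
## ... the new columns appear as nullable integer and double fields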
Answered by Shrikant Prabhu
I would cast lit(None) to NullType instead of StringType, so that if we ever have to filter out the not-null rows on that column, it can easily be done as follows:
from pyspark.sql import Row
from pyspark.sql.functions import col, lit
from pyspark.sql.types import NullType
df = sc.parallelize([Row(1, "2"), Row(2, "3")]).toDF()
new_df = df.withColumn('new_column', lit(None).cast(NullType()))
new_df.printSchema()
df_null = new_df.filter(col("new_column").isNull())
df_non_null = new_df.filter(col("new_column").isNotNull())
df_null.show()
df_non_null.show()
Also, be careful not to use lit("None") (with quotes) if you are casting to StringType: the column would then contain the string "None" rather than a real null, so filtering with the condition .isNull() on col("new_column") would not find those records.
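To make that pitfall concrete, here is a quick sketch (my addition) contrasting the two, reusing the two-row df from above:

from pyspark.sql.functions import col, lit
from pyspark.sql.types import StringType

wrong = df.withColumn('new_column', lit("None"))                   # the string "None"
right = df.withColumn('new_column', lit(None).cast(StringType()))  # a real null

wrong.filter(col("new_column").isNull()).count()   # 0 -- no rows match
right.filter(col("new_column").isNull()).count()   # 2 -- both rows match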