Add an empty column to a Spark DataFrame in Python

Note: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not this site). Original question: http://stackoverflow.com/questions/33038686/

Add an empty column to Spark DataFrame

python, apache-spark, dataframe, pyspark, apache-spark-sql

Asked by architectonic

As mentioned in many other locations on the web, adding a new column to an existing DataFrame is not straightforward. Unfortunately it is important to have this functionality (even though it is inefficient in a distributed environment), especially when trying to concatenate two DataFrames using unionAll.

What is the most elegant workaround for adding a null column to a DataFrame to facilitate a unionAll?

My version goes like this:

from pyspark.sql.types import StringType
from pyspark.sql.functions import UserDefinedFunction

# A UDF that ignores its input and always returns None
to_none = UserDefinedFunction(lambda x: None, StringType())
new_df = old_df.withColumn('new_column', to_none(old_df['any_col_from_old']))

Accepted answer by zero323

All you need here is a literal and cast:

from pyspark.sql.functions import lit
from pyspark.sql.types import StringType

new_df = old_df.withColumn('new_column', lit(None).cast(StringType()))

A full example:

from pyspark.sql import Row

row = Row("foo", "bar")
df = sc.parallelize([row(1, "2"), row(2, "3")]).toDF()
df.printSchema()

## root
##  |-- foo: long (nullable = true)
##  |-- bar: string (nullable = true)

new_df = df.withColumn('new_column', lit(None).cast(StringType()))
new_df.printSchema()

## root
##  |-- foo: long (nullable = true)
##  |-- bar: string (nullable = true)
##  |-- new_column: string (nullable = true)

new_df.show()

## +---+---+----------+
## |foo|bar|new_column|
## +---+---+----------+
## |  1|  2|      null|
## |  2|  3|      null|
## +---+---+----------+

A Scala equivalent can be found here: Create new Dataframe with empty/null field values

Answered by Shrikant Prabhu

I would cast lit(None) to NullType instead of StringType, so that if we ever have to filter out the non-null rows on that column, it can easily be done as follows:

from pyspark.sql import Row
from pyspark.sql.functions import col, lit
from pyspark.sql.types import NullType

row = Row("foo", "bar")
df = sc.parallelize([row(1, "2"), row(2, "3")]).toDF()

new_df = df.withColumn('new_column', lit(None).cast(NullType()))
new_df.printSchema()

new_df.filter(col("new_column").isNull()).show()
new_df.filter(col("new_column").isNotNull()).show()

Also be careful not to use lit("None") (with quotes) when casting to StringType: it stores the literal string "None" rather than a real null, so searching for records with the filter condition .isNull() on col("new_column") would return nothing.
