Create new DataFrame with empty/null field values (Scala)
Source: http://stackoverflow.com/questions/32067467/
This content comes from Stack Overflow and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors.
Asked by sshroff
I am creating a new DataFrame from an existing DataFrame, but I need to add a new column ("field1" in the code below) to the new DataFrame. How do I do so? A working code sample would be appreciated.
val edwDf = omniDataFrame
  .withColumn("field1", callUDF((value: String) => None))
  .withColumn("field2",
    callUdf("devicetypeUDF", (omniDataFrame.col("some_field_in_old_df"))))

edwDf
  .select("field1", "field2")
  .save("odsoutdatafldr", "com.databricks.spark.csv");
Answered by zero323
It is possible to use lit(null):
import org.apache.spark.sql.functions.{lit, udf}
case class Record(foo: Int, bar: String)
val df = Seq(Record(1, "foo"), Record(2, "bar")).toDF
val dfWithFoobar = df.withColumn("foobar", lit(null: String))
One problem here is that the column type is null:
scala> dfWithFoobar.printSchema
root
|-- foo: integer (nullable = false)
|-- bar: string (nullable = true)
|-- foobar: null (nullable = true)
and it is not retained by the CSV writer. If a typed column is a hard requirement, you can cast the column to a specific type (let's say String), with either a DataType
import org.apache.spark.sql.types.StringType
df.withColumn("foobar", lit(null).cast(StringType))
or string description
df.withColumn("foobar", lit(null).cast("string"))
or use a UDF like this:
val getNull = udf(() => None: Option[String]) // Or some other type
df.withColumn("foobar", getNull()).printSchema
root
|-- foo: integer (nullable = false)
|-- bar: string (nullable = true)
|-- foobar: string (nullable = true)
A Python equivalent can be found here: Add an empty column to spark DataFrame
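Tying this back to the question, the cast-based variant can be sketched end to end as below. This is an illustration, not part of the original answer: it assumes Spark 2.0 or later, where `SparkSession` and the built-in `DataFrameWriter.csv` stand in for the `com.databricks.spark.csv` package used in the question. The sample rows are made up, and `field2` simply copies the source column in place of the question's `devicetypeUDF`, which is not shown in the source.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.StringType

// Assumption: a local session for experimentation (spark-shell already provides `spark`).
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("null-column-sketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical stand-in for the question's omniDataFrame.
val omniDataFrame = Seq("phone", "tablet").toDF("some_field_in_old_df")

val edwDf = omniDataFrame
  // All-null column with an explicit string type instead of the null type.
  .withColumn("field1", lit(null).cast(StringType))
  // Placeholder for the devicetypeUDF call from the question.
  .withColumn("field2", $"some_field_in_old_df")

edwDf.printSchema()

// Write both columns as CSV; the path is the one used in the question.
edwDf
  .select("field1", "field2")
  .write
  .mode("overwrite")
  .option("header", "true")
  .csv("odsoutdatafldr")
```

Because `field1` is declared as a string column, the writer sees an ordinary nullable string field rather than a column of the null type.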
Answered by sanyi14ka
Just to extend the perfect answer provided by @zero323, here's a solution which can be used starting from Spark 2.2.0.
import org.apache.spark.sql.functions.typedLit
df.withColumn("foobar", typedLit[Option[String]](None)).printSchema
root
|-- foo: integer (nullable = false)
|-- bar: string (nullable = true)
|-- foobar: string (nullable = true)
It's similar to the 3rd solution, but without using any UDF.
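To compare the approaches from both answers side by side, here is a small sketch that adds one column per technique and inspects the resulting data types. It assumes Spark 2.2 or later (needed for `typedLit`) and a local `SparkSession`; the column names `untyped`, `viaCast`, `viaUdf`, and `viaTypedLit` are invented for the illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{lit, typedLit, udf}
import org.apache.spark.sql.types.{NullType, StringType}

// Assumption: a local session for experimentation (spark-shell already provides `spark`).
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("null-column-comparison")
  .getOrCreate()
import spark.implicits._

// Same sample data as in the first answer, built from tuples.
val df = Seq((1, "foo"), (2, "bar")).toDF("foo", "bar")

// UDF that always returns None, typed as Option[String].
val getNull = udf(() => None: Option[String])

val compared = df
  .withColumn("untyped", lit(null))                           // null type
  .withColumn("viaCast", lit(null).cast(StringType))          // string via cast
  .withColumn("viaUdf", getNull())                            // string via UDF
  .withColumn("viaTypedLit", typedLit[Option[String]](None))  // string via typedLit

// Print each column name with its Spark data type.
compared.schema.fields.foreach(f => println(s"${f.name}: ${f.dataType}"))
```

Only the bare `lit(null)` column ends up with the null type; the other three are string columns, so which of them to use is mostly a question of Spark version and taste.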

