Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license, cite the original URL, and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/39962792/
Adding StringType column to existing Spark DataFrame and then applying default values
Asked by smeeb
Scala 2.10 here, using Spark 1.6.2. I have a similar (but not identical) question to this one; however, the accepted answer is not an SSCCE and assumes a certain amount of "upfront knowledge" about Spark, so I can't reproduce it or make sense of it. More importantly, that question is limited to adding a new column to an existing dataframe, whereas I need to add a column as well as a value for all existing rows in the dataframe.
So I want to add a column to an existing Spark DataFrame, and then apply an initial ('default') value for that new column to all rows.
val json : String = """{ "x": true, "y": "not true" }"""
val rdd = sparkContext.parallelize(Seq(json))
val jsonDF = sqlContext.read.json(rdd)
jsonDF.show()
When I run that, I get the following output (via .show()):
+----+--------+
| x| y|
+----+--------+
|true|not true|
+----+--------+
Now I want to add a new field to jsonDF, after it's created and without modifying the json string, such that the resultant DF would look like this:
+----+--------+----+
| x| y| z|
+----+--------+----+
|true|not true| red|
+----+--------+----+
Meaning, I want to add a new "z" column to the DF, of type StringType, and then default all rows to contain a z-value of "red".
From that other question I have pieced together the following pseudo-code:
val json : String = """{ "x": true, "y": "not true" }"""
val rdd = sparkContext.parallelize(Seq(json))
val jsonDF = sqlContext.read.json(rdd)
//jsonDF.show()
val newDF = jsonDF.withColumn("z", jsonDF("col") + 1)
newDF.show()
But when I run this, that .withColumn(...) call fails at runtime with:
org.apache.spark.sql.AnalysisException: Cannot resolve column name "col" among (x, y);
at org.apache.spark.sql.DataFrame$$anonfun$resolve.apply(DataFrame.scala:152)
at org.apache.spark.sql.DataFrame$$anonfun$resolve.apply(DataFrame.scala:152)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:151)
at org.apache.spark.sql.DataFrame.col(DataFrame.scala:664)
at org.apache.spark.sql.DataFrame.apply(DataFrame.scala:652)
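The stack trace points at column resolution: jsonDF("col") asks Spark to resolve a column literally named col, and the DataFrame's schema only contains x and y. A minimal illustration of this (reusing the jsonDF defined above; this requires a live Spark context to run):

```scala
// jsonDF("col") is sugar for jsonDF.apply("col"), which resolves a column
// by name against the schema (x, y). No column named "col" exists there,
// so Spark throws AnalysisException -- at runtime, not compile time.
val ok = jsonDF("x")        // resolves: "x" is in the schema
// val bad = jsonDF("col")  // AnalysisException: Cannot resolve column name "col"
```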
I also don't see any API methods that would allow me to set "red" as the default value. Any ideas as to where I'm going awry?
Answered by zero323
You can use the lit function. First you have to import it:
import org.apache.spark.sql.functions.lit
and use it as shown below:
jsonDF.withColumn("z", lit("red"))
The type of the column will be inferred automatically.
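Putting the question and answer together, a minimal end-to-end sketch (Spark 1.6-era API as used in the question; the lit(null).cast(...) variant at the end is a common pattern for a typed null default and an assumption on my part, not part of the original answer):

```scala
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.StringType

val json: String = """{ "x": true, "y": "not true" }"""
val rdd = sparkContext.parallelize(Seq(json))
val jsonDF = sqlContext.read.json(rdd)

// Add a new "z" column whose value is the constant "red" for every row.
// lit infers StringType from the Scala String.
val newDF = jsonDF.withColumn("z", lit("red"))
newDF.show()

// To force a particular type explicitly, e.g. a null default that is
// still a StringType column, cast the literal:
val withNullDefault = jsonDF.withColumn("z", lit(null).cast(StringType))
```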

