Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license, cite the original URL, and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/39962792/
Adding StringType column to existing Spark DataFrame and then applying default values
Asked by smeeb
Scala 2.10 here, using Spark 1.6.2. I have a similar (but not identical) question to this one; however, the accepted answer is not an SSCCE and assumes a certain amount of "upfront knowledge" about Spark, so I can't reproduce it or make sense of it. More importantly, that question is limited to adding a new column to an existing dataframe, whereas I need to add a column as well as a value for all existing rows in the dataframe.
So I want to add a column to an existing Spark DataFrame, and then apply an initial ('default') value for that new column to all rows.
val json : String = """{ "x": true, "y": "not true" }"""
val rdd = sparkContext.parallelize(Seq(json))
val jsonDF = sqlContext.read.json(rdd)
jsonDF.show()
When I run that, I get the following output (via .show()):
+----+--------+
| x| y|
+----+--------+
|true|not true|
+----+--------+
Now I want to add a new field to jsonDF, after it's created and without modifying the json string, such that the resultant DF would look like this:
+----+--------+----+
| x| y| z|
+----+--------+----+
|true|not true| red|
+----+--------+----+
Meaning, I want to add a new "z" column to the DF, of type StringType, and then default all rows to contain a z-value of "red".
From that other question I have pieced together the following pseudo-code:
val json : String = """{ "x": true, "y": "not true" }"""
val rdd = sparkContext.parallelize(Seq(json))
val jsonDF = sqlContext.read.json(rdd)
//jsonDF.show()
val newDF = jsonDF.withColumn("z", jsonDF("col") + 1)
newDF.show()
But when I run this, that .withColumn(...) call fails at runtime with:
org.apache.spark.sql.AnalysisException: Cannot resolve column name "col" among (x, y);
at org.apache.spark.sql.DataFrame$$anonfun$resolve.apply(DataFrame.scala:152)
at org.apache.spark.sql.DataFrame$$anonfun$resolve.apply(DataFrame.scala:152)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:151)
at org.apache.spark.sql.DataFrame.col(DataFrame.scala:664)
at org.apache.spark.sql.DataFrame.apply(DataFrame.scala:652)
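The stack trace points at column resolution: jsonDF("col") asks Spark to resolve a column literally named col, and the DataFrame's schema only contains x and y. A minimal illustration of this (reusing the jsonDF defined above; this requires a live Spark context to run):

```scala
// jsonDF("col") is sugar for jsonDF.apply("col"), which resolves a column
// by name against the schema (x, y). No column named "col" exists there,
// so Spark throws AnalysisException -- at runtime, not compile time.
val ok = jsonDF("x")        // resolves: "x" is in the schema
// val bad = jsonDF("col")  // AnalysisException: Cannot resolve column name "col"
```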
I also don't see any API methods that would allow me to set "red" as the default value. Any ideas as to where I'm going awry?
Answered by zero323
You can use the lit function. First you have to import it:
import org.apache.spark.sql.functions.lit
and use it as shown below:
jsonDF.withColumn("z", lit("red"))
The type of the column will be inferred automatically.
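Putting the question and answer together, a minimal end-to-end sketch (Spark 1.6-era API as used in the question; the lit(null).cast(...) variant at the end is a common pattern for a typed null default and an assumption on my part, not part of the original answer):

```scala
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.StringType

val json: String = """{ "x": true, "y": "not true" }"""
val rdd = sparkContext.parallelize(Seq(json))
val jsonDF = sqlContext.read.json(rdd)

// Add a new "z" column whose value is the constant "red" for every row.
// lit infers StringType from the Scala String.
val newDF = jsonDF.withColumn("z", lit("red"))
newDF.show()

// To force a particular type explicitly, e.g. a null default that is
// still a StringType column, cast the literal:
val withNullDefault = jsonDF.withColumn("z", lit(null).cast(StringType))
```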

