Creating a new Spark DataFrame with a new column value based on a column in the first DataFrame (Java)
Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use or share it, but you must do so under the same license, cite the original address, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/37090496/
Asked by user1128482
This should be easy, but... using Spark 1.6.1... I have DataFrame #1 with columns A, B, C, with values:
A B C
1 2 A
2 2 A
3 2 B
4 2 C
I then create a new DataFrame with a new column D, like so:
DataFrame df2 = df1.withColumn("D", df1.col("C"));
So far so good, but I actually want the value in column D to be conditional, i.e.:
// pseudo code
if (col C == "A") then col D = "X"
else if (col C == "B") then col D = "Y"
else col D = "Z"
I'll then drop column C and rename D to C. I've tried looking at the Column functions but nothing appears to fit the bill; I thought of using df1.rdd().map() and iterating over the rows but aside from not actually managing to get it to work, I kind of thought that the whole point of DataFrames was to move away from the RDD abstraction?
Unfortunately I have to do this in Java (and of course Spark with Java is not optimal!!). It seems like I'm missing the obvious and am happy to be shown to be an idiot when presented with the solution!
Accepted answer by Daniel de Paula
I believe you can use when to achieve that. Additionally, you can probably replace the old column directly. For your example, the code would be something like:
import static org.apache.spark.sql.functions.*;

Column newCol = when(col("C").equalTo("A"), "X")
        .when(col("C").equalTo("B"), "Y")
        .otherwise("Z");
DataFrame df2 = df1.withColumn("C", newCol);
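If you would rather keep the original column until the end, as the question describes, a minimal sketch of the add/drop/rename variant (reusing the same newCol expression as above) could look like this:

// Add D from the conditional expression, then drop C and rename D to C.
DataFrame df2 = df1.withColumn("D", newCol)
        .drop("C")
        .withColumnRenamed("D", "C");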
For more details about when, check the Column Javadoc.
Answered by user1128482
Thanks to Daniel, I have resolved this :)
The missing piece was the static import of the SQL functions:
import static org.apache.spark.sql.functions.*;
I must have tried a million different ways of using when, but got compile failures/runtime errors because I didn't do the import. Once imported, Daniel's answer was spot on!
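For reference, if the static import is left out, the same expression can still be written against the fully qualified functions class; a minimal sketch:

import org.apache.spark.sql.Column;
import org.apache.spark.sql.functions;

// Same conditional expression, without the static import.
Column newCol = functions.when(functions.col("C").equalTo("A"), "X")
        .when(functions.col("C").equalTo("B"), "Y")
        .otherwise("Z");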
Answered by sudeepgupta90
You may also use a UDF to do the same job. Just write a simple if/then/else structure:
import org.apache.spark.sql.functions.udf

// A simple if/then/else construct, using the mapping from the question:
val customFunct = udf { d: String =>
  if (d == "A") "X"
  else if (d == "B") "Y"
  else "Z"
}
val new_DF = df.withColumn(column_name, customFunct(df("data_column")))
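Since the question itself is about Java, here is a minimal sketch of the same UDF approach with the Java API on Spark 1.6, assuming an existing SQLContext named sqlContext; the UDF name mapToD is illustrative:

import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.types.DataTypes;
import static org.apache.spark.sql.functions.*;

// Register a UDF that maps values of C to the new conditional values.
sqlContext.udf().register("mapToD", new UDF1<String, String>() {
    @Override
    public String call(String c) {
        if ("A".equals(c)) return "X";
        else if ("B".equals(c)) return "Y";
        else return "Z";
    }
}, DataTypes.StringType);

// Apply it like a built-in function.
DataFrame df2 = df1.withColumn("D", callUDF("mapToD", col("C")));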