Creating a new Spark DataFrame with a new column value based on a column in the first DataFrame (Java)
Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use or share it, but you must do so under the same license, cite the original address, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/37090496/
Asked by user1128482
This should be easy, but... using Spark 1.6.1... I have DataFrame #1 with columns A, B, C, with values:
A B C
1 2 A
2 2 A
3 2 B
4 2 C
I then create a new DataFrame with a new column D, like so:
DataFrame df2 = df1.withColumn("D", df1.col("C"));
So far so good, but I actually want the value in column D to be conditional, i.e.:
// pseudo code
if (col C == "A") then col D = "X"
else if (col C == "B") then col D = "Y"
else col D = "Z"
I'll then drop column C and rename D to C. I've tried looking at the Column functions but nothing appears to fit the bill; I thought of using df1.rdd().map() and iterating over the rows but aside from not actually managing to get it to work, I kind of thought that the whole point of DataFrames was to move away from the RDD abstraction?
Unfortunately I have to do this in Java (and of course Spark with Java is not optimal!!). It seems like I'm missing the obvious and am happy to be shown to be an idiot when presented with the solution!
Accepted answer by Daniel de Paula
I believe you can use when to achieve that. Additionally, you can probably replace the old column directly. For your example, the code would be something like:
import static org.apache.spark.sql.functions.*;

Column newCol = when(col("C").equalTo("A"), "X")
        .when(col("C").equalTo("B"), "Y")
        .otherwise("Z");
DataFrame df2 = df1.withColumn("C", newCol);
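If you would rather keep the original column until the end, as the question describes, a minimal sketch of the add/drop/rename variant (reusing the same newCol expression as above) could look like this:

// Add D from the conditional expression, then drop C and rename D to C.
DataFrame df2 = df1.withColumn("D", newCol)
        .drop("C")
        .withColumnRenamed("D", "C");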
For more details about when, check the Column Javadoc.
Answered by user1128482
Thanks to Daniel, I have resolved this :)
The missing piece was the static import of the SQL functions:
import static org.apache.spark.sql.functions.*;
I must have tried a million different ways of using when, but got compile failures/runtime errors because I didn't do the import. Once imported, Daniel's answer was spot on!
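For reference, if the static import is left out, the same expression can still be written against the fully qualified functions class; a minimal sketch:

import org.apache.spark.sql.Column;
import org.apache.spark.sql.functions;

// Same conditional expression, without the static import.
Column newCol = functions.when(functions.col("C").equalTo("A"), "X")
        .when(functions.col("C").equalTo("B"), "Y")
        .otherwise("Z");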
Answered by sudeepgupta90
You may also use a UDF to do the same job. Just write a simple if/then/else structure:
import org.apache.spark.sql.functions.udf

// A simple if/then/else construct, using the mapping from the question:
val customFunct = udf { d: String =>
  if (d == "A") "X"
  else if (d == "B") "Y"
  else "Z"
}
val new_DF = df.withColumn(column_name, customFunct(df("data_column")))
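Since the question itself is about Java, here is a minimal sketch of the same UDF approach with the Java API on Spark 1.6, assuming an existing SQLContext named sqlContext; the UDF name mapToD is illustrative:

import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.types.DataTypes;
import static org.apache.spark.sql.functions.*;

// Register a UDF that maps values of C to the new conditional values.
sqlContext.udf().register("mapToD", new UDF1<String, String>() {
    @Override
    public String call(String c) {
        if ("A".equals(c)) return "X";
        else if ("B".equals(c)) return "Y";
        else return "Z";
    }
}, DataTypes.StringType);

// Apply it like a built-in function.
DataFrame df2 = df1.withColumn("D", callUDF("mapToD", col("C")));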