Add a column from one DataFrame to another DataFrame in Scala
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same CC BY-SA license and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/47028442/
Asked by Yousuf Zaman
I have two DataFrames with the same number of rows, but the number of columns differs and is dynamic depending on the source.
The first DataFrame contains all columns; the second has been filtered and processed and is missing some of them.
I need to pick a specific column from the first DataFrame and add/merge it into the second DataFrame.
val sourceDf = spark.read.load(parquetFilePath)
val resultDf = spark.read.load(resultFilePath)
val columnName: String = "Col1"
I tried to add it in several ways; here I am just giving a few:
val modifiedResult = resultDf.withColumn(columnName, sourceDf.col(columnName))
val modifiedResult = resultDf.withColumn(columnName, sourceDf(columnName))
val modifiedResult = resultDf.withColumn(columnName, labelColumnUdf(sourceDf.col(columnName)))
None of these are working.
Can you please help me merge/add the column from the 1st DataFrame into the 2nd DataFrame?
The given example is not the exact data structure that I need, but it will serve to resolve this issue.
Sample input/output:
Source DataFrame:
+--------+
|InputGas|
+--------+
|    1000|
|    2000|
|    3000|
|    4000|
+--------+
Result DataFrame:
+----+-------+-----+
|Time|CalcGas|Speed|
+----+-------+-----+
|   0|    111| 1111|
|   0|    222| 2222|
|   1|    333| 3333|
|   2|    444| 4444|
+----+-------+-----+
Expected Output:
+----+-------+-----+--------+
|Time|CalcGas|Speed|InputGas|
+----+-------+-----+--------+
|   0|    111| 1111|    1000|
|   0|    222| 2222|    2000|
|   1|    333| 3333|    3000|
|   2|    444| 4444|    4000|
+----+-------+-----+--------+
Answered by Prasad Khode
One way to achieve this is by using a join.
If you have a common column in both DataFrames, you can perform a join on that column to get your desired result.
Example:
import sparkSession.sqlContext.implicits._
val df1 = Seq((1, "Anu"),(2, "Suresh"),(3, "Usha"), (4, "Nisha")).toDF("id","name")
val df2 = Seq((1, 23),(2, 24),(3, 24), (4, 25), (5, 30), (6, 32)).toDF("id","age")
val df = df1.as("df1")
  .join(df2.as("df2"), df1("id") === df2("id"))
  .select("df1.id", "df1.name", "df2.age")
df.show()
Output:
+---+------+---+
| id|  name|age|
+---+------+---+
|  1|   Anu| 23|
|  2|Suresh| 24|
|  3|  Usha| 24|
|  4| Nisha| 25|
+---+------+---+
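As a side note (a variant not in the original answer), passing the join key as a Seq of column names keeps a single id column in the result, so the aliases and the explicit select become unnecessary:

// inner join on "id"; the result contains one id column plus name and age
val df = df1.join(df2, Seq("id"))
df.show()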
Update:
If you don't have a unique id common to both DataFrames, create one and use it.
import sparkSession.sqlContext.implicits._
import org.apache.spark.sql.functions._
var sourceDf = Seq(1000, 2000, 3000, 4000).toDF("InputGas")
var resultDf = Seq((0, 111, 1111), (0, 222, 2222), (1, 333, 3333), (2, 444, 4444)).toDF("Time", "CalcGas", "Speed")
sourceDf = sourceDf.withColumn("rowId1", monotonically_increasing_id())
resultDf = resultDf.withColumn("rowId2", monotonically_increasing_id())
val df = sourceDf.as("df1")
  .join(resultDf.as("df2"), sourceDf("rowId1") === resultDf("rowId2"), "inner")
  .select("df1.InputGas", "df2.Time", "df2.CalcGas", "df2.Speed")
df.show()
Output:
+--------+----+-------+-----+
|InputGas|Time|CalcGas|Speed|
+--------+----+-------+-----+
| 1000| 0| 111| 1111|
| 2000| 0| 222| 2222|
| 3000| 1| 333| 3333|
| 4000| 2| 444| 4444|
+--------+----+-------+-----+
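One caveat: monotonically_increasing_id() only guarantees ids that are unique and increasing, not that both DataFrames receive the same ids, so this pairing can break when the two DataFrames are partitioned differently. Below is a sketch of a more robust variant (not from the original answer; withRowIndex is a hypothetical helper) that uses RDD zipWithIndex, which assigns consecutive indices 0..n-1 regardless of partitioning:

import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// appends a consecutive, 0-based row-index column to any DataFrame
def withRowIndex(df: DataFrame, idxCol: String): DataFrame = {
  val indexed = df.rdd.zipWithIndex.map { case (row, idx) => Row.fromSeq(row.toSeq :+ idx) }
  df.sparkSession.createDataFrame(indexed, StructType(df.schema.fields :+ StructField(idxCol, LongType)))
}

// assumes sourceDf and resultDf as first defined above, before rowId1/rowId2 were added
val robustDf = withRowIndex(resultDf, "rowId")
  .join(withRowIndex(sourceDf, "rowId"), "rowId")
  .drop("rowId")
robustDf.show()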

