Add a column from one DataFrame to another DataFrame in Scala
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same CC BY-SA license and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/47028442/
Asked by Yousuf Zaman
I have two DataFrames with the same number of rows, but the number of columns differs and is dynamic depending on the source.
The first DataFrame contains all columns; the second has been filtered and processed and is missing some of them.
I need to pick a specific column from the first DataFrame and add/merge it into the second DataFrame.
val sourceDf = spark.read.load(parquetFilePath)
val resultDf = spark.read.load(resultFilePath)
val columnName: String = "Col1"
I tried to add it in several ways; here I am just giving a few:
val modifiedResult = resultDf.withColumn(columnName, sourceDf.col(columnName))
val modifiedResult = resultDf.withColumn(columnName, sourceDf(columnName))
val modifiedResult = resultDf.withColumn(columnName, labelColumnUdf(sourceDf.col(columnName)))
None of these are working.
Can you please help me merge/add the column from the 1st DataFrame into the 2nd DataFrame?
The given example is not the exact data structure that I need, but it will serve to resolve this issue.
Sample input/output:
Source DataFrame:
+--------+
|InputGas|
+--------+
|    1000|
|    2000|
|    3000|
|    4000|
+--------+
Result DataFrame:
+----+-------+-----+
|Time|CalcGas|Speed|
+----+-------+-----+
|   0|    111| 1111|
|   0|    222| 2222|
|   1|    333| 3333|
|   2|    444| 4444|
+----+-------+-----+
Expected Output:
+----+-------+-----+--------+
|Time|CalcGas|Speed|InputGas|
+----+-------+-----+--------+
|   0|    111| 1111|    1000|
|   0|    222| 2222|    2000|
|   1|    333| 3333|    3000|
|   2|    444| 4444|    4000|
+----+-------+-----+--------+
Answered by Prasad Khode
One way to achieve this is by using a join.
If you have a common column in both DataFrames, you can perform a join on that column to get your desired result.
Example:
import sparkSession.sqlContext.implicits._
val df1 = Seq((1, "Anu"),(2, "Suresh"),(3, "Usha"), (4, "Nisha")).toDF("id","name")
val df2 = Seq((1, 23),(2, 24),(3, 24), (4, 25), (5, 30), (6, 32)).toDF("id","age")
val df = df1.as("df1")
  .join(df2.as("df2"), df1("id") === df2("id"))
  .select("df1.id", "df1.name", "df2.age")
df.show()
Output:
+---+------+---+
| id|  name|age|
+---+------+---+
|  1|   Anu| 23|
|  2|Suresh| 24|
|  3|  Usha| 24|
|  4| Nisha| 25|
+---+------+---+
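As a side note (a variant not in the original answer), passing the join key as a Seq of column names keeps a single id column in the result, so the aliases and the explicit select become unnecessary:

// inner join on "id"; the result contains one id column plus name and age
val df = df1.join(df2, Seq("id"))
df.show()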
Update:
If you don't have a unique id common to both DataFrames, create one and use it.
import sparkSession.sqlContext.implicits._
import org.apache.spark.sql.functions._
var sourceDf = Seq(1000, 2000, 3000, 4000).toDF("InputGas")
var resultDf = Seq((0, 111, 1111), (0, 222, 2222), (1, 333, 3333), (2, 444, 4444)).toDF("Time", "CalcGas", "Speed")
sourceDf = sourceDf.withColumn("rowId1", monotonically_increasing_id())
resultDf = resultDf.withColumn("rowId2", monotonically_increasing_id())
val df = sourceDf.as("df1")
  .join(resultDf.as("df2"), sourceDf("rowId1") === resultDf("rowId2"), "inner")
  .select("df1.InputGas", "df2.Time", "df2.CalcGas", "df2.Speed")
df.show()
Output:
+--------+----+-------+-----+
|InputGas|Time|CalcGas|Speed|
+--------+----+-------+-----+
| 1000| 0| 111| 1111|
| 2000| 0| 222| 2222|
| 3000| 1| 333| 3333|
| 4000| 2| 444| 4444|
+--------+----+-------+-----+
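One caveat: monotonically_increasing_id() only guarantees ids that are unique and increasing, not that both DataFrames receive the same ids, so this pairing can break when the two DataFrames are partitioned differently. Below is a sketch of a more robust variant (not from the original answer; withRowIndex is a hypothetical helper) that uses RDD zipWithIndex, which assigns consecutive indices 0..n-1 regardless of partitioning:

import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// appends a consecutive, 0-based row-index column to any DataFrame
def withRowIndex(df: DataFrame, idxCol: String): DataFrame = {
  val indexed = df.rdd.zipWithIndex.map { case (row, idx) => Row.fromSeq(row.toSeq :+ idx) }
  df.sparkSession.createDataFrame(indexed, StructType(df.schema.fields :+ StructField(idxCol, LongType)))
}

// assumes sourceDf and resultDf as first defined above, before rowId1/rowId2 were added
val robustDf = withRowIndex(resultDf, "rowId")
  .join(withRowIndex(sourceDf, "rowId"), "rowId")
  .drop("rowId")
robustDf.show()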

