spark - scala - save dataframe to a table with overwrite mode
Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverflow
Original link: http://stackoverflow.com/questions/46474476/
Asked by Uday Sagar
I would like to know what exactly "overwrite" does here. Let's say I have a table "tb1" with the following records (sorry for the bad representation of tables):
driver vin make model
martin abc ford escape
john abd toyota camry
amy abe chevrolet malibu
carlos abf honda civic
Now I have the following dataframe (mydf) with the same columns but with the following rows/data:
martin abf toyota corolla
carlos abg nissan versa
After saving the above dataframe to "tb1" in overwrite mode, will it entirely delete the contents of "tb1" and write only the data of mydf (the above two records)?
However, I would like overwrite mode to overwrite only those rows that have the same value in the "driver" column. In that case, of the 4 records in "tb1", mydf would overwrite only the 2 records above, and the resultant table would be as follows:
driver vin make model
martin abf toyota corolla
john abd toyota camry
amy abe chevrolet malibu
carlos abg nissan versa
Can I achieve this functionality using overwrite mode?
mydf.write.mode(SaveMode.Overwrite).saveAsTable("tb1")
Answered by Avishek Bhattacharya
What you mean is to merge the two dataframes on the primary key: replace the old rows with the new rows, and append any extra rows that are present.
This can't be achieved with SaveMode.Overwrite or SaveMode.Append.
To do this you need to implement merge functionality for the two dataframes on the primary key.
Something like this:
val parentDF = // actual dataframe
val deltaDF = // new delta to be merged

// register both dataframes as temp views so they can be referenced in SQL
parentDF.createOrReplaceTempView("parentDF")
deltaDF.createOrReplaceTempView("deltaDF")

// rows of parentDF whose primary key also appears in deltaDF (the stale rows)
val updateDF = spark.sql("select parentDF.* from parentDF join deltaDF on parentDF.id = deltaDF.id")

// drop the stale rows, then append the fresh ones
val totalDF = parentDF.except(updateDF).union(deltaDF)
totalDF.write.mode(SaveMode.Overwrite).saveAsTable("tb1")
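Since spinning up a Spark session just to see the semantics is heavyweight, the except/union merge above can be sketched with plain Scala collections. This is an illustration only: the `Car` case class, the `mergeOnDriver` helper, and the sample rows are stand-ins mirroring the question's table, not part of the original answer.

```scala
// A row of the "tb1" table from the question; "driver" is the primary key.
case class Car(driver: String, vin: String, make: String, model: String)

object MergeExample {
  // Collection analogue of parentDF.except(updateDF).union(deltaDF):
  // drop parent rows whose key appears in the delta, then append the delta.
  def mergeOnDriver(parent: Seq[Car], delta: Seq[Car]): Seq[Car] = {
    val deltaKeys = delta.map(_.driver).toSet
    parent.filterNot(c => deltaKeys.contains(c.driver)) ++ delta
  }

  val parent = Seq(
    Car("martin", "abc", "ford", "escape"),
    Car("john", "abd", "toyota", "camry"),
    Car("amy", "abe", "chevrolet", "malibu"),
    Car("carlos", "abf", "honda", "civic")
  )

  val delta = Seq(
    Car("martin", "abf", "toyota", "corolla"),
    Car("carlos", "abg", "nissan", "versa")
  )

  val merged: Seq[Car] = mergeOnDriver(parent, delta)

  def main(args: Array[String]): Unit = merged.foreach(println)
}
```

Running it leaves john and amy untouched while martin and carlos are replaced by their delta versions, which is exactly the resulting table the question asks for.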
Answered by Erik Barajas
Answering your question:
Can I achieve this functionality using overwrite mode?
No, you can't.
In practice, what Overwrite does is delete the entire table you want to populate and create it again, but now with the new DataFrame that you are giving it.
To get the result you want, you would do the following:
- Save the information of the table to update into a new DataFrame:
val dfTable = hiveContext.read.table("table_tb1")
- Do a Left Join between the DF of the table to update (dfTable) and the DF with your new information (mydf), crossing by your "PK", which in your case is the driver column.
In the same sentence, filter the records where the mydf("driver") column is null; those are the rows with no match, for which there is no update.
val newDf = dfTable.join(mydf, dfTable("driver") === mydf("driver"), "leftouter").filter(mydf("driver").isNull)
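The leftouter join followed by the isNull filter is effectively a left anti join (in Spark 2.x and later it can also be written directly with the "left_anti" join type). A minimal plain-Scala sketch of that filtering step, with (driver, model) pairs standing in for full rows; the names and data here are illustrative, not from the original answer:

```scala
object LeftAntiSketch {
  // (driver, model) pairs stand in for full table rows
  val table = Seq(
    "martin" -> "escape", "john" -> "camry",
    "amy" -> "malibu", "carlos" -> "civic"
  )
  val updates = Seq("martin" -> "corolla", "carlos" -> "versa")

  // Left outer join: every table row paired with its match in updates, if any
  val joined: Seq[((String, String), Option[(String, String)])] =
    table.map(row => row -> updates.find(_._1 == row._1))

  // Keeping only rows where the right side is None ("is null") yields the
  // rows that need no update -- exactly a left anti join.
  val unchanged: Seq[(String, String)] =
    joined.collect { case (row, None) => row }

  def main(args: Array[String]): Unit = unchanged.foreach(println)
}
```

Only john and amy survive the filter, matching the "info with no changes" rows that are re-inserted in the final step.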
- After that, truncate your table tb1 and insert both DataFrames, newDf and mydf:
newDf.write.mode(SaveMode.Append).insertInto("table_tb1") /** Info with no changes */
mydf.write.mode(SaveMode.Append).insertInto("table_tb1") /** Info updated */
In that way, you can get the result you are looking for.
Regards.

