Note: this content is provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/47844808/

What are the differences between saveAsTable and insertInto in different SaveMode(s)?

Tags: scala, apache-spark, apache-spark-sql

Asked by y2k-shubham

I'm trying to write a DataFrame into a Hive table (on S3) in Overwrite mode (necessary for my application) and need to decide between the two methods of DataFrameWriter (Spark / Scala). From what I can read in the documentation, df.write.saveAsTable differs from df.write.insertInto in the following respects:

  • saveAsTable uses column-name based resolution while insertInto uses position-based resolution (see the sketch after this list)
  • In Append mode, saveAsTable pays more attention to the underlying schema of the existing table to make certain resolutions
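
A quick sketch of the two resolution strategies (a hypothetical spark-shell session; the table name and its columns are made up for illustration):

// Table with columns (a, b), both LONG; the DataFrame lists them in reverse order.
spark.sql("CREATE TABLE t (a LONG, b LONG) USING parquet")
val df = Seq((10L, 20L)).toDF("b", "a")    // b = 10, a = 20

df.write.mode("append").saveAsTable("t")   // by NAME:     a -> a, b -> b    => stores (a=20, b=10)
df.write.insertInto("t")                   // by POSITION: 1st -> a, 2nd -> b => stores (a=10, b=20)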

Overall, it gives me the impression that saveAsTable is just a smarter version of insertInto. Alternatively, depending on the use-case, one might prefer insertInto.

But does each of these methods come with some caveats of its own, like a performance penalty in the case of saveAsTable (since it packs in more features)? Are there any other differences in their behaviours apart from what is told (not very clearly) in the docs?

EDIT-1

The documentation says this regarding insertInto:

Inserts the content of the DataFrame to the specified table

and this for saveAsTable:

In the case the table already exists, behavior of this function depends on the save mode, specified by the mode function

Now I can list my doubts:

  • Does insertInto always expect the table to exist?
  • Do SaveModes have any impact on insertInto?
  • If the above answer is yes, then
    • what's the difference between saveAsTable with SaveMode.Append and insertInto, given that the table already exists?
    • does insertInto with SaveMode.Overwrite make any sense?

Answered by Jacek Laskowski

DISCLAIMER: I've been exploring insertInto for some time and although I'm far from an expert in this area I'm sharing the findings for the greater good.

Does insertInto always expect the table to exist?

Yes (per the table name and the database).

(根据表名和数据库)。

Moreover, not all tables can be inserted into, i.e. a (permanent) table, a temporary view or a temporary global view are fine, but not:

  1. a bucketed table (see the sketch after this list)

  2. an RDD-based table
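
For instance, a minimal sketch of the bucketed case (the table name is hypothetical, and the exact error message varies across Spark versions):

// Create a bucketed table, then try to insert into it.
spark.range(3).write.bucketBy(4, "id").saveAsTable("bucketed_t")
spark.range(3).write.insertInto("bucketed_t")
// => throws org.apache.spark.sql.AnalysisException: inserting into a bucketed table is not supported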

Do SaveModes have any impact on insertInto?

(That's recently been my question, too!)

Yes, but only SaveMode.Overwrite. Once you think about insertInto, the other 3 save modes don't make much sense (as it simply inserts a dataset).

what's the differences between saveAsTable with SaveMode.Append and insertInto given that table already exists?

That's a very good question! I'd say none, but let's see by just one example (hoping that proves something).

scala> spark.version
res13: String = 2.4.0-SNAPSHOT

scala> sql("create table my_table (id long)")

scala> spark.range(3).write.mode("append").saveAsTable("my_table")
org.apache.spark.sql.AnalysisException: The format of the existing table default.my_table is `HiveFileFormat`. It doesn't match the specified format `ParquetFileFormat`.;
  at org.apache.spark.sql.execution.datasources.PreprocessTableCreation$$anonfun$apply.applyOrElse(rules.scala:117)
  at org.apache.spark.sql.execution.datasources.PreprocessTableCreation$$anonfun$apply.applyOrElse(rules.scala:76)
...

scala> spark.range(3).write.insertInto("my_table")

scala> spark.table("my_table").show
+---+
| id|
+---+
|  2|
|  0|
|  1|
+---+

does insertInto with SaveMode.Overwrite make any sense?

I think so, given that it pays so much attention to SaveMode.Overwrite. It simply re-creates the target table.

scala> spark.range(3).write.mode("overwrite").insertInto("my_table")

scala> spark.table("my_table").show
+---+
| id|
+---+
|  1|
|  0|
|  2|
+---+

scala> Seq(100, 200, 300).toDF.write.mode("overwrite").insertInto("my_table")

scala> spark.table("my_table").show
+---+
| id|
+---+
|200|
|100|
|300|
+---+

Answered by Sandeep Khot

I want to point out a major difference between saveAsTable and insertInto in Spark.

In a partitioned table, the overwrite SaveMode works differently in the case of saveAsTable and insertInto.

Consider the example below, where I am creating a partitioned table using the saveAsTable method.

hive> CREATE TABLE `db.companies_table`(`company` string) PARTITIONED BY ( `id` date);
OK
Time taken: 0.094 seconds

import org.apache.spark.sql._
import spark.implicits._

scala> val targetTable = "db.companies_table"

scala> val companiesDF = Seq(("2020-01-01", "Company1"), ("2020-01-02", "Company2")).toDF("id", "company")

scala> companiesDF.write.mode(SaveMode.Overwrite).partitionBy("id").saveAsTable(targetTable)

scala> spark.sql("select * from db.companies_table").show()
+--------+----------+
| company|        id|
+--------+----------+
|Company1|2020-01-01|
|Company2|2020-01-02|
+--------+----------+

Now I am adding 2 new rows with 2 new partition values.

scala> val companiesDF = Seq(("2020-01-03", "Company1"), ("2020-01-04", "Company2")).toDF("id", "company")

scala> companiesDF.write.mode(SaveMode.Append).partitionBy("id").saveAsTable(targetTable)

scala> spark.sql("select * from db.companies_table").show()
+--------+----------+
| company|        id|
+--------+----------+
|Company1|2020-01-01|
|Company2|2020-01-02|
|Company1|2020-01-03|
|Company2|2020-01-04|
+--------+----------+

As you can see, 2 new rows were added to the table.

Now let's say I want to overwrite the data of partition 2020-01-02.

scala> val companiesDF = Seq(("2020-01-02", "Company5")).toDF("id", "company")

scala> companiesDF.write.mode(SaveMode.Overwrite).partitionBy("id").saveAsTable(targetTable)

As per our logic, only partition 2020-01-02 should be overwritten, but the case with saveAsTable is different. It overwrites the entire table, as you can see below.

scala> spark.sql("select * from db.companies_table").show()
+--------+----------+
| company|        id|
+--------+----------+
|Company5|2020-01-02|
+--------+----------+

So if we want to overwrite only certain partitions in the table, it's not possible using saveAsTable. A dynamic partition overwrite with insertInto can do this instead, as sketched below.

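A minimal sketch of that workaround (it assumes Spark 2.3+, where the spark.sql.sources.partitionOverwriteMode setting exists; Hive-format tables may additionally need Hive's dynamic-partition settings):

// Overwrite ONLY the partitions present in the DataFrame, keeping all others.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

// insertInto is position-based: data columns first, the partition column (id) last,
// matching the physical layout of the table.
val patchDF = Seq(("Company5", "2020-01-02")).toDF("company", "id")
patchDF.write.mode(SaveMode.Overwrite).insertInto("db.companies_table")
// Only partition id=2020-01-02 is rewritten; 2020-01-01, 2020-01-03 and 2020-01-04 survive.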

Refer to this link for more details: https://towardsdatascience.com/understanding-the-spark-insertinto-function-1870175c3ee9

Answered by anshuman sharma

Another important point that I consider while inserting data into an EXISTING Hive dynamic partitioned table from Spark 2.x:

df.write.mode("append").insertInto("dbName.tableName")

The above command will intrinsically map the data in your "df" and append only new partitions to the existing table.

Hopefully this adds another point in deciding when to use "insertInto".

Answered by ZeroDecibels

I recently started converting my Hive Scripts to Spark and I am still learning.

There is one important behavior I noticed with saveAsTable and insertInto which has not been discussed.

df.write.mode("overwrite").saveAsTable("schema.table") drops the existing table "schema.table" and recreates a new table based on the 'df' schema. The schema of the existing table becomes irrelevant and does not have to match with df. I got bitten by this behavior since my existing table was ORC and the new table created was Parquet (the Spark default).

df.write.mode("overwrite").insertInto("schema.table") does not drop the existing table and expects the schema of the existing table to match with the schema of 'df'.

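A minimal sketch of the mismatch case (table and column names are hypothetical; the exact exception text varies by version):

// The table has one column; the DataFrame has two, so insertInto is rejected.
spark.sql("CREATE TABLE one_col (id LONG) USING parquet")
Seq((1L, "x")).toDF("id", "extra").write.mode("overwrite").insertInto("one_col")
// => throws org.apache.spark.sql.AnalysisException: the data to be inserted
//    must have the same number of columns as the target table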

I checked the Create Time for the table using both options (one way to do this is sketched below) and reaffirmed the behavior.

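One way to inspect this (a sketch; DESCRIBE FORMATTED is standard Spark SQL, and "schema.table" is the same placeholder name as above):

// Look for the "Created Time" and storage-format rows in the output.
spark.sql("DESCRIBE FORMATTED schema.table").show(100, truncate = false)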

Original Table stored as ORC - Wed Sep 04 21:27:33 GMT 2019

After saveAsTable (storage changed to Parquet) - Wed Sep 04 21:56:23 GMT 2019 (Create Time changed)

Dropped and recreated original table (ORC) - Wed Sep 04 21:57:38 GMT 2019

After insertInto (still ORC) - Wed Sep 04 21:57:38 GMT 2019 (Create Time not changed)
