What is the difference between a SQL hash join and a merge join (Oracle RDBMS)?

Notice: this page is a Chinese-English parallel translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Stack Overflow original: http://stackoverflow.com/questions/1111707/

Time: 2020-09-01 02:45:23  Source: igfitidea

What is the difference between a hash join and a merge join (Oracle RDBMS)?

sql, performance, oracle, join

Asked by Andrew Martinez

What are the performance gains/losses between hash joins and merge joins, specifically in Oracle RDBMS?

Answered by David Aldridge

A "sort merge" join is performed by sorting the two data sets to be joined according to the join keys and then merging them together. The merge is very cheap, but the sort can be prohibitively expensive, especially if it spills to disk. The cost of the sort can be lowered if one of the data sets can be accessed in sorted order via an index, although accessing a high proportion of a table's blocks via an index scan can itself be very expensive in comparison to a full table scan.
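
The sort-then-merge idea can be sketched in a few lines of Python. This is only an illustration of the algorithm, not Oracle's implementation: rows are assumed to be in-memory tuples, and the function name `sort_merge_join` and the `key` parameter are made up for the example.

```python
from operator import itemgetter

def sort_merge_join(left, right, key=itemgetter(0)):
    """Equi-join two lists of tuples by sorting both on the join key, then merging."""
    left = sorted(left, key=key)    # the sorts are the expensive part for large inputs
    right = sorted(right, key=key)
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = key(left[i]), key(right[j])
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of right-hand rows that share this key.
            j_end = j
            while j_end < len(right) and key(right[j_end]) == lk:
                j_end += 1
            # Pair every left-hand row with this key against that run.
            while i < len(left) and key(left[i]) == lk:
                for r in right[j:j_end]:
                    out.append(left[i] + r)
                i += 1
            j = j_end
    return out
```

Once both inputs are sorted, the merge loop only moves forward through each list, which is why the merge itself is cheap compared with the sorts.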

A hash join is performed by hashing one data set into memory based on the join columns, then reading the other data set and probing the hash table for matches. The hash join is very low cost when the hash table can be held entirely in memory, with the total cost amounting to very little more than the cost of reading the data sets. The cost rises if the hash table has to be spilled to disk and the join completes in one pass, and rises considerably if it needs multiple passes.
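
A minimal in-memory sketch of the build and probe phases, assuming the build side fits in memory (the cheap case described above). The names `hash_join`, `build_key` and `probe_key` are illustrative and do not correspond to anything in Oracle.

```python
from collections import defaultdict

def hash_join(build_rows, probe_rows, build_key, probe_key):
    """Equi-join: hash the (ideally smaller) build input, then probe it once per row."""
    # Build phase: one pass over the build input.
    table = defaultdict(list)
    for row in build_rows:
        table[build_key(row)].append(row)
    # Probe phase: one pass over the other input, one lookup per row.
    for row in probe_rows:
        for match in table.get(probe_key(row), ()):
            yield match + row
```

For example, `list(hash_join(depts, emps, lambda d: d[0], lambda e: e[1]))` joins a small list of department tuples to a larger list of employee tuples while reading each input exactly once.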

(In pre-10g, outer joins from a large to a small table were problematic performance-wise, as the optimiser could not resolve the need to access the smaller table first for a hash join, but the larger table first for an outer join. Consequently hash joins were not available in this situation).

The cost of a hash join can be reduced by partitioning both tables on the join key(s). This allows the optimiser to infer that rows from a partition in one table will only find a match in a particular partition of the other table, and for tables having n partitions the hash join is executed as n independent hash joins. This has the following effects:

  1. The size of each hash table is reduced, hence reducing the maximum amount of memory required and potentially removing the need for temporary disk space.
  2. For parallel query operations the amount of inter-process messaging is vastly reduced, reducing CPU usage and improving performance, as each hash join can be performed by one pair of PQ processes.
  3. For non-parallel query operations the memory requirement is reduced by a factor of n, and the first rows are projected from the query earlier.
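
The effect of partitioning on the join key can be sketched as follows. This is a simplification: in Oracle the tables are already partitioned in storage, whereas this example partitions in-memory lists at run time with Python's built-in `hash`. All function names here are hypothetical.

```python
from collections import defaultdict

def partition(rows, key, n):
    """Split rows into n buckets by hashing the join key."""
    parts = [[] for _ in range(n)]
    for row in rows:
        parts[hash(key(row)) % n].append(row)
    return parts

def small_hash_join(build_rows, probe_rows, build_key, probe_key):
    """Plain in-memory hash join over one pair of partitions."""
    table = defaultdict(list)
    for row in build_rows:
        table[build_key(row)].append(row)
    return [b + p for p in probe_rows for b in table.get(probe_key(p), ())]

def partition_wise_join(left, right, left_key, right_key, n=4):
    """Run n independent hash joins, one per pair of matching partitions."""
    left_parts = partition(left, left_key, n)
    right_parts = partition(right, right_key, n)
    out = []
    for lp, rp in zip(left_parts, right_parts):
        # Rows in partition i of one side can only match partition i of the
        # other side, so each hash table holds roughly 1/n of the build rows.
        out.extend(small_hash_join(lp, rp, left_key, right_key))
    return out
```

Because each pair of partitions is joined independently, only one small hash table needs to exist at a time in the serial case, and the pairs could be handed to separate workers in the parallel case.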

You should note that hash joins can only be used for equi-joins, whereas merge joins are more flexible: they can also handle inequality conditions such as <, <=, > and >=.
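
To see why sorted inputs are more flexible, consider a join on an inequality condition. A hash table can only answer "which rows have exactly this key", while two sorted lists can be walked to answer range conditions. The sketch below joins on `l <= r` and is a hypothetical illustration over plain comparable values, not a description of how Oracle executes such a join.

```python
def merge_join_le(left, right):
    """Return all (l, r) pairs with l <= r from two lists of comparable values."""
    left = sorted(left)
    right = sorted(right)
    out = []
    j = 0
    for l in left:
        # Advance to the first right-hand value that is >= l. Because left is
        # ascending, j never has to move backwards.
        while j < len(right) and right[j] < l:
            j += 1
        # Everything from position j onwards satisfies l <= r.
        out.extend((l, r) for r in right[j:])
    return out
```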

In general, if you are joining large amounts of data in an equi-join then a hash join is going to be a better bet.

This topic is very well covered in the documentation.

http://download.oracle.com/docs/cd/B28359_01/server.111/b28274/optimops.htm#i51523

12.1 docs: https://docs.oracle.com/database/121/TGSQL/tgsql_join.htm

Answered by Spence

Edit, for posterity: the Oracle tag had not been added when I answered this question, so my response is more applicable to MS SQL.

Merge join is the best possible when the inputs are already ordered, as it exploits that ordering and needs only a single pass down each table to do the join. If you have two tables (or covering indexes) with the same ordering, such as a primary key on one table and an index over that key on the other, then joining them on that key would result in a merge join.

Hash join is the next best. It is usually chosen when one table has a (relatively) small number of rows: it effectively creates a temporary table holding a hash for each row of that table, which is then probed repeatedly to create the join.

The worst case is a nested loop, which is O(n * m): there is no ordering or size to exploit, so the join is simply, for each row in table x, a search of table y for matching rows.
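
For comparison, a naive nested loop join looks like the sketch below. It assumes no index on either input, which is the worst case described above. The comparison counter is there only to make the n * m behaviour visible, and the function name is illustrative.

```python
def nested_loop_join(x_rows, y_rows, predicate):
    """For each row in x, scan all of y and keep the pairs that satisfy predicate."""
    out = []
    comparisons = 0
    for x in x_rows:
        for y in y_rows:
            comparisons += 1  # one predicate evaluation per (x, y) pair
            if predicate(x, y):
                out.append(x + y)
    return out, comparisons
```

With 3 rows in x and 4 rows in y the predicate is evaluated 12 times regardless of how many rows actually match.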
