SQL 使用 group by 与 distinct 时的巨大性能差异
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/7943957/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
Huge performance difference when using group by vs distinct
提问by Martin Dimitrov
I am performing some tests on a HSQLDB
server with a table containing 500 000 entries. The table has no indices. There are 5000 distinct business keys. I need a list of them. Naturally I started with a DISTINCT
query:
我正在HSQLDB
一个包含 500 000 个条目的表的服务器上执行一些测试。该表没有索引。有 5000 个不同的业务键。我需要他们的清单。自然地,我从一个DISTINCT
查询开始:
SELECT DISTINCT business_key FROM memory WHERE
concept <> 'case' or
attrib <> 'status' or
value <> 'closed'
It takes around 90 seconds!!!
大约需要 90 秒!!!
Then I tried using GROUP BY
:
然后我尝试使用GROUP BY
:
SELECT business_key FROM memory WHERE
concept <> 'case' or
attrib <> 'status' or
value <> 'closed'
GROUP BY business_key
And it takes 1 second!!!
而且需要1秒!!!
Trying to figure out the difference I ran EXLAIN PLAN FOR
but it seems to give the same information for both queries.
试图找出我运行的差异,EXLAIN PLAN FOR
但它似乎为两个查询提供了相同的信息。
EXLAIN PLAN FOR DISTINCT ...
EXLAIN PLAN FOR DISTINCT ...
isAggregated=[false]
columns=[
COLUMN: PUBLIC.MEMORY.BUSINESS_KEY
]
[range variable 1
join type=INNER
table=MEMORY
alias=M
access=FULL SCAN
condition = [ index=SYS_IDX_SYS_PK_10057_10058
other condition=[
OR arg_left=[
OR arg_left=[
NOT_EQUAL arg_left=[
COLUMN: PUBLIC.MEMORY.CONCEPT] arg_right=[
VALUE = case, TYPE = CHARACTER]] arg_right=[
NOT_EQUAL arg_left=[
COLUMN: PUBLIC.MEMORY.ATTRIB] arg_right=[
VALUE = status, TYPE = CHARACTER]]] arg_right=[
NOT_EQUAL arg_left=[
COLUMN: PUBLIC.MEMORY.VALUE] arg_right=[
VALUE = closed, TYPE = CHARACTER]]]
]
]]
PARAMETERS=[]
SUBQUERIES[]
Object References
PUBLIC.MEMORY
PUBLIC.MEMORY.CONCEPT
PUBLIC.MEMORY.ATTRIB
PUBLIC.MEMORY.VALUE
PUBLIC.MEMORY.BUSINESS_KEY
Read Locks
PUBLIC.MEMORY
WriteLocks
EXLAIN PLAN FOR SELECT ... GROUP BY ...
EXLAIN PLAN FOR SELECT ... GROUP BY ...
isDistinctSelect=[false]
isGrouped=[true]
isAggregated=[false]
columns=[
COLUMN: PUBLIC.MEMORY.BUSINESS_KEY
]
[range variable 1
join type=INNER
table=MEMORY
alias=M
access=FULL SCAN
condition = [ index=SYS_IDX_SYS_PK_10057_10058
other condition=[
OR arg_left=[
OR arg_left=[
NOT_EQUAL arg_left=[
COLUMN: PUBLIC.MEMORY.CONCEPT] arg_right=[
VALUE = case, TYPE = CHARACTER]] arg_right=[
NOT_EQUAL arg_left=[
COLUMN: PUBLIC.MEMORY.ATTRIB] arg_right=[
VALUE = status, TYPE = CHARACTER]]] arg_right=[
NOT_EQUAL arg_left=[
COLUMN: PUBLIC.MEMORY.VALUE] arg_right=[
VALUE = closed, TYPE = CHARACTER]]]
]
]]
groupColumns=[
COLUMN: PUBLIC.MEMORY.BUSINESS_KEY]
PARAMETERS=[]
SUBQUERIES[]
Object References
PUBLIC.MEMORY
PUBLIC.MEMORY.CONCEPT
PUBLIC.MEMORY.ATTRIB
PUBLIC.MEMORY.VALUE
PUBLIC.MEMORY.BUSINESS_KEY
Read Locks
PUBLIC.MEMORY
WriteLocks
EDIT:
I did additional tests. With 500 000 records in HSQLDB
with all distinct business keys, the performance of DISTINCT
is now better - 3 seconds, vs GROUP BY
which took around 9 seconds.
编辑:我做了额外的测试。有 500 000 条记录HSQLDB
和所有不同的业务键,DISTINCT
现在的性能更好 - 3 秒,而GROUP BY
后者需要大约 9 秒。
In MySQL
both queries preform the same:
在MySQL
这两个查询中执行相同的:
MySQL: 500 000 rows - 5 000 distinct business keys:
Both queries: 0.5 second
MySQL: 500 000 rows - all distinct business keys:
SELECT DISTINCT ...
- 11 seconds
SELECT ... GROUP BY business_key
- 13 seconds
MySQL:500 000 行 - 5 000 个不同的业务键:两个查询:0.5 秒 MySQL:500 000 行 - 所有不同的业务键:
SELECT DISTINCT ...
- 11 秒
SELECT ... GROUP BY business_key
- 13 秒
So the problem is only related to HSQLDB
.
所以问题只与HSQLDB
.
I will be very grateful if someone can explain why there is such a drastic difference.
如果有人能解释为什么会有如此巨大的差异,我将不胜感激。
回答by Andomar
The two queries express the same question. Apparently the query optimizer chooses two different execution plans. My guess would be that the distinct
approach is executed like:
这两个查询表达了相同的问题。显然查询优化器选择了两个不同的执行计划。我的猜测是该distinct
方法的执行方式如下:
- Copy all
business_key
values to a temporary table - Sort the temporary table
- Scan the temporary table, returning each item that is different from the one before it
- 将所有
business_key
值复制到临时表 - 对临时表进行排序
- 扫描临时表,返回与前一项不同的每一项
The group by
could be executed like:
该group by
会等执行:
- Scan the full table, storing each value of
business key
in a hashtable - Return the keys of the hashtable
- 扫描全表,将每个值存储
business key
在一个哈希表中 - 返回哈希表的键
The first method optimizes for memory usage: it would still perform reasonably well when part of the temporary table has to be swapped out. The second method optimizes for speed, but potentially requires a large amount of memory if there are a lot of different keys.
第一种方法针对内存使用进行了优化:当必须换出部分临时表时,它仍然会表现得相当好。第二种方法针对速度进行了优化,但如果有很多不同的键,则可能需要大量内存。
Since you either have enough memory or few different keys, the second method outperforms the first. It's not unusual to see performance differences of 10x or even 100x between two execution plans.
由于您要么有足够的内存,要么有几个不同的键,所以第二种方法优于第一种方法。在两个执行计划之间看到 10 倍甚至 100 倍的性能差异并不罕见。