SQL 使用 group by 与 distinct 时的巨大性能差异

Question

提问by Martin Dimitrov

I am performing some tests on a HSQLDBserver with a table containing 500 000 entries. The table has no indices. There are 5000 distinct business keys. I need a list of them. Naturally I started with a DISTINCTquery:

我正在HSQLDB一个包含 500 000 个条目的表的服务器上执行一些测试。该表没有索引。有 5000 个不同的业务键。我需要他们的清单。自然地，我从一个DISTINCT查询开始：

SELECT DISTINCT business_key FROM memory WHERE
   concept <> 'case' or 
   attrib <> 'status' or 
   value <> 'closed'

It takes around 90 seconds!!!

大约需要 90 秒！！！

Then I tried using GROUP BY:

然后我尝试使用GROUP BY：

SELECT business_key FROM memory WHERE
       concept <> 'case' or 
       attrib <> 'status' or 
       value <> 'closed'
GROUP BY business_key

And it takes 1 second!!!

而且需要1秒！！！

Trying to figure out the difference I ran EXLAIN PLAN FORbut it seems to give the same information for both queries.

试图找出我运行的差异，EXLAIN PLAN FOR但它似乎为两个查询提供了相同的信息。

EXLAIN PLAN FOR DISTINCT ...

isAggregated=[false]
columns=[
  COLUMN: PUBLIC.MEMORY.BUSINESS_KEY
]
[range variable 1
  join type=INNER
  table=MEMORY
  alias=M
  access=FULL SCAN
  condition = [    index=SYS_IDX_SYS_PK_10057_10058
    other condition=[
    OR arg_left=[
     OR arg_left=[
      NOT_EQUAL arg_left=[
       COLUMN: PUBLIC.MEMORY.CONCEPT] arg_right=[
       VALUE = case, TYPE = CHARACTER]] arg_right=[
      NOT_EQUAL arg_left=[
       COLUMN: PUBLIC.MEMORY.ATTRIB] arg_right=[
       VALUE = status, TYPE = CHARACTER]]] arg_right=[
     NOT_EQUAL arg_left=[
      COLUMN: PUBLIC.MEMORY.VALUE] arg_right=[
      VALUE = closed, TYPE = CHARACTER]]]
  ]
]]
PARAMETERS=[]
SUBQUERIES[]
Object References
PUBLIC.MEMORY
PUBLIC.MEMORY.CONCEPT
PUBLIC.MEMORY.ATTRIB
PUBLIC.MEMORY.VALUE
PUBLIC.MEMORY.BUSINESS_KEY
Read Locks
PUBLIC.MEMORY
WriteLocks

EXLAIN PLAN FOR SELECT ... GROUP BY ...

isDistinctSelect=[false]
isGrouped=[true]
isAggregated=[false]
columns=[
  COLUMN: PUBLIC.MEMORY.BUSINESS_KEY
]
[range variable 1
  join type=INNER
  table=MEMORY
  alias=M
  access=FULL SCAN
  condition = [    index=SYS_IDX_SYS_PK_10057_10058
    other condition=[
    OR arg_left=[
     OR arg_left=[
      NOT_EQUAL arg_left=[
       COLUMN: PUBLIC.MEMORY.CONCEPT] arg_right=[
       VALUE = case, TYPE = CHARACTER]] arg_right=[
      NOT_EQUAL arg_left=[
       COLUMN: PUBLIC.MEMORY.ATTRIB] arg_right=[
       VALUE = status, TYPE = CHARACTER]]] arg_right=[
     NOT_EQUAL arg_left=[
      COLUMN: PUBLIC.MEMORY.VALUE] arg_right=[
      VALUE = closed, TYPE = CHARACTER]]]
  ]
]]
groupColumns=[
COLUMN: PUBLIC.MEMORY.BUSINESS_KEY]
PARAMETERS=[]
SUBQUERIES[]
Object References
PUBLIC.MEMORY
PUBLIC.MEMORY.CONCEPT
PUBLIC.MEMORY.ATTRIB
PUBLIC.MEMORY.VALUE
PUBLIC.MEMORY.BUSINESS_KEY
Read Locks
PUBLIC.MEMORY
WriteLocks

EDIT: I did additional tests. With 500 000 records in HSQLDBwith all distinct business keys, the performance of DISTINCTis now better - 3 seconds, vs GROUP BYwhich took around 9 seconds.

编辑：我做了额外的测试。有 500 000 条记录HSQLDB和所有不同的业务键，DISTINCT现在的性能更好 - 3 秒，而GROUP BY后者需要大约 9 秒。

In MySQLboth queries preform the same:

在MySQL这两个查询中执行相同的：

MySQL: 500 000 rows - 5 000 distinct business keys: Both queries: 0.5 second MySQL: 500 000 rows - all distinct business keys: SELECT DISTINCT ...- 11 seconds SELECT ... GROUP BY business_key- 13 seconds

MySQL：500 000 行 - 5 000 个不同的业务键：两个查询：0.5 秒 MySQL：500 000 行 - 所有不同的业务键： SELECT DISTINCT ...- 11 秒 SELECT ... GROUP BY business_key- 13 秒

So the problem is only related to HSQLDB.

所以问题只与HSQLDB.

I will be very grateful if someone can explain why there is such a drastic difference.

如果有人能解释为什么会有如此巨大的差异，我将不胜感激。

Answer 1

回答by Andomar

The two queries express the same question. Apparently the query optimizer chooses two different execution plans. My guess would be that the distinctapproach is executed like:

这两个查询表达了相同的问题。显然查询优化器选择了两个不同的执行计划。我的猜测是该distinct方法的执行方式如下：

Copy all business_keyvalues to a temporary table
Sort the temporary table
Scan the temporary table, returning each item that is different from the one before it

将所有business_key值复制到临时表
对临时表进行排序
扫描临时表，返回与前一项不同的每一项

The group bycould be executed like:

该group by会等执行：

Scan the full table, storing each value of business keyin a hashtable
Return the keys of the hashtable

扫描全表，将每个值存储business key在一个哈希表中
返回哈希表的键

The first method optimizes for memory usage: it would still perform reasonably well when part of the temporary table has to be swapped out. The second method optimizes for speed, but potentially requires a large amount of memory if there are a lot of different keys.

第一种方法针对内存使用进行了优化：当必须换出部分临时表时，它仍然会表现得相当好。第二种方法针对速度进行了优化，但如果有很多不同的键，则可能需要大量内存。

Since you either have enough memory or few different keys, the second method outperforms the first. It's not unusual to see performance differences of 10x or even 100x between two execution plans.

由于您要么有足够的内存，要么有几个不同的键，所以第二种方法优于第一种方法。在两个执行计划之间看到 10 倍甚至 100 倍的性能差异并不罕见。

SQL 使用 group by 与 distinct 时的巨大性能差异

提问by Martin Dimitrov

回答by Andomar

相关推荐

最近更新

标签

SQL 使用 group by 与 distinct 时的巨大性能差异

提问by Martin Dimitrov

回答by Andomar

相关推荐

想从 sql server 显示 12 个月的名字

SQL Server 查询 - 如果不匹配则返回空值

SQL Server 2012 列标识增量在第 7 个条目上从 6 跳到 1000+

SQL 在 SSIS 包中将字符串转换为十进制值

相关推荐

最近更新

标签