SQL 性能:SELECT DISTINCT 与 GROUP BY

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/13956768/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-09-01 12:42:59  来源:igfitidea点击:

SQL Performance: SELECT DISTINCT versus GROUP BY

sqlperformanceoraclegroup-bydistinct

提问by woemler

I have been trying to improve query times for an existing Oracle database-driven application that has been running a little sluggish. The application executes several large queries, such as the one below, which can take over an hour to run. Replacing the DISTINCTwith a GROUP BYclause in the query below shrank execution time from 100 minutes to 10 seconds. My understanding was that SELECT DISTINCTand GROUP BYoperated in pretty much the same way. Why such a huge disparity between execution times? What is the difference in how the query is executed on the back-end? Is there ever a situation where SELECT DISTINCTruns faster?

我一直在尝试改进运行缓慢的现有 Oracle 数据库驱动应用程序的查询时间。该应用程序执行多个大型查询,例如下面的查询,运行时间可能需要一个多小时。替换下面查询中的DISTINCTwithGROUP BY子句将执行时间从 100 分钟缩短到 10 秒。我的理解是,SELECT DISTINCT并且GROUP BY以几乎相同的方式运作。为什么执行时间之间存在如此巨大的差异?在后端执行查询的方式有什么不同?有没有SELECT DISTINCT跑得更快的情况?

Note: In the following query, WHERE TASK_INVENTORY_STEP.STEP_TYPE = 'TYPE A'represents just one of a number of ways that results can be filtered. This example was provided to show the reasoning for joining all of the tables that do not have columns included in the SELECTand would result in about a tenth of all available data

注意:在以下查询中,WHERE TASK_INVENTORY_STEP.STEP_TYPE = 'TYPE A'仅表示可以过滤结果的多种方式之一。提供此示例是为了说明连接所有没有包含在 中的列的表的推理,SELECT并且将导致所有可用数据的大约十分之一

SQL using DISTINCT:

SQL 使用DISTINCT

SELECT DISTINCT 
    ITEMS.ITEM_ID,
    ITEMS.ITEM_CODE,
    ITEMS.ITEMTYPE,
    ITEM_TRANSACTIONS.STATUS,
    (SELECT COUNT(PKID) 
        FROM ITEM_PARENTS 
        WHERE PARENT_ITEM_ID = ITEMS.ITEM_ID
        ) AS CHILD_COUNT
FROM
    ITEMS
    INNER JOIN ITEM_TRANSACTIONS 
        ON ITEMS.ITEM_ID = ITEM_TRANSACTIONS.ITEM_ID 
        AND ITEM_TRANSACTIONS.FLAG = 1
    LEFT OUTER JOIN ITEM_METADATA 
        ON ITEMS.ITEM_ID = ITEM_METADATA.ITEM_ID
    LEFT OUTER JOIN JOB_INVENTORY 
        ON ITEMS.ITEM_ID = JOB_INVENTORY.ITEM_ID     
    LEFT OUTER JOIN JOB_TASK_INVENTORY 
        ON JOB_INVENTORY.JOB_ITEM_ID = JOB_TASK_INVENTORY.JOB_ITEM_ID
    LEFT OUTER JOIN JOB_TASKS 
        ON JOB_TASK_INVENTORY.TASKID = JOB_TASKS.TASKID                              
    LEFT OUTER JOIN JOBS 
        ON JOB_TASKS.JOB_ID = JOBS.JOB_ID
    LEFT OUTER JOIN TASK_INVENTORY_STEP 
        ON JOB_INVENTORY.JOB_ITEM_ID = TASK_INVENTORY_STEP.JOB_ITEM_ID 
    LEFT OUTER JOIN TASK_STEP_INFORMATION 
        ON TASK_INVENTORY_STEP.JOB_ITEM_ID = TASK_STEP_INFORMATION.JOB_ITEM_ID
WHERE 
    TASK_INVENTORY_STEP.STEP_TYPE = 'TYPE A'
ORDER BY 
    ITEMS.ITEM_CODE

SQL using GROUP BY:

SQL 使用GROUP BY

SELECT
    ITEMS.ITEM_ID,
    ITEMS.ITEM_CODE,
    ITEMS.ITEMTYPE,
    ITEM_TRANSACTIONS.STATUS,
    (SELECT COUNT(PKID) 
        FROM ITEM_PARENTS 
        WHERE PARENT_ITEM_ID = ITEMS.ITEM_ID
        ) AS CHILD_COUNT
FROM
    ITEMS
    INNER JOIN ITEM_TRANSACTIONS 
        ON ITEMS.ITEM_ID = ITEM_TRANSACTIONS.ITEM_ID 
        AND ITEM_TRANSACTIONS.FLAG = 1
    LEFT OUTER JOIN ITEM_METADATA 
        ON ITEMS.ITEM_ID = ITEM_METADATA.ITEM_ID
    LEFT OUTER JOIN JOB_INVENTORY 
        ON ITEMS.ITEM_ID = JOB_INVENTORY.ITEM_ID     
    LEFT OUTER JOIN JOB_TASK_INVENTORY 
        ON JOB_INVENTORY.JOB_ITEM_ID = JOB_TASK_INVENTORY.JOB_ITEM_ID
    LEFT OUTER JOIN JOB_TASKS 
        ON JOB_TASK_INVENTORY.TASKID = JOB_TASKS.TASKID                              
    LEFT OUTER JOIN JOBS 
        ON JOB_TASKS.JOB_ID = JOBS.JOB_ID
    LEFT OUTER JOIN TASK_INVENTORY_STEP 
        ON JOB_INVENTORY.JOB_ITEM_ID = TASK_INVENTORY_STEP.JOB_ITEM_ID 
    LEFT OUTER JOIN TASK_STEP_INFORMATION 
        ON TASK_INVENTORY_STEP.JOB_ITEM_ID = TASK_STEP_INFORMATION.JOB_ITEM_ID
WHERE 
    TASK_INVENTORY_STEP.STEP_TYPE = 'TYPE A'
GROUP BY
    ITEMS.ITEM_ID,
    ITEMS.ITEM_CODE,
    ITEMS.ITEMTYPE,
    ITEM_TRANSACTIONS.STATUS
ORDER BY 
    ITEMS.ITEM_CODE

Here is the Oracle query plan for the query using DISTINCT:

这是使用以下查询的 Oracle 查询计划DISTINCT

Oracle query plan for query using DISTINCT

使用 DISTINCT 进行查询的 Oracle 查询计划

Here is the Oracle query plan for the query using GROUP BY:

这是使用以下查询的 Oracle 查询计划GROUP BY

Oracle query plan for query using GROUP BY

使用 GROUP BY 进行查询的 Oracle 查询计划

采纳答案by Gordon Linoff

The performance difference is probably due to the execution of the subquery in the SELECTclause. I am guessing that it is re-executing this query for every row beforethe distinct. For the group by, it would execute once afterthe group by.

性能差异可能是由于SELECT子句中子查询的执行。我猜它正在为distinct之前的每一行重新执行这个查询。对于group by,它将在 group by之后执行一次。

Try replacing it with a join, instead:

尝试用连接替换它,而不是:

select . . .,
       parentcnt
from . . . left outer join
      (SELECT PARENT_ITEM_ID, COUNT(PKID) as parentcnt
       FROM ITEM_PARENTS 
      ) p
      on items.item_id = p.parent_item_id

回答by Vincent Malgrat

I'm fairly sure that GROUP BYand DISTINCThave roughly the same execution plan.

我相当确定,GROUP BY并且DISTINCT具有大致相同的执行计划。

The difference here since we have to guess (since we don't have the explain plans) is IMO that the inline subquery gets executed AFTERthe GROUP BYbut BEFOREthe DISTINCT.

因为我们猜测(因为我们没有解释计划)这里的区别是IMO的内嵌子查询被执行GROUP BY,但之前DISTINCT

So if your query returns 1M rows and gets aggregated to 1k rows:

因此,如果您的查询返回 1M 行并聚合为 1k 行:

  • The GROUP BYquery would have run the subquery 1000 times,
  • Whereas the DISTINCTquery would have run the subquery 1000000 times.
  • GROUP BY查询会运行子查询1000次,
  • DISTINCT查询将运行子查询 1000000 次。

The tkprof explain plan would help demonstrate this hypothesis.

tkprof 解释计划将有助于证明这一假设。



While we're discussing this, I think it's important to note that the way the query is written is misleading both to the reader and to the optimizer: you obviously want to find all rows from item/item_transactions that have a TASK_INVENTORY_STEP.STEP_TYPEwith a value of "TYPE A".

在我们讨论这个问题时,我认为重要的是要注意查询的编写方式会误导读者和优化器:您显然希望从 item/item_transactions 中找到TASK_INVENTORY_STEP.STEP_TYPE具有值“的所有行”类型 A”。

IMO your query would have a better plan and would be more easily readable if written like this:

IMO 您的查询将有一个更好的计划,如果这样写,将更容易阅读:

SELECT ITEMS.ITEM_ID,
       ITEMS.ITEM_CODE,
       ITEMS.ITEMTYPE,
       ITEM_TRANSACTIONS.STATUS,
       (SELECT COUNT(PKID) 
          FROM ITEM_PARENTS 
         WHERE PARENT_ITEM_ID = ITEMS.ITEM_ID) AS CHILD_COUNT
  FROM ITEMS
  JOIN ITEM_TRANSACTIONS 
    ON ITEMS.ITEM_ID = ITEM_TRANSACTIONS.ITEM_ID 
   AND ITEM_TRANSACTIONS.FLAG = 1
 WHERE EXISTS (SELECT NULL
                 FROM JOB_INVENTORY   
                 JOIN TASK_INVENTORY_STEP 
                   ON JOB_INVENTORY.JOB_ITEM_ID=TASK_INVENTORY_STEP.JOB_ITEM_ID
                WHERE TASK_INVENTORY_STEP.STEP_TYPE = 'TYPE A'
                  AND ITEMS.ITEM_ID = JOB_INVENTORY.ITEM_ID)

In many cases, a DISTINCT can be a sign that the query is not written properly (because a good query shouldn't return duplicates).

在许多情况下,DISTINCT 可能表示查询未正确编写(因为好的查询不应返回重复项)。

Note also that 4 tables are not used in your original select.

另请注意,原始选择中未使用 4 个表。

回答by Woot4Moo

The first thing that should be noted is the use of Distinctindicates a code smell, aka anti-pattern. It generally means that there is a missing join or an extra join that is generating duplicate data. Looking at your query above, I am guessing that the reason why group byis faster (without seeing the query), is that the location of the group byreduces the number of records that end up being returned. Whereas distinctis blowing out the result set and doing row by row comparisons.

应该注意的第一件事是使用Distinct指示代码异味,也就是反模式。这通常意味着缺少连接或生成重复数据的额外连接。查看上面的查询,我猜想group by更快的原因(没有看到查询)是因为位置group by减少了最终返回的记录数。而distinct正在炸毁结果集并进行逐行比较。

Update to approach

更新方法

Sorry, I should have been more clear. Records are generated when users perform certain tasks in the system, so there is no schedule. A user could generate a single record in a day or hundreds per-hour. The important things is that each time a user runs a search, up-to-date records must be returned, which makes me doubtful that a materialized view would work here, especially if the query populating it would take long to run.

对不起,我应该更清楚。用户在系统中执行某些任务时会生成记录,因此没有时间表。用户可以在一天内或每小时生成数百条记录。重要的是,每次用户运行搜索时,都必须返回最新的记录,这让我怀疑物化视图是否可以在这里工作,尤其是如果填充它的查询需要很长时间才能运行。

I do believe this is the exact reason to use a materialized view. So the process would work this way. You take the long running query as the piece that builds out your materialized view, since we know the user only cares about "new" data after they perform some arbitrary task in the system. So what you want to do is query against this base materialized view, which can be refreshed constantly on the back-end, the persistence strategy involved should not choke out the materialized view (persisting a few hundred records at a time won't crush anything). What this will allow is Oracle to grab a read lock (note we don't care how many sources read our data, we only care about writers). In the worst case a user will have "stale" data for microseconds, so unless this is a financial trading system on Wall Street or a system for a nuclear reactor, these "blips" should go unnoticed by even the most eagle eyed users.

我相信这是使用物化视图的确切原因。所以这个过程会以这种方式工作。您将长时间运行的查询视为构建物化视图的部分,因为我们知道用户在系统中执行某些任意任务后只关心“新”数据。所以你要做的是查询这个基础物化视图,它可以在后端不断刷新,所涉及的持久化策略不应该扼杀物化视图(一次持久化几百条记录不会粉碎任何东西)。这将允许 Oracle 获取读锁(请注意,我们不关心有多少源读取我们的数据,我们只关心写入者)。在最坏的情况下,用户将拥有几微秒的“陈旧”数据,

Code example of how to do this:

如何执行此操作的代码示例:

create materialized view dept_mv FOR UPDATE as select * from dept; 

Now the key to this is as long as you don' t invoke refresh you won't lose any of the persisted data. It will be up to you to determine when you want to "base line" your materialized view again (midnight perhaps?)

现在的关键是只要您不调用刷新,您就不会丢失任何持久化数据。由您决定何时再次“基线”您的物化视图(也许是午夜?)

回答by cesarfaria

You should use GROUP BY to apply aggregate operators to each group and DISTINCT if you only need to remove duplicates.

如果您只需要删除重复项,您应该使用 GROUP BY 将聚合运算符应用于每个组和 DISTINCT。

I think the performance is the same.

我认为性能是一样的。

In your case i think you should use GROUP BY.

在您的情况下,我认为您应该使用 GROUP BY。