SQL GROUP BY 和 DISTINCT 之间有什么区别吗

Question

提问by Brettski

I learned something simple about SQL the other day:

前几天我学到了一些关于 SQL 的简单知识：

SELECT c FROM myTbl GROUP BY C

Has the same result as:

具有相同的结果：

SELECT DISTINCT C FROM myTbl

What I am curious of, is there anything different in the way an SQL engine processes the command, or are they truly the same thing?

我很好奇，SQL 引擎处理命令的方式有什么不同，或者它们真的是一样的吗？

I personally prefer the distinct syntax, but I am sure it's more out of habit than anything else.

我个人更喜欢独特的语法，但我相信它比其他任何东西都更出于习惯。

EDIT: This is not a question about aggregates. The use of GROUP BYwith aggregate functions is understood.

编辑：这不是关于聚合的问题。GROUP BY了解 with 聚合函数的使用。

Answer 1

采纳答案by Skeolan

MusiGenesis' response is functionally the correct one with regard to your question as stated; the SQL Server is smart enough to realize that if you are using "Group By" and not using any aggregate functions, then what you actually mean is "Distinct" - and therefore it generates an execution plan as if you'd simply used "Distinct."

对于您所陈述的问题，MusiGenesis的回答在功能上是正确的；SQL Server 足够聪明，可以意识到如果您使用“Group By”而不使用任何聚合函数，那么您的实际意思是“Distinct” - 因此它会生成一个执行计划，就像您只是使用“Distinct”一样.”

However, I think it's important to note Hank's response as well - cavalier treatment of "Group By" and "Distinct" could lead to some pernicious gotchas down the line if you're not careful. It's not entirely correct to say that this is "not a question about aggregates" because you're asking about the functional difference between two SQL query keywords, one of which is meant to be used with aggregatesand one of which is not.

然而，我认为重要的是要注意Hank的回应——如果你不小心，对“Group By”和“Distinct”的漫不经心的处理可能会导致一些有害的陷阱。说这“不是关于聚合的问题”并不完全正确，因为您是在询问两个 SQL 查询关键字之间的功能差异，其中一个用于聚合，另一个不是。

A hammer can work to drive in a screw sometimes, but if you've got a screwdriver handy, why bother?

锤子有时可以用来拧螺丝，但如果您手边有螺丝刀，何必费心呢？

(for the purposes of this analogy, Hammer : Screwdriver :: GroupBy : Distinctand screw => get list of unique values in a table column)

（为了这个类比的目的，Hammer : Screwdriver :: GroupBy : Distinct和screw => get list of unique values in a table column）

Answer 2

回答by Andru Luvisi

GROUP BYlets you use aggregate functions, like AVG, MAX, MIN, SUM, and COUNT. On the other hand DISTINCTjust removes duplicates.

GROUP BY让您使用聚合函数，如AVG，MAX，MIN，SUM，和COUNT。另一方面DISTINCT只是删除重复项。

For example, if you have a bunch of purchase records, and you want to know how much was spent by each department, you might do something like:

例如，如果您有一堆购买记录，并且您想知道每个部门花了多少钱，您可能会执行以下操作：

SELECT department, SUM(amount) FROM purchases GROUP BY department

This will give you one row per department, containing the department name and the sum of all of the amountvalues in all rows for that department.

这将为每个部门提供一行，其中包含部门名称以及该部门amount所有行中所有值的总和。

Answer 3

回答by MusiGenesis

There is no difference(in SQL Server, at least). Both queries use the same execution plan.

没有区别（至少在 SQL Server 中）。两个查询使用相同的执行计划。

http://sqlmag.com/database-performance-tuning/distinct-vs-group

Maybe there isa difference, if there are sub-queries involved:

也许有是有差别，如果有子查询涉及：

http://blog.sqlauthority.com/2007/03/29/sql-server-difference-between-distinct-and-group-by-distinct-vs-group-by/

There is no difference(Oracle-style):

没有区别（Oracle 风格）：

http://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:32961403234212

Answer 4

回答by Lukas Eder

What's the difference from a mere duplicate removal functionality point of view

从单纯的重复删除功能的角度来看有什么区别

Apart from the fact that unlike DISTINCT, GROUP BYallows for aggregating data per group(which has been mentioned by many other answers), the most important difference in my opinion is the fact that the two operations "happen" at two very different steps in the logical order of operations that are executed in a SELECTstatement.

除了一个事实，即不像DISTINCT，GROUP BY允许汇总数据每组（这已被许多其他的答案中提到），在我看来，最重要的区别是，两个操作中的“发生”在两个非常不同的步骤的逻辑顺序在SELECT语句中执行的操作。

Here are the most important operations:

以下是最重要的操作：

FROM(including JOIN, APPLY, etc.)
WHERE
GROUP BY(can remove duplicates)
Aggregations
HAVING
Window functions
SELECT
DISTINCT(can remove duplicates)
UNION, INTERSECT, EXCEPT(can remove duplicates)
ORDER BY
OFFSET
LIMIT

FROM（包括JOIN，APPLY等）
WHERE
GROUP BY（可以删除重复项）
聚合
HAVING
窗口函数
SELECT
DISTINCT（可以删除重复项）
UNION, INTERSECT, EXCEPT（可以去除重复项）
ORDER BY
OFFSET
LIMIT

As you can see, the logical order of each operation influences what can be done with it and how it influences subsequent operations. In particular, the fact that the GROUP BYoperation "happens before"the SELECToperation (the projection) means that:

如您所见，每个操作的逻辑顺序会影响可以用它做什么以及它如何影响后续操作。特别地，事实GROUP BY操作“之前发生”的SELECT操作（投影）表示：

It doesn't depend on the projection (which can be an advantage)
It cannot use any values from the projection (which can be a disadvantage)

它不依赖于投影（这可能是一个优势）
它不能使用投影中的任何值（这可能是一个缺点）

1. It doesn't depend on the projection

1. 不依赖于投影

An example where not depending on the projection is useful is if you want to calculate window functions on distinct values:

一个不依赖于投影的例子是有用的，如果你想计算不同值的窗函数：

SELECT rating, row_number() OVER (ORDER BY rating) AS rn
FROM film
GROUP BY rating

When run against the Sakila database, this yields:

当针对Sakila 数据库运行时，这会产生：

rating   rn
-----------
G        1
NC-17    2
PG       3
PG-13    4
R        5

The same couldn't be achieved with DISTINCTeasily:

使用以下方法无法实现相同的目标DISTINCT：

SELECT DISTINCT rating, row_number() OVER (ORDER BY rating) AS rn
FROM film

That query is "wrong" and yields something like:

该查询是“错误的”并产生如下结果：

rating   rn
------------
G        1
G        2
G        3
...
G        178
NC-17    179
NC-17    180
...

This is not what we wanted. The DISTINCToperation "happens after"the projection, so we can no longer remove DISTINCTratings because the window function was already calculated and projected. In order to use DISTINCT, we'd have to nest that part of the query:

这不是我们想要的。该DISTINCT操作“发生在”投影之后，因此我们不能再删除DISTINCT评级，因为已经计算和投影了窗口函数。为了使用DISTINCT，我们必须嵌套查询的那部分：

SELECT rating, row_number() OVER (ORDER BY rating) AS rn
FROM (
  SELECT DISTINCT rating FROM film
) f

Side-note: In this particular case, we could also use DENSE_RANK()

旁注：在这种特殊情况下，我们也可以使用DENSE_RANK()

SELECT DISTINCT rating, dense_rank() OVER (ORDER BY rating) AS rn
FROM film

2. It cannot use any values from the projection

2. 它不能使用投影中的任何值

One of SQL's drawbacks is its verbosity at times. For the same reason as what we've seen before (namely the logical order of operations), we cannot "easily" group by something we're projecting.

SQL 的缺点之一是它有时很冗长。出于与我们之前看到的相同的原因（即操作的逻辑顺序），我们不能“轻松”地根据我们正在投影的内容进行分组。

This is invalid SQL:

这是无效的 SQL：

SELECT first_name || ' ' || last_name AS name
FROM customer
GROUP BY name

This is valid (repeating the expression)

这是有效的（重复表达式）

SELECT first_name || ' ' || last_name AS name
FROM customer
GROUP BY first_name || ' ' || last_name

This is valid, too (nesting the expression)

这也是有效的（嵌套表达式）

SELECT name
FROM (
  SELECT first_name || ' ' || last_name AS name
  FROM customer
) c
GROUP BY name

I've written about this topic more in depth in a blog post

我在博客文章中更深入地介绍了这个主题

Answer 5

回答by jkramer

Use DISTINCTif you just want to remove duplicates. Use GROUPY BYif you want to apply aggregate operators (MAX, SUM, GROUP_CONCAT, ..., or a HAVINGclause).

使用DISTINCT，如果你只是想删除重复。使用GROUPY BY，如果你想将集合运算符（MAX，SUM，GROUP_CONCAT，...，或HAVING条款）。

Answer 6

回答by Dave Costa

I expect there is the possibility for subtle differences in their execution. I checked the execution plans for two functionally equivalent queries along these lines in Oracle 10g:

我预计它们的执行可能存在细微差别。我在 Oracle 10g 中检查了两个功能等效的查询的执行计划：

core> select sta from zip group by sta;

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |    58 |   174 |    44  (19)| 00:00:01 |
|   1 |  HASH GROUP BY     |      |    58 |   174 |    44  (19)| 00:00:01 |
|   2 |   TABLE ACCESS FULL| ZIP  | 42303 |   123K|    38   (6)| 00:00:01 |
---------------------------------------------------------------------------

core> select distinct sta from zip;

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |    58 |   174 |    44  (19)| 00:00:01 |
|   1 |  HASH UNIQUE       |      |    58 |   174 |    44  (19)| 00:00:01 |
|   2 |   TABLE ACCESS FULL| ZIP  | 42303 |   123K|    38   (6)| 00:00:01 |
---------------------------------------------------------------------------

The middle operation is slightly different: "HASH GROUP BY" vs. "HASH UNIQUE", but the estimated costs etc. are identical. I then executed these with tracing on and the actual operation counts were the same for both (except that the second one didn't have to do any physical reads due to caching).

中间操作略有不同：“HASH GROUP BY”与“HASH UNIQUE”，但估计成本等是相同的。然后我在跟踪的情况下执行这些，并且两者的实际操作计数相同（除了第二个由于缓存而不必进行任何物理读取）。

But I think that because the operation names are different, the execution would follow somewhat different code paths and that opens the possibility of more significant differences.

但我认为，由于操作名称不同，执行将遵循一些不同的代码路径，这开启了更显着差异的可能性。

I think you should prefer the DISTINCT syntax for this purpose. It's not just habit, it more clearly indicates the purpose of the query.

为此，我认为您应该更喜欢 DISTINCT 语法。这不仅仅是习惯，它更清楚地表明了查询的目的。

Answer 7

回答by The Light

I read all the above comments but didn't see anyone pointed to the main difference between Group By and Distinct apart from the aggregation bit.

我阅读了上述所有评论，但除了聚合位之外，没有人指出 Group By 和 Distinct 之间的主要区别。

Distinct returns all the rows then de-duplicates them whereas Group By de-deduplicate the rows as they're read by the algorithm one by one.

Distinct 返回所有行，然后对它们进行去重，而 Group By 会在算法一一读取行时对行进行去重。

This means they can produce different results!

这意味着它们可以产生不同的结果！

For example, the below codes generate different results:

例如，以下代码生成不同的结果：

SELECT distinct ROW_NUMBER() OVER (ORDER BY Name), Name FROM NamesTable

 SELECT ROW_NUMBER() OVER (ORDER BY Name), Name FROM NamesTable
GROUP BY Name

If there are 10 names in the table where 1 of which is a duplicate of another then the first query returns 10 rows whereas the second query returns 9 rows.

如果表中有 10 个名称，其中 1 个与另一个名称重复，则第一个查询返回 10 行，而第二个查询返回 9 行。

The reason is what I said above so they can behave differently!

原因就是我上面所说的，所以他们可以表现得不同！

Answer 8

回答by Joel Coehoorn

For the query you posted, they are identical. But for other queries that may not be true.

对于您发布的查询，它们是相同的。但对于其他可能不正确的查询。

For example, it's not the same as:

例如，它与以下不同：

SELECT C FROM myTbl GROUP BY C, D

Answer 9

回答by Bill the Lizard

If you use DISTINCT with multiple columns, the result set won't be grouped as it will with GROUP BY, and you can't use aggregate functions with DISTINCT.

如果您对多列使用 DISTINCT，结果集将不会像使用 GROUP BY 那样分组，并且您不能将聚合函数与 DISTINCT 一起使用。

Answer 10

回答by Hank Gay

They have different semantics, even if they happen to have equivalent results on your particular data.

它们具有不同的语义，即使它们碰巧对您的特定数据具有相同的结果。

SQL GROUP BY 和 DISTINCT 之间有什么区别吗

提问by Brettski

采纳答案by Skeolan

回答by Andru Luvisi

回答by MusiGenesis

回答by Lukas Eder

What's the difference from a mere duplicate removal functionality point of view

从单纯的重复删除功能的角度来看有什么区别

1. It doesn't depend on the projection

1. 不依赖于投影

2. It cannot use any values from the projection

2. 它不能使用投影中的任何值

回答by jkramer

回答by Dave Costa

回答by The Light

回答by Joel Coehoorn

回答by Bill the Lizard

回答by Hank Gay

相关推荐

最近更新

标签

SQL GROUP BY 和 DISTINCT 之间有什么区别吗

提问by Brettski

采纳答案by Skeolan

回答by Andru Luvisi

回答by MusiGenesis

回答by Lukas Eder

What's the difference from a mere duplicate removal functionality point of view

从单纯的重复删除功能的角度来看有什么区别

1. It doesn't depend on the projection

1. 不依赖于投影

2. It cannot use any values from the projection

2. 它不能使用投影中的任何值

回答by jkramer

回答by Dave Costa

回答by The Light

回答by Joel Coehoorn

回答by Bill the Lizard

回答by Hank Gay

相关推荐

SQL 带回车的表中的数据？

正则表达式匹配常见的 SQL 语法？

SQL 查找列中不同值的数量

SQL 如何使用 INFORMATION_SCHEMA 找到默认约束？

相关推荐

最近更新

标签