Do all columns in a SELECT list have to appear in a GROUP BY clause?
Original source: http://stackoverflow.com/questions/5986127/
Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me): StackOverflow
Asked by user559142
My lecturer stated:
All column names in SELECT list must appear in GROUP BY clause unless name is used only in an aggregate function
I'd just like some confirmation of this, as I cannot think of a logical explanation for why it should be true...
Answered by Rob P.
Imagine the following:
A    B   C
Cat  10  False
Dog  25  True
Dog  20  False
Cat  5   False
If you select A and B but group by A only, what would your output be? You'd have only two rows (or tuples) because A has two distinct values, but which value of B should each row display?
If you group by A, B, you get four rows; no problem there. If you group by A and apply an aggregate function to B, such as SUM(B), you get two rows again:
Cat 15
Dog 45
But if you select A, B and group only by A, the database doesn't know what to do. Truthfully, I believe some databases will select a random value for B in that case, and others will give you an error message.
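The two well-defined cases from this answer can be checked with a minimal sketch. It assumes Python's built-in sqlite3 module as the engine, and the table name t is hypothetical; the rows are the four from the example.

```python
import sqlite3

# In-memory table holding the four rows from the example (table name "t" is made up).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (A TEXT, B INTEGER, C TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [("Cat", 10, "False"), ("Dog", 25, "True"),
     ("Dog", 20, "False"), ("Cat", 5, "False")],
)

# Grouping by both selected columns: every (A, B) pair is its own group.
by_a_b = conn.execute(
    "SELECT A, B FROM t GROUP BY A, B ORDER BY A, B"
).fetchall()

# Grouping by A alone and aggregating B: one row per distinct A.
sum_by_a = conn.execute(
    "SELECT A, SUM(B) FROM t GROUP BY A ORDER BY A"
).fetchall()

print(by_a_b)    # [('Cat', 5), ('Cat', 10), ('Dog', 20), ('Dog', 25)]
print(sum_by_a)  # [('Cat', 15), ('Dog', 45)]
```

The ORDER BY clauses are only there to make the printed order predictable.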
Answered by Mike Sherrill 'Cat Recall'
That's historically true. Omitting unaggregated columns leads to indeterminate behavior. SQL aims at fully determinate behavior.
But SQL standards have recently changed to let you omit from the GROUP BY clause columns that are functionally dependent on columns that are in the GROUP BY. PostgreSQL follows the more recent SQL standards. (It's not the only one.) Behavior is still fully determinate.
create table a (
  a_id integer primary key,
  xfr_date date not null
);

create table b (
  a_id integer not null references a (a_id),
  recd_date date not null,
  units_recd integer not null
    check (units_recd >= 0),
  primary key (a_id, recd_date)
);

select a.a_id, a.xfr_date, sum(b.units_recd)
from a
inner join b on a.a_id = b.a_id
group by a.a_id;  -- The column a.xfr_date is functionally dependent
                  -- on a.a_id; it doesn't have to appear in the
                  -- GROUP BY clause.
The notable deviation from SQL standards is MySQL. It lets you omit just about everything from the GROUP BY. But that design choice makes its behavior indeterminate when you omit columns that are in the SELECT list.
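To see why omitting a functionally dependent column stays determinate, here is a rough sketch that runs the query from this answer with and without a.xfr_date in the GROUP BY and compares the results. It assumes Python's sqlite3 module rather than PostgreSQL, stores dates as text, and uses made-up sample rows. SQLite does not check functional dependency at all (it accepts ungrouped columns regardless), so this only illustrates that the two forms agree, not that an engine enforces the rule.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (a_id INTEGER PRIMARY KEY, xfr_date TEXT NOT NULL);
    CREATE TABLE b (
        a_id INTEGER NOT NULL REFERENCES a (a_id),
        recd_date TEXT NOT NULL,
        units_recd INTEGER NOT NULL CHECK (units_recd >= 0),
        PRIMARY KEY (a_id, recd_date)
    );
    -- Hypothetical sample rows, not from the original answer.
    INSERT INTO a VALUES (1, '2011-05-01'), (2, '2011-05-02');
    INSERT INTO b VALUES (1, '2011-05-03', 10), (1, '2011-05-04', 5),
                         (2, '2011-05-05', 7);
""")

query = """
    SELECT a.a_id, a.xfr_date, SUM(b.units_recd)
    FROM a INNER JOIN b ON a.a_id = b.a_id
    GROUP BY {cols}
    ORDER BY a.a_id
"""
# Same query, once grouping by the key only and once by key plus dependent column.
short_form = conn.execute(query.format(cols="a.a_id")).fetchall()
long_form = conn.execute(query.format(cols="a.a_id, a.xfr_date")).fetchall()

print(short_form)               # [(1, '2011-05-01', 15), (2, '2011-05-02', 7)]
print(short_form == long_form)  # True
```

Because a_id is the primary key of a, every row in a group carries the same xfr_date, so there is only one value the engine could return.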
Answered by Kibbee
Actually, in MySQL you don't have to group by all columns. You can just group by whatever columns you want. The problem is, it will just pull a random value (from the set of available rows in the group) for the fields which aren't in the group by. If you know that you are grouping by something that is a unique key, there's no point in grouping by the rest of the fields, as they will already all have the same value anyway. It can actually speed it up to not have to group by every field when it is completely unnecessary.
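The "value from one of the rows in the group" behaviour can be observed with a small sketch. It assumes Python's sqlite3 module as a stand-in for a permissive engine, since SQLite also accepts ungrouped, unaggregated columns; MySQL's own behaviour depends on its version and settings and is not tested here. The data is Rob P.'s four-row table, and the table name t is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (A TEXT, B INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [("Cat", 10), ("Dog", 25), ("Dog", 20), ("Cat", 5)],
)

# B is neither grouped nor aggregated; SQLite accepts this and takes B
# from some row of each group, without promising which one.
rows = dict(conn.execute("SELECT A, B FROM t GROUP BY A").fetchall())

# Values that actually occur for each A, for comparison.
candidates = {"Cat": {5, 10}, "Dog": {20, 25}}
for name, picked in rows.items():
    print(name, picked, picked in candidates[name])
```

The only thing the sketch relies on is that each picked value belongs to its group; which one comes back should be treated as unspecified.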
Answered by Michael Will
So the simple answer is: it depends. MySQL allows it, Vertica doesn't.
There is actually a valid use case for omitting a column, and that is when you are already selecting a particular row, say with MIN().
Here is an actual example for event tracking. Imagine you have credit and purchase events.
For simplicity we say a=credit, b,c,d are some kind of purchase event, and time is tracked with a running number. Now you want to find the date of the first purchase after each credit. We also happen to have only one customer 0:
create table events (user_id int ,created_at int, event varchar(255));
insert into events values (0,0, 'a');
insert into events values (0,1, 'b');
insert into events values (0,2, 'c');
insert into events values (0,3, 'a');
insert into events values (0,4, 'c');
insert into events values (0,5, 'b');
insert into events values (0,6, 'a');
insert into events values (0,7, 'a');
insert into events values (0,8, 'd');
mysql> SELECT user_id, MAX(purchased) AS purchased, spent, event
    -> FROM (SELECT e1.User_ID AS user_id, e1.created_at AS purchased,
    ->              MIN(e2.created_at) AS spent, e2.event AS event
    ->       FROM events e1, events e2
    ->       WHERE e1.user_id = e2.user_id
    ->         AND e1.created_at <= e2.created_at
    ->         AND e1.Event = 'a' AND e2.Event != 'a'
    ->       GROUP BY e1.user_id, e1.created_at) e3
    -> GROUP BY user_id, spent;
+---------+-----------+-------+-------+
| user_id | purchased | spent | event |
+---------+-----------+-------+-------+
|       0 |         0 |     1 | b     |
|       0 |         3 |     4 | c     |
|       0 |         7 |     8 | d     |
+---------+-----------+-------+-------+
3 rows in set (0.00 sec)
This looks good in MySQL, but does not work in Vertica:
ERROR 2640: Column "e2.event" must appear in the GROUP BY clause or be used in an aggregate function
If I omit the event column, it works in both, but I do actually want to know what value event had in the row that MIN selected.
So my answer ends with a request for comment :) Any ideas?
Answered by hsmiths
If you are grouping on something, you cannot see the individual values of non-grouped columns, because there may be more than one value within each group. All you can do is report on aggregate functions (sum, count, min, etc.); these are able to combine the multiple values into a single cell in the result.
Answered by Karl
There are exceptions, as noted by Sam Saffron, but generally what your lecturer said is true.
If I select 3 columns and group by 2 what should the RDBMS do with the 3rd column?
The developers of the RDBMS may make a decision about how to handle the extra column (as it appears MySQL's developers have), but is it the decision I would have made, or the one I want when writing the select? Will the decision always be valid? I certainly prefer the Oracle-like approach of forcing me to explicitly state what should happen.
If I select 3 columns and group by 2, should the RDBMS group by all 3, or pick a value from the 3rd: a random one, the biggest or littlest, the most common?
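Stating explicitly what should happen to the extra column usually means wrapping it in an aggregate. A minimal sketch, assuming Python's sqlite3 module and reusing Rob P.'s four-row table under the hypothetical name t:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (A TEXT, B INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [("Cat", 10), ("Dog", 25), ("Dog", 20), ("Cat", 5)],
)

# Each aggregate is an explicit answer to "which B do you want per group?":
# the littlest, the biggest, or how many there are.
rows = conn.execute(
    "SELECT A, MIN(B), MAX(B), COUNT(B) FROM t GROUP BY A ORDER BY A"
).fetchall()

print(rows)  # [('Cat', 5, 10, 2), ('Dog', 20, 25, 2)]
```

Written this way the query leaves nothing for the engine to decide, which is the property the Oracle-like approach enforces.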
Answered by user3133172
This is an answer for Michael Will's example/question.
SELECT
  e3.user_id,
  MAX(e3.purchased) AS purchased,
  e3.spent,
  e.event
FROM
  events e
INNER JOIN
  (SELECT
     e1.user_id AS user_id,
     MIN(e1.created_at) as spent,
     e2.created_at as purchased
   FROM
     events e1
   INNER JOIN
     (SELECT e.user_id, e.created_at from events e WHERE e.event = 'a') e2
   ON e1.user_id = e2.user_id
   AND e1.created_at >= e2.created_at
   AND e1.event != 'a'
   GROUP BY e1.User_ID, e2.created_at
  ) e3
ON e.user_id = e3.user_id AND e.created_at = e3.spent
GROUP BY e3.user_id, e3.spent, e.event;
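As a quick check, the query above can be run against Michael Will's sample events. This is only a sketch, assuming Python's sqlite3 module instead of MySQL or Vertica; the ORDER BY at the end is an addition of mine to make the row order predictable. Every selected column is either grouped or aggregated, so the query does not rely on permissive GROUP BY handling.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (user_id INT, created_at INT, event VARCHAR(255))"
)
# The nine events from the earlier answer, all for customer 0.
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [(0, 0, "a"), (0, 1, "b"), (0, 2, "c"), (0, 3, "a"), (0, 4, "c"),
     (0, 5, "b"), (0, 6, "a"), (0, 7, "a"), (0, 8, "d")],
)

rows = conn.execute("""
    SELECT e3.user_id, MAX(e3.purchased) AS purchased, e3.spent, e.event
    FROM events e
    INNER JOIN (
        SELECT e1.user_id AS user_id,
               MIN(e1.created_at) AS spent,
               e2.created_at AS purchased
        FROM events e1
        INNER JOIN (
            SELECT e.user_id, e.created_at FROM events e WHERE e.event = 'a'
        ) e2
            ON e1.user_id = e2.user_id
           AND e1.created_at >= e2.created_at
           AND e1.event != 'a'
        GROUP BY e1.user_id, e2.created_at
    ) e3
        ON e.user_id = e3.user_id AND e.created_at = e3.spent
    GROUP BY e3.user_id, e3.spent, e.event
    ORDER BY e3.spent  -- added only for a stable output order
""").fetchall()

print(rows)  # [(0, 0, 1, 'b'), (0, 3, 4, 'c'), (0, 7, 8, 'd')]
```

The rows match the MySQL output shown earlier, with event now taken from the row that MIN selected via the join back to events.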