SQL: How do I (or can I) SELECT DISTINCT on multiple columns?

Notice: this page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/54418/

Time: 2020-08-31 23:19:10  Source: igfitidea

How do I (or can I) SELECT DISTINCT on multiple columns?

sql postgresql sql-update duplicates distinct

Asked by sheats

I need to retrieve all rows from a table where 2 columns combined are all different. So I want all the sales that do not have any other sales that happened on the same day for the same price. The sales that are unique based on day and price will get updated to an active status.

So I'm thinking:

UPDATE sales
SET status = 'ACTIVE'
WHERE id IN (SELECT DISTINCT (saleprice, saledate), id, count(id)
             FROM sales
             HAVING count = 1)

But my brain hurts going any farther than that.

Answered by Joel Coehoorn

SELECT DISTINCT a,b,c FROM t

is roughly equivalent to:

SELECT a,b,c FROM t GROUP BY a,b,c

It's a good idea to get used to the GROUP BY syntax, as it's more powerful.
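
The equivalence can be checked on a toy table. This is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for a real server; the table `t`, its columns, and the sample rows are made up for illustration. The last query hints at why GROUP BY is the more powerful form: it can filter whole groups with HAVING, which DISTINCT cannot.

```python
import sqlite3

# Hypothetical toy table; names and values are assumptions for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER, b INTEGER, c INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [(1, 1, 1), (1, 1, 1), (1, 2, 3), (2, 2, 2), (1, 2, 3)],
)

# Both forms collapse duplicate (a, b, c) combinations.
distinct_rows = conn.execute(
    "SELECT DISTINCT a, b, c FROM t ORDER BY a, b, c"
).fetchall()
grouped_rows = conn.execute(
    "SELECT a, b, c FROM t GROUP BY a, b, c ORDER BY a, b, c"
).fetchall()

# Only GROUP BY lets you keep just the combinations that occur once.
unique_rows = conn.execute(
    "SELECT a, b, c FROM t GROUP BY a, b, c HAVING COUNT(*) = 1"
).fetchall()

print(distinct_rows, grouped_rows, unique_rows)
```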

For your query, I'd do it like this:

UPDATE sales
SET status='ACTIVE'
WHERE id IN
(
    SELECT id
    FROM sales S
    INNER JOIN
    (
        SELECT saleprice, saledate
        FROM sales
        GROUP BY saleprice, saledate
        HAVING COUNT(*) = 1 
    ) T
    ON S.saleprice=T.saleprice AND s.saledate=T.saledate
 )

Answered by Erwin Brandstetter

If you put together the answers so far, clean up and improve, you would arrive at this superior query:

UPDATE sales
SET    status = 'ACTIVE'
WHERE  (saleprice, saledate) IN (
    SELECT saleprice, saledate
    FROM   sales
    GROUP  BY saleprice, saledate
    HAVING count(*) = 1 
    );

This is much faster than either of them. It beats the performance of the currently accepted answer by a factor of 10 to 15 (in my tests on PostgreSQL 8.4 and 9.1).
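
To see which rows the row-value form selects, here is a small sketch run against an in-memory SQLite database through Python's sqlite3 module. It assumes a SQLite build with row-value support (3.15 or later); the sample sales are invented, and it says nothing about the PostgreSQL timings quoted above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (id INTEGER PRIMARY KEY, saleprice INTEGER, "
    "saledate TEXT, status TEXT)"
)
# Invented sample data: ids 1 and 2 share the same price and day.
conn.executemany(
    "INSERT INTO sales (id, saleprice, saledate) VALUES (?, ?, ?)",
    [
        (1, 100, "2020-01-01"),
        (2, 100, "2020-01-01"),
        (3, 100, "2020-01-02"),
        (4, 250, "2020-01-01"),
    ],
)

conn.execute(
    """
    UPDATE sales
    SET    status = 'ACTIVE'
    WHERE  (saleprice, saledate) IN (
        SELECT saleprice, saledate
        FROM   sales
        GROUP  BY saleprice, saledate
        HAVING count(*) = 1
    )
    """
)

active_ids = [
    row[0]
    for row in conn.execute(
        "SELECT id FROM sales WHERE status = 'ACTIVE' ORDER BY id"
    )
]
print(active_ids)
```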

But this is still far from optimal. Use a NOT EXISTS (anti-)semi-join for even better performance. EXISTS is standard SQL, has been around forever (at least since PostgreSQL 7.2, long before this question was asked), and fits the presented requirements perfectly:

UPDATE sales s
SET    status = 'ACTIVE'
WHERE  NOT EXISTS (
   SELECT FROM sales s1                     -- SELECT list can be empty for EXISTS
   WHERE  s.saleprice = s1.saleprice
   AND    s.saledate  = s1.saledate
   AND    s.id <> s1.id                     -- except for row itself
   )
AND    s.status IS DISTINCT FROM 'ACTIVE';  -- avoid empty updates. see below

db<>fiddle here
Old SQL Fiddle
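
For a local check without PostgreSQL, the same idea can be sketched in SQLite via Python's sqlite3 module. Three adaptations are assumptions of this sketch, not part of the answer: SQLite needs a non-empty SELECT list (`SELECT 1`), the outer table is referenced by name instead of an alias, and `IS NOT` plays the role of `IS DISTINCT FROM`. The sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (id INTEGER PRIMARY KEY, saleprice INTEGER, "
    "saledate TEXT, status TEXT)"
)
# Invented sample data: ids 1 and 2 share the same price and day.
conn.executemany(
    "INSERT INTO sales (id, saleprice, saledate) VALUES (?, ?, ?)",
    [
        (1, 100, "2020-01-01"),
        (2, 100, "2020-01-01"),
        (3, 100, "2020-01-02"),
        (4, 250, "2020-01-01"),
    ],
)

conn.execute(
    """
    UPDATE sales
    SET    status = 'ACTIVE'
    WHERE  NOT EXISTS (
        SELECT 1 FROM sales s1               -- SQLite needs a SELECT list
        WHERE  s1.saleprice = sales.saleprice
        AND    s1.saledate  = sales.saledate
        AND    s1.id <> sales.id             -- except for the row itself
    )
    AND    status IS NOT 'ACTIVE'            -- SQLite's null-safe "differs from"
    """
)

active_ids = [
    row[0]
    for row in conn.execute(
        "SELECT id FROM sales WHERE status = 'ACTIVE' ORDER BY id"
    )
]
print(active_ids)
```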

Unique key to identify row

If you don't have a primary or unique key for the table (id in the example), you can substitute the system column ctid for the purpose of this query (but not for some other purposes):

   AND    s1.ctid <> s.ctid

Every table should have a primary key. Add one if you don't have one yet. I suggest a serial column, or an IDENTITY column in Postgres 10+.

How is this faster?

The subquery in the EXISTS anti-semi-join can stop evaluating as soon as the first dupe is found (no point in looking further). For a base table with few duplicates this is only mildly more efficient. With lots of duplicates this becomes way more efficient.
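
The early-exit effect can be illustrated with a plain Python analogy. This is only a sketch of the idea, not how any query planner actually executes the statement; both helper functions and the sample data are hypothetical.

```python
# 1000 identical sales plus one unique sale (invented data).
rows = [(100, "2020-01-01")] * 1000 + [(250, "2020-01-02")]


def exists_style(rows, i):
    """Stop at the first other row equal to rows[i], as EXISTS may do."""
    comparisons = 0
    for j, other in enumerate(rows):
        if j == i:
            continue
        comparisons += 1
        if other == rows[i]:
            return True, comparisons
    return False, comparisons


def count_style(rows, i):
    """Compare against every other row, as counting a whole group must."""
    comparisons = 0
    matches = 0
    for j, other in enumerate(rows):
        if j == i:
            continue
        comparisons += 1
        if other == rows[i]:
            matches += 1
    return matches > 0, comparisons


early = exists_style(rows, 0)  # a duplicate is found on the first comparison
full = count_style(rows, 0)    # every other row is inspected
print(early, full)
```

With many duplicates the early-exit version does a tiny fraction of the work; for a row without duplicates both versions have to look at everything.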

Exclude empty updates

For rows that already have status = 'ACTIVE', this update would not change anything but would still insert a new row version at full cost (minor exceptions apply). Normally, you do not want this. Add another WHERE condition as demonstrated above to avoid this and make the update even faster.

If status is defined NOT NULL, you can simplify to:

AND status <> 'ACTIVE';
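
The effect of the extra condition can be observed by counting touched rows. The sketch below uses SQLite through Python's sqlite3 module, with `IS NOT` standing in for `IS DISTINCT FROM`; the helper `rows_touched` and the three sample rows are assumptions for illustration. It also shows why the plain `<>` form is only a safe simplification when status cannot be NULL.

```python
import sqlite3


def rows_touched(update_sql):
    """Run update_sql against a fresh sample table and return the row count."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, status TEXT)")
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?)",
        [(1, "ACTIVE"), (2, None), (3, "CLOSED")],
    )
    return conn.execute(update_sql).rowcount


# No guard: the row that is already ACTIVE is rewritten as well.
unguarded = rows_touched("UPDATE sales SET status = 'ACTIVE'")

# Null-safe guard: only rows whose status really differs are updated.
null_safe = rows_touched(
    "UPDATE sales SET status = 'ACTIVE' WHERE status IS NOT 'ACTIVE'"
)

# Plain <> guard: the NULL row is skipped, because NULL <> 'ACTIVE' is NULL.
plain_ne = rows_touched(
    "UPDATE sales SET status = 'ACTIVE' WHERE status <> 'ACTIVE'"
)

print(unguarded, null_safe, plain_ne)
```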

Subtle difference in NULL handling

This query (unlike the currently accepted answer by Joel) does not treat NULL values as equal. The following two rows for (saleprice, saledate) would qualify as "distinct" (though they look identical to the human eye):

(123, NULL)
(123, NULL)

Both rows are also accepted by a unique index and almost anywhere else, since NULL values do not compare equal according to the SQL standard.

On the other hand, GROUP BY, DISTINCT, and DISTINCT ON () treat NULL values as equal. Use an appropriate query style depending on what you want to achieve. You can still use this faster query with IS NOT DISTINCT FROM instead of = for any or all comparisons to make NULL compare equal.

If all columns being compared are defined NOT NULL, there is no room for disagreement.
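
The difference shows up with the two (123, NULL) rows from above. This sketch runs in SQLite via Python's sqlite3 module, where the binary `IS` operator is used as the counterpart of `IS NOT DISTINCT FROM`; that substitution and the SELECT-only form of the query are assumptions made to keep the example small.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (id INTEGER PRIMARY KEY, saleprice INTEGER, saledate TEXT)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [(1, 123, None), (2, 123, None)],
)

# GROUP BY puts both rows into one group: NULLs are treated as equal.
group_sizes = [
    row[0]
    for row in conn.execute(
        "SELECT count(*) FROM sales GROUP BY saleprice, saledate"
    )
]

unique_sql = """
    SELECT s.id
    FROM   sales s
    WHERE  NOT EXISTS (
        SELECT 1 FROM sales s1
        WHERE  s1.saleprice {op} s.saleprice
        AND    s1.saledate  {op} s.saledate
        AND    s1.id <> s.id
    )
    ORDER  BY s.id
"""

# With "=", NULL never matches NULL, so both rows count as unique.
unique_with_equals = [r[0] for r in conn.execute(unique_sql.format(op="="))]

# With the null-safe operator, the rows are duplicates of each other.
unique_with_is = [r[0] for r in conn.execute(unique_sql.format(op="IS"))]

print(group_sizes, unique_with_equals, unique_with_is)
```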

Answered by Christian Berg

The problem with your query is that when using a GROUP BY clause (which you essentially do by using distinct) you can only use columns that you group by or aggregate functions. You cannot use the column id because there are potentially different values. In your case there is always only one value because of the HAVING clause, but most RDBMS are not smart enough to recognize that.

This should work however (and doesn't need a join):

UPDATE sales
SET status='ACTIVE'
WHERE id IN (
  SELECT MIN(id) FROM sales
  GROUP BY saleprice, saledate
  HAVING COUNT(id) = 1
)

You could also use MAX or AVG instead of MIN. What matters is that the function returns the column's value when there is exactly one matching row.
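
A quick way to convince yourself is to run the statement on a few rows. The sketch below does that in SQLite through Python's sqlite3 module; the sample sales are invented, and the second query simply confirms that MIN and MAX pick the same id for single-row groups.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (id INTEGER PRIMARY KEY, saleprice INTEGER, "
    "saledate TEXT, status TEXT)"
)
# Invented sample data: ids 1 and 2 share the same price and day.
conn.executemany(
    "INSERT INTO sales (id, saleprice, saledate) VALUES (?, ?, ?)",
    [
        (1, 100, "2020-01-01"),
        (2, 100, "2020-01-01"),
        (3, 100, "2020-01-02"),
        (4, 250, "2020-01-01"),
    ],
)

conn.execute(
    """
    UPDATE sales
    SET status = 'ACTIVE'
    WHERE id IN (
      SELECT MIN(id) FROM sales
      GROUP BY saleprice, saledate
      HAVING COUNT(id) = 1
    )
    """
)
active_ids = [
    row[0]
    for row in conn.execute(
        "SELECT id FROM sales WHERE status = 'ACTIVE' ORDER BY id"
    )
]

# For groups of exactly one row, MIN(id) and MAX(id) are the same value.
min_max_pairs = conn.execute(
    "SELECT MIN(id), MAX(id) FROM sales "
    "GROUP BY saleprice, saledate HAVING COUNT(id) = 1 ORDER BY 1"
).fetchall()

print(active_ids, min_max_pairs)
```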

Answered by frans eilering

I want to select the distinct values from one column 'GrondOfLucht' but they should be sorted in the order as given in the column 'sortering'. I cannot get the distinct values of just one column using

Select distinct GrondOfLucht,sortering
from CorWijzeVanAanleg
order by sortering

It will also give the column 'sortering' and because 'GrondOfLucht' AND 'sortering' is not unique, the result will be ALL rows.

Use GROUP BY to select the records of 'GrondOfLucht' in the order given by 'sortering':

SELECT        GrondOfLucht
FROM            dbo.CorWijzeVanAanleg
GROUP BY GrondOfLucht, sortering
ORDER BY MIN(sortering)
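
The query above can be tried on sample data. This is a sketch in SQLite via Python's sqlite3 module (so the `dbo.` schema prefix is dropped), and the rows are invented. It assumes each 'GrondOfLucht' value normally carries a single 'sortering' value; the second half shows what happens when that assumption breaks, and a variant grouping on 'GrondOfLucht' alone, which is this sketch's suggestion rather than part of the answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE CorWijzeVanAanleg (GrondOfLucht TEXT, sortering INTEGER)")
conn.executemany(
    "INSERT INTO CorWijzeVanAanleg VALUES (?, ?)",
    [("Grond", 2), ("Lucht", 1), ("Grond", 2), ("Water", 3), ("Lucht", 1)],
)

both_columns_sql = (
    "SELECT GrondOfLucht FROM CorWijzeVanAanleg "
    "GROUP BY GrondOfLucht, sortering ORDER BY MIN(sortering)"
)
one_column_sql = (
    "SELECT GrondOfLucht FROM CorWijzeVanAanleg "
    "GROUP BY GrondOfLucht ORDER BY MIN(sortering)"
)

# One sortering value per GrondOfLucht: each value appears once, in order.
ordered = [r[0] for r in conn.execute(both_columns_sql)]

# Add a second sortering value for 'Grond' to break the assumption.
conn.execute("INSERT INTO CorWijzeVanAanleg VALUES ('Grond', 5)")
with_repeat = [r[0] for r in conn.execute(both_columns_sql)]
single_group = [r[0] for r in conn.execute(one_column_sql)]

print(ordered, with_repeat, single_group)
```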

Answered by Abdulhafeth Sartawi

If your DBMS doesn't support distinct with multiple columns like this:

select distinct(col1, col2) from table

Multi select in general can be executed safely as follows:

select distinct * from (select col1, col2 from table ) as x

This works on most DBMSs and is expected to be faster than the GROUP BY solution, since it avoids the grouping step.
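
The derived-table form can be tried as follows. This is a sketch in SQLite via Python's sqlite3 module; the table is called `t` here because `table` is a reserved word, and the columns and rows are made up. It only checks the result set and makes no claim about speed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table with an extra column that should not affect the result.
conn.execute("CREATE TABLE t (col1 INTEGER, col2 TEXT, other TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [(1, "a", "x"), (1, "a", "y"), (1, "b", "x"), (2, "a", "z")],
)

# DISTINCT * over a derived table that exposes only the columns of interest.
via_subquery = conn.execute(
    "SELECT DISTINCT * FROM (SELECT col1, col2 FROM t) AS x ORDER BY col1, col2"
).fetchall()

# The plain multi-column DISTINCT gives the same rows.
direct = conn.execute(
    "SELECT DISTINCT col1, col2 FROM t ORDER BY col1, col2"
).fetchall()

print(via_subquery, direct)
```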
