Counting DISTINCT over multiple columns
Note: this page is a translation of a popular StackOverFlow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and authors, and attribute it to the original authors (not me): StackOverFlow
Original URL: http://stackoverflow.com/questions/1471250/
Asked by Novitzky
Is there a better way of doing a query like this:
SELECT COUNT(*)
FROM (SELECT DISTINCT DocumentId, DocumentSessionId
FROM DocumentOutputItems) AS internalQuery
I need to count the number of distinct items from this table but the distinct is over two columns.
My query works fine but I was wondering if I can get the final result using just one query (without using a sub-query)
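For reference, here is a minimal sketch of the sub-query form above run against a few made-up rows. It uses Python's built-in sqlite3 module as a stand-in for SQL Server, which is an assumption made only so the example is runnable; the sample data is hypothetical.

```python
import sqlite3

# Hypothetical sample data; SQLite stands in for SQL Server here.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE DocumentOutputItems (DocumentId INTEGER, DocumentSessionId INTEGER)"
)
conn.executemany(
    "INSERT INTO DocumentOutputItems VALUES (?, ?)",
    [(1, 10), (1, 10), (1, 11), (2, 10), (2, 10)],
)

# Count distinct (DocumentId, DocumentSessionId) pairs via a derived table.
distinct_pairs = conn.execute(
    """
    SELECT COUNT(*)
    FROM (SELECT DISTINCT DocumentId, DocumentSessionId
          FROM DocumentOutputItems) AS internalQuery
    """
).fetchone()[0]

print(distinct_pairs)  # the distinct pairs are (1,10), (1,11) and (2,10)
```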
Accepted answer by Jason Horner
If you are trying to improve performance, you could try creating a persisted computed column on either a hash or concatenated value of the two columns.
Once it is persisted, provided the column is deterministic and you are using "sane" database settings, it can be indexed and / or statistics can be created on it.
I believe a distinct count of the computed column would be equivalent to your query.
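The idea can be sketched roughly as follows. SQLite (via Python's sqlite3) is used only as a runnable stand-in, and the persisted computed column is emulated with an ordinary column filled by an UPDATE; in SQL Server it would be a real persisted computed column. The column name PairKey, the index name and the ':' separator are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE DocumentOutputItems ("
    "DocumentId INTEGER, DocumentSessionId INTEGER, PairKey TEXT)"
)
conn.executemany(
    "INSERT INTO DocumentOutputItems (DocumentId, DocumentSessionId) VALUES (?, ?)",
    [(1, 10), (1, 10), (1, 11), (2, 10), (11, 0), (1, 10)],
)

# Emulate the persisted computed column: store the concatenated key once,
# with a separator so that (1, 10) and (11, 0) do not collide.
conn.execute(
    "UPDATE DocumentOutputItems SET PairKey = DocumentId || ':' || DocumentSessionId"
)
# Because the value is stored, it can be indexed.
conn.execute("CREATE INDEX IX_PairKey ON DocumentOutputItems (PairKey)")

via_key = conn.execute(
    "SELECT COUNT(DISTINCT PairKey) FROM DocumentOutputItems"
).fetchone()[0]
via_subquery = conn.execute(
    "SELECT COUNT(*) FROM (SELECT DISTINCT DocumentId, DocumentSessionId "
    "FROM DocumentOutputItems)"
).fetchone()[0]

print(via_key, via_subquery)  # both counts should agree
```

Whether this actually improves performance depends on the engine and the data; the sketch only checks that the two counts match.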
Answer by JayTee
Edit: Altered from the less-than-reliable checksum-only query.
I've discovered a way to do this (in SQL Server 2005) that works pretty well for me, and I can use as many columns as I need (by adding them to the CHECKSUM() function). The REVERSE() function turns the ints into varchars to make the distinct more reliable.
SELECT COUNT(DISTINCT (CHECKSUM(DocumentId,DocumentSessionId)) + CHECKSUM(REVERSE(DocumentId),REVERSE(DocumentSessionId)) )
FROM DocumentOutPutItems
Answer by APC
What is it about your existing query that you don't like? If you are concerned that DISTINCT across two columns does not return just the unique permutations, why not try it?
It certainly works as you might expect in Oracle.
SQL> select distinct deptno, job from emp
2 order by deptno, job
3 /
DEPTNO JOB
---------- ---------
10 CLERK
10 MANAGER
10 PRESIDENT
20 ANALYST
20 CLERK
20 MANAGER
30 CLERK
30 MANAGER
30 SALESMAN
9 rows selected.
SQL> select count(*) from (
2 select distinct deptno, job from emp
3 )
4 /
COUNT(*)
----------
9
SQL>
Edit
I went down a blind alley with analytics but the answer was depressingly obvious...
SQL> select count(distinct concat(deptno,job)) from emp
2 /
COUNT(DISTINCTCONCAT(DEPTNO,JOB))
---------------------------------
9
SQL>
Edit 2
Given the following data, the concatenating solution provided above will miscount:
col1 col2
---- ----
A AA
AA A
So we need to include a separator...
select col1 || '*' || col2 from t23
/
Obviously the chosen separator must be a character, or set of characters, which can never appear in either column.
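The miscount is easy to reproduce. The following is a small sketch using Python's built-in sqlite3 (an assumption made for runnability; the answer itself uses Oracle), with the two rows shown above and '*' as the separator.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t23 (col1 TEXT, col2 TEXT)")
conn.executemany("INSERT INTO t23 VALUES (?, ?)", [("A", "AA"), ("AA", "A")])


def scalar(sql):
    """Run a query that returns a single value and return that value."""
    return conn.execute(sql).fetchone()[0]


# Without a separator both rows concatenate to 'AAA', so they look identical.
no_separator = scalar("SELECT COUNT(DISTINCT col1 || col2) FROM t23")

# With a separator the rows become 'A*AA' and 'AA*A', which stay distinct.
with_separator = scalar("SELECT COUNT(DISTINCT col1 || '*' || col2) FROM t23")

print(no_separator, with_separator)
```

The separator only helps as long as it cannot occur in the data: rows ('A*', 'A') and ('A', '*A') would collide again under '*'.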
Answer by spelunk1
To run as a single query, concatenate the columns, then get the distinct count of instances of the concatenated string.
SELECT count(DISTINCT concat(DocumentId, DocumentSessionId)) FROM DocumentOutputItems;
In MySQL you can do the same thing without the concatenation step as follows:
SELECT count(DISTINCT DocumentId, DocumentSessionId) FROM DocumentOutputItems;
This feature is mentioned in the MySQL documentation:
http://dev.mysql.com/doc/refman/5.7/en/group-by-functions.html#function_count-distinct
Answer by Trevor Tippins
How about something like:
select count(*) from (select count(*) cnt from DocumentOutputItems group by DocumentId, DocumentSessionId) t1
This probably does the same as what you are already doing, but it avoids the DISTINCT.
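A quick way to check that the GROUP BY form matches the DISTINCT form is to run both against the same rows. This sketch uses Python's built-in sqlite3 and hypothetical sample data; it says nothing about how SQL Server would plan either query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE DocumentOutputItems (DocumentId INTEGER, DocumentSessionId INTEGER)"
)
conn.executemany(
    "INSERT INTO DocumentOutputItems VALUES (?, ?)",
    [(1, 10), (1, 10), (1, 11), (2, 10), (2, 10)],
)

# The inner query yields one row per (DocumentId, DocumentSessionId) group;
# the outer query counts those rows.
grouped_count = conn.execute(
    "SELECT COUNT(*) FROM (SELECT COUNT(*) AS cnt FROM DocumentOutputItems "
    "GROUP BY DocumentId, DocumentSessionId) t1"
).fetchone()[0]

distinct_count = conn.execute(
    "SELECT COUNT(*) FROM (SELECT DISTINCT DocumentId, DocumentSessionId "
    "FROM DocumentOutputItems) AS internalQuery"
).fetchone()[0]

print(grouped_count, distinct_count)
```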
Answer by Alexander Kjäll
Here's a shorter version without the subselect:
SELECT COUNT(DISTINCT DocumentId, DocumentSessionId) FROM DocumentOutputItems
It works fine in MySQL, and I think that the optimizer has an easier time understanding this one.
Edit: Apparently I misread MSSQL and MySQL - sorry about that, but maybe it helps anyway.
Answer by karmakaze
Many (most?) SQL databases can treat tuples like values, so you can just do:
SELECT COUNT(DISTINCT (DocumentId, DocumentSessionId))
FROM DocumentOutputItems;
If your database doesn't support this, it can be simulated as per @oncel-umut-turer's suggestion of CHECKSUM or another scalar function providing good uniqueness, e.g. COUNT(DISTINCT CONCAT(DocumentId, ':', DocumentSessionId)).
A related use of tuples is performing IN queries such as:
SELECT * FROM DocumentOutputItems
WHERE (DocumentId, DocumentSessionId) in (('a', '1'), ('b', '2'));
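Support for tuple syntax varies between engines. As one data point, this sketch runs a tuple IN query in SQLite through Python's sqlite3 module. It assumes SQLite 3.15 or later, where the right-hand side has to be a subquery, so the literal list is written with VALUES instead of the nested-parentheses form above. The sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE DocumentOutputItems (DocumentId TEXT, DocumentSessionId TEXT)"
)
conn.executemany(
    "INSERT INTO DocumentOutputItems VALUES (?, ?)",
    [("a", "1"), ("a", "2"), ("b", "2"), ("c", "3")],
)

# Row-value IN: both columns must match one of the listed pairs.
matches = conn.execute(
    "SELECT DocumentId, DocumentSessionId FROM DocumentOutputItems "
    "WHERE (DocumentId, DocumentSessionId) IN (VALUES ('a', '1'), ('b', '2')) "
    "ORDER BY DocumentId"
).fetchall()

print(matches)
```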
Answer by Bliek
There's nothing wrong with your query, but you could also do it this way:
WITH internalQuery (Amount)
AS
(
SELECT (0)
FROM DocumentOutputItems
GROUP BY DocumentId, DocumentSessionId
)
SELECT COUNT(*) AS NumberOfDistinctRows
FROM internalQuery
Answer by IordanTanev
I hope this works; I am writing it prima vista.
SELECT COUNT(*)
FROM DocumentOutputItems
GROUP BY DocumentId, DocumentSessionId
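Trying this on a few hypothetical rows (with Python's built-in sqlite3 as a stand-in engine, which is an assumption) suggests the query returns one count per group instead of a single total. The number of rows returned is then the distinct count, so it still has to be counted by the caller or by an outer query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE DocumentOutputItems (DocumentId INTEGER, DocumentSessionId INTEGER)"
)
conn.executemany(
    "INSERT INTO DocumentOutputItems VALUES (?, ?)",
    [(1, 10), (1, 10), (1, 11), (2, 10), (2, 10)],
)

# One result row per (DocumentId, DocumentSessionId) group, each holding
# the number of source rows in that group.
per_group = conn.execute(
    "SELECT COUNT(*) FROM DocumentOutputItems "
    "GROUP BY DocumentId, DocumentSessionId"
).fetchall()

# The distinct count is the number of result rows, not any single value.
number_of_groups = len(per_group)
print(per_group, number_of_groups)
```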
Answer by Jaanis Veinberg
I have used this approach and it has worked for me.
SELECT COUNT(DISTINCT DocumentID || DocumentSessionId)
FROM DocumentOutputItems
In my case, it gives the correct result.