Oracle SQL: find duplicate entries (within a group)
Original URL: http://stackoverflow.com/questions/182544/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me):
Stack Overflow
SQL to find duplicate entries (within a group)
Asked by FrankS
I have a small problem and I'm not sure what would be the best way to fix it, as I only have limited access to the database (Oracle) itself. In our Table "EVENT" we have about 160k entries, each EVENT has a GROUPID and a normal entry has exactly 5 rows with the same GROUPID. Due to a bug we currently get a couple of duplicate entries (duplicate, so 10 rows instead of 5, just a different EVENTID. This may change, so it's just <> 5). We need to filter all the entries of these groups.
Due to limited access to the database we can not use a temporary table, nor can we add an index to the GROUPID column to make it faster.
We can get the GROUPIDs with this query, but we would need a second query to get the needed data:
select A."GROUPID"
from "EVENT" A
group by A."GROUPID"
having count(A."GROUPID") <> 5
One solution would be a subselect:
select *
from "EVENT" A
where A."GROUPID" IN (
select B."GROUPID"
from "EVENT" B
group by B."GROUPID"
having count(B."GROUPID") <> 5
)
Without an index on GROUPID and 160k entries, this takes much too long. Tried thinking about a join that can handle this, but can't find a good solution so far.
Can anybody find a good solution for this?
Small edit: We don't have 100% duplicates here, as each entry still has a unique ID and the GROUPID is not unique either (that's why we need to use "group by") - or maybe I just miss an easy solution for it :)
Small example about the data (I don't want to delete it, just find it)
EVENTID | GROUPID | TYPEID
123456 123 12
123457 123 145
123458 123 2612
123459 123 41
123460 123 238
234567 123 12
234568 123 145
234569 123 2612
234570 123 41
234571 123 238
It has some more columns, like timestamp etc, but as you can see already, everything is identical, besides the EVENTID.
We will run it more often for testing, to find the bug and check if it happens again.
Answered by David Aldridge
A classic problem for analytic queries to solve:
select eventid,
groupid,
typeid
from (
Select eventid,
groupid,
typeid,
count(*) over (partition by groupid) count_by_group_id
from EVENT
)
where count_by_group_id <> 5
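For anyone who wants to try the analytic form without touching the Oracle database, here is a minimal sketch using Python's built-in sqlite3 module. It assumes a SQLite build with window-function support (3.25 or later). The table holds the sample rows from the question plus one made-up normal group (124). SQLite is only a stand-in here, so this says nothing about Oracle timings or plans.

```python
import sqlite3

# Hypothetical in-memory stand-in for the EVENT table (SQLite, not Oracle).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE event (eventid INTEGER PRIMARY KEY, groupid INTEGER, typeid INTEGER)"
)

type_ids = [12, 145, 2612, 41, 238]
rows = []
# Group 123 appears twice (10 rows), as in the question's sample data.
rows += [(123456 + i, 123, t) for i, t in enumerate(type_ids)]
rows += [(234567 + i, 123, t) for i, t in enumerate(type_ids)]
# Group 124 is a normal group with exactly 5 rows (made-up EVENTIDs).
rows += [(300001 + i, 124, t) for i, t in enumerate(type_ids)]
conn.executemany("INSERT INTO event VALUES (?, ?, ?)", rows)

# Count rows per group with a window function, then keep groups whose size is not 5.
analytic_sql = """
    SELECT eventid, groupid, typeid
    FROM (
        SELECT eventid, groupid, typeid,
               COUNT(*) OVER (PARTITION BY groupid) AS count_by_group_id
        FROM event
    )
    WHERE count_by_group_id <> 5
    ORDER BY eventid
"""
suspect_rows = conn.execute(analytic_sql).fetchall()
for row in suspect_rows:
    print(row)
```

Only the ten rows of group 123 come back; the table is read once and no second aggregate pass over it is written in the query.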
Answered by Walter Mitty
You can get the answer with a join instead of a subquery:
select
a.*
from
event a
inner join
(select groupid
from event
group by groupid
having count(*) <> 5) b
on a.groupid = b.groupid
This is a fairly common way of obtaining all the information from the rows in a group.
Like your suggested answer and the other responses, this will run a lot faster with an index on groupid. It's up to the DBA to balance the benefit of making your query run a lot faster against the cost of maintaining yet another index.
If the DBA decides against the index, make sure the appropriate people understand that it's the index strategy, and not the way you wrote the query, that is slowing things down.
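As a quick sanity check that the join form and the original IN form select the same rows, here is a small sketch using Python's built-in sqlite3 module as a stand-in for Oracle. The data is made up: group 1 is normal (5 rows), group 2 is duplicated (10 rows), and group 3 is short (3 rows).

```python
import sqlite3

# Hypothetical in-memory stand-in for the EVENT table (SQLite, not Oracle).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE event (eventid INTEGER PRIMARY KEY, groupid INTEGER, typeid INTEGER)"
)

# Made-up group sizes: only group 1 has the expected 5 rows.
sizes = {1: 5, 2: 10, 3: 3}
rows = []
next_id = 1
for groupid, size in sizes.items():
    for n in range(size):
        rows.append((next_id, groupid, n % 5))
        next_id += 1
conn.executemany("INSERT INTO event VALUES (?, ?, ?)", rows)

# Join against a derived table of the offending GROUPIDs.
join_sql = """
    SELECT a.eventid, a.groupid, a.typeid
    FROM event a
    INNER JOIN (
        SELECT groupid FROM event GROUP BY groupid HAVING COUNT(*) <> 5
    ) b ON a.groupid = b.groupid
    ORDER BY a.eventid
"""
# The subselect form from the question, for comparison.
in_sql = """
    SELECT eventid, groupid, typeid
    FROM event
    WHERE groupid IN (
        SELECT groupid FROM event GROUP BY groupid HAVING COUNT(*) <> 5
    )
    ORDER BY eventid
"""
join_rows = conn.execute(join_sql).fetchall()
in_rows = conn.execute(in_sql).fetchall()
print(len(join_rows), join_rows == in_rows)
```

Both forms return the 13 rows of groups 2 and 3. Whether one is faster than the other on Oracle depends on the plan the optimizer picks, so it is worth comparing the plans rather than assuming.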
Answered by Tony Andrews
How long does that SQL actually take? You are only going to run it once I presume, having fixed the bug that caused the corruption in the first place? I just set up a test case like this:
SQL> create table my_objects as
2 select object_name, ceil(rownum/5) groupid, rpad('x',500,'x') filler
3 from all_objects;
Table created.
SQL> select count(*) from my_objects;
COUNT(*)
----------
83782
SQL> select * from my_objects where groupid in (
2 select groupid from my_objects
3 group by groupid
4 having count(*) <> 5
5 );
OBJECT_NAME GROUPID FILLER
------------------------------ ---------- --------------------------------
XYZ 16757 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
YYYY 16757 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Elapsed: 00:00:01.67
Less than 2 seconds. OK, my table has half as many rows as yours, but 160K isn't huge. I added the filler column to make the table take up some disk space. The AUTOTRACE execution plan was:
-------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)|
-------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 389 | 112K| 14029 (2)|
|* 1 | HASH JOIN | | 389 | 112K| 14029 (2)|
| 2 | VIEW | VW_NSO_1 | 94424 | 1198K| 6570 (2)|
|* 3 | FILTER | | | | |
| 4 | HASH GROUP BY | | 1 | 1198K| 6570 (2)|
| 5 | TABLE ACCESS FULL| MY_OBJECTS | 94424 | 1198K| 6504 (1)|
| 6 | TABLE ACCESS FULL | MY_OBJECTS | 94424 | 25M| 6506 (1)|
-------------------------------------------------------------------------
Answered by paxdiablo
If your DBAs won't add an index to make this faster, ask them what they suggest you do (that's what they're paid for, after all). Presumably you have a business case why you need this information in which case your immediate management should be on your side.
Perhaps you could ask your DBAs to duplicate the data into a database where you could add an index.
Answered by Michael OShea
From a SQL perspective I think you've already answered your own question. The approach you've described (i.e. using the sub-select) is fine, and I'd be surprised if any other way of writing the query differed vastly in performance.
160K records doesn't seem like a lot to me. I could understand being unhappy with the performance of that query if it were going into a piece of application code, but from the sound of it you're just using it as part of some data cleansing exercise (and so I would expect you to be a little more tolerant in performance terms).
Even without any supporting index, it's still just two full table scans on 160K rows, which, frankly, I'd expect to complete in some sort of vaguely reasonable time.
Talk to your db administrators. They've helped create the problem, so let them be part of the solution.
/EDIT/ In the meantime, run the query you have. Find out how long it takes, rather than guessing. Even better would be to run it, with set autotrace on, and post the results here, then we might be able to help you refine it somewhat.
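To illustrate the "measure, don't guess" point, here is a rough sketch that times the sub-select on 160,000 made-up rows with no index on groupid. It uses Python's built-in sqlite3 module purely as a stand-in, so the number it prints says nothing about Oracle; on the real database, the autotrace output suggested above is the thing to look at.

```python
import sqlite3
import time

# Hypothetical in-memory stand-in for the EVENT table (SQLite, not Oracle).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE event (eventid INTEGER PRIMARY KEY, groupid INTEGER, typeid INTEGER)"
)

# 160,000 made-up rows, 5 per group, with no index on groupid.
rows = [(i, i // 5, i % 5) for i in range(160_000)]
# Duplicate group 0 so that it has 10 rows instead of 5.
rows += [(160_000 + n, 0, n) for n in range(5)]
conn.executemany("INSERT INTO event VALUES (?, ?, ?)", rows)

query = """
    SELECT *
    FROM event
    WHERE groupid IN (
        SELECT groupid FROM event GROUP BY groupid HAVING COUNT(*) <> 5
    )
"""
# Time the query, including fetching every result row.
start = time.perf_counter()
bad_rows = conn.execute(query).fetchall()
elapsed = time.perf_counter() - start
print(f"{len(bad_rows)} rows in {elapsed:.3f} s")
```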
Answered by ilitirit
Does this do what you want, and does it offer better performance? (I just thought I'd throw it in as a suggestion.)
select *
from group g
where (select count(*) from event e where g.groupid = e.groupid) <> 5
Answered by Dave Costa
How about an analytic:
SELECT * FROM (
SELECT eventid, groupid, typeid, COUNT(groupid) OVER (PARTITION BY groupid) group_count
FROM event
)
WHERE group_count <> 5