Warning: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you use or share it you must follow the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/2230295/

What's the best way to dedupe a table?

Tags: sql, algorithm, performance, duplicates

Asked by froadie

I've seen a couple of solutions for this, but I'm wondering what the best and most efficient way is to de-dupe a table. You can use code (SQL, etc.) to illustrate your point, but I'm just looking for basic algorithms. I assumed there would already be a question about this on SO, but I wasn't able to find one, so if it already exists just give me a heads up.

(Just to clarify - I'm referring to getting rid of duplicates in a table that has an incremental automatic PK and has some rows that are duplicates in everything but the PK field.)

Answered by Hank Gay

SELECT DISTINCT <insert all columns but the PK here> FROM foo. Create a temp table using that query (the syntax varies by RDBMS, but there's typically a SELECT … INTO or CREATE TABLE AS pattern available), then blow away the old table and pump the data from the temp table back into it.

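A minimal sketch of that pattern, assuming a table foo with an auto-increment PK id and data columns col1 and col2 (all names here are illustrative):

-- 1. Copy the distinct non-PK values into a temp table
CREATE TABLE foo_dedup AS
SELECT DISTINCT col1, col2 FROM foo;

-- 2. Empty the original table, then reload it from the temp table
--    (fresh PK values are generated on reinsert)
DELETE FROM foo;
INSERT INTO foo (col1, col2)
SELECT col1, col2 FROM foo_dedup;

-- 3. Clean up
DROP TABLE foo_dedup;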

Answered by Katherine

Using the analytic function ROW_NUMBER() (the DELETE through a CTE below is SQL Server syntax):

WITH CTE (col1, col2, dupcnt)
AS
(
SELECT col1, col2,
ROW_NUMBER() OVER (PARTITION BY col1, col2 ORDER BY col1) AS dupcnt
FROM YourTable
)
-- every row numbered 2 or higher is a duplicate within its group
DELETE
FROM CTE
WHERE dupcnt > 1
GO

Answered by HLGEM

Deduping is rarely simple. That's because the records to be deduped often have slightly different values in some of the fields, so choosing which record to keep can be problematic. Further, dups are often people records, and it is hard to tell whether two John Smiths are two people or one person duplicated. So spend a lot (50% or more of the whole project) of your time defining what constitutes a dup and how to handle the differences and child records.

How do you know which is the correct value? Further, deduping requires that you handle all child records without orphaning any. What happens when you find that by changing the id on a child record you are suddenly violating one of the unique indexes or constraints? This will happen eventually, and your process needs to handle it. If you have foolishly chosen to apply all your constraints only through the application, you may not even know the constraints are violated. When you have 10,000 records to dedup, you aren't going to go through the application to dedup one at a time. If the constraint isn't in the database, good luck maintaining data integrity when you dedup.

A further complication is that dups don't always match exactly on name or address. For instance, a sales rep named Joan Martin may be a dup of a sales rep named Joan Martin-Jones, especially if they have the same address and email. Or you could have John or Johnny in the name. Or the same street address, except one record abbreviated it to ST. and one spelled out Street. In SQL Server you can use SSIS and fuzzy grouping to identify near matches. These are often the most common dups, since the fact that they weren't exact matches is why they got put in as dups in the first place.

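As a rough illustration of near-match detection without SSIS, T-SQL's built-in SOUNDEX/DIFFERENCE functions can flag candidates for human review (the sales_reps table and its columns are hypothetical):

-- DIFFERENCE returns 0-4; 4 means the two SOUNDEX codes match exactly
SELECT a.id, b.id, a.rep_name, b.rep_name
FROM sales_reps a
JOIN sales_reps b ON a.id < b.id
WHERE DIFFERENCE(a.rep_name, b.rep_name) >= 3
  AND a.address = b.address;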

For some types of deduping, you may need a user interface, so that the person doing the deduping can choose which of two values to use for a particular field. This is especially true if the person being deduped is in two or more roles. It could be that the data for a particular role is usually better than the data for another role. Or it could be that only the users will know for sure which is the correct value, or they may need to contact people to find out whether they are genuinely dups or simply two people with the same name.

Answered by DropHit

Adding the actual code here for future reference

So, there are 3 steps, and therefore 3 SQL statements:

Step 1: Move the non-duplicates (unique tuples) into a temporary table

-- MySQL-specific: SELECT * with GROUP BY keeps one arbitrary row per group
-- (requires ONLY_FULL_GROUP_BY to be disabled)
CREATE TABLE new_table AS
SELECT * FROM old_table GROUP BY [column to remove duplicates by];

Step 2: Delete the old table (or rename it). We no longer need the table with all the duplicate entries, so drop it!

DROP TABLE old_table;

Step 3: Rename new_table to the old table's name

RENAME TABLE new_table TO old_table;

And of course, don't forget to fix your buggy code to stop inserting duplicates!

Answered by DShook

Here's the method I use, if you can get your dupe criteria into a GROUP BY clause and your table has an id identity column for uniqueness:

-- keeps the row with the lowest id in each group of duplicates
delete t
from tablename t
inner join
(
    select date_time, min(id) as min_id
    from tablename
    group by date_time
    having count(*) > 1
) t2 on t.date_time = t2.date_time
where t.id > t2.min_id

In this example date_time is the grouping criterion; if you have more than one column, make sure to join on all of them.

Answered by Taylor Brown

I am building on the one from DShook, providing a dedupe example where you keep only the record with the highest date.

In this example say I have 3 records all with the same app_id, and I only want to keep the one with the highest date:

-- keeps only the row with the highest processed_date per app_id
DELETE t
FROM @USER_OUTBOX_APPS t
INNER JOIN
(
    SELECT
         app_id
        ,max(processed_date) as max_processed_date
    FROM @USER_OUTBOX_APPS
    GROUP BY app_id
    HAVING count(*) > 1
) t2 on
    t.app_id = t2.app_id
WHERE
    t.processed_date < t2.max_processed_date

Answered by Demian Perry

For those of you who prefer a quick and dirty approach, just list all the columns that together define a unique record and create a unique index with those columns, like so:

ALTER IGNORE TABLE TABLE_NAME ADD UNIQUE (column1, column2, column3);

You can drop the unique index afterwards. (Note that ALTER IGNORE is MySQL-specific and was removed in MySQL 5.7.4.)

Answered by Jim X.

This can dedupe the duplicated values in c1, effectively keeping for each c1 the row with the smallest c2 (MINUS is Oracle syntax; standard SQL uses EXCEPT):

select * from foo
minus
select f1.* from foo f1, foo f2
where f1.c1 = f2.c1 and f1.c2 > f2.c2

Answered by FrustratedWithFormsDesigner

You could generate a hash for each row (excluding the PK), store it in a new column (or if you can't add new columns, can you move the table to a temp staging area?), and then look for all other rows with the same hash. Of course, you would have to be able to ensure that your hash function doesn't produce the same code for different rows.

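A minimal T-SQL sketch of this idea, assuming a table foo whose non-PK columns are col1 and col2 (HASHBYTES is SQL Server specific; the table and column names are illustrative):

-- store a per-row hash of the non-PK columns
ALTER TABLE foo ADD row_hash VARBINARY(32);
GO

UPDATE foo
SET row_hash = HASHBYTES('SHA2_256', CONCAT(col1, '|', col2));

-- any hash that appears more than once marks a group of candidate dups
SELECT row_hash, COUNT(*) AS dup_count
FROM foo
GROUP BY row_hash
HAVING COUNT(*) > 1;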

If two rows are duplicates, does it matter which one you get rid of? Is it possible that other data are dependent on both of the duplicates? If so, you will have to go through a few steps:

  • Find the dupes
  • Choose one of them, dupeA, to eliminate
  • Find all data dependent on dupeA
  • Alter that data to refer to dupeB instead (see the sketch after this list)
  • Delete dupeA
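
A hedged sketch of the last two steps, assuming a hypothetical child table orders that references customers(id); @dupeA_id and @dupeB_id are placeholders for the two duplicate ids:

-- repoint child rows from the duplicate we are removing...
UPDATE orders SET customer_id = @dupeB_id WHERE customer_id = @dupeA_id;
-- ...then the duplicate row can be deleted without orphaning anything
DELETE FROM customers WHERE id = @dupeA_id;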

This could be easy or complicated, depending on your existing data model.

This whole scenario sounds like a maintenance and redesign project. If so, best of luck!!

Answered by ron

For SQL (this is MySQL syntax), you may use INSERT IGNORE INTO table SELECT xy FROM unkeyed_table;

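A hedged MySQL sketch (table and column names are illustrative): INSERT IGNORE silently skips any row that would violate the target table's unique key, so loading through it drops the dups:

CREATE TABLE keyed_table (
    col1 VARCHAR(100),
    col2 VARCHAR(100),
    UNIQUE KEY uq_cols (col1, col2)
);

INSERT IGNORE INTO keyed_table
SELECT col1, col2 FROM unkeyed_table;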

For an algorithm: if you can assume that to-be-primary keys may be repeated, but a to-be-primary key uniquely identifies the content of its row, then hash only the to-be-primary key and check for repetition.

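In SQL the same check can be expressed with GROUP BY, which plays the role of the hash table here (candidate_key and unkeyed_table are illustrative names):

SELECT candidate_key, COUNT(*) AS occurrences
FROM unkeyed_table
GROUP BY candidate_key
HAVING COUNT(*) > 1;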