Delete duplicate rows from a MySQL table and keep only one row

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use or share it, but you must do so under the same license and attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/16030470/

Date: 2020-08-31 17:21:34 | Source: igfitidea

Delete duplicate rows from a MySQL table and keep only one row

mysql

Asked by Axeem

I want to delete all duplicate rows from a MySQL table, but the problem is that I do not know which rows are duplicates. The table contains roughly 500,000 rows, some of which are duplicates. Please guide me on how to do this.


UPDATE:


I need an SQL query that I can run in phpMyAdmin. Here is a rough table for illustration. Suppose the table name is foo.


+----+----------+--------------------+--------------------+
| id | link     | title              | description        |
+----+----------+--------------------+--------------------+
| 1  | google   | search engine      | search here free   |
| 2  | yahoo    | also search engine | findout web easily |
| 3  | Facebook | connect with world | meet with world    |
| 4  | google   | search engine      | search here free   |
| 5  | msn      | Microsoft network  | network by MS      |
| 6  | google   | search engine      | search here free   |
| 7  | msn      | Microsoft network  | network by MS      |
| 8  | yahoo    | also search engine | findout web easily |
| 9  | myweb    | my website         | ideal website      |
| .. | ...      | ...continue...     | ...                |
+----+----------+--------------------+--------------------+

This is only a rough sketch; I cannot show my table in full because it has roughly 500,000 rows. I hope this makes clear what I want. Roughly, the query I want is something like:


DELETE all duplicate rows FROM foo

EDIT
I see this question has been marked as a duplicate, but I think it is unique. I looked at the question it was compared with; the answer marked as useful there is for small tables and works by changing the index and adding a unique index. Here is the code from that answer:


  ALTER IGNORE TABLE jobs ADD UNIQUE INDEX idx_name (site_id, title, company );  

If you run this query, then from that point on, whenever you add data, MySQL checks whether it is already present and, if so, refuses to add it.
As I said, my table contains a lot of data. With this unique index in place, after I add even one row MySQL has to check uniqueness against roughly 500,000 rows, which slows things down. And that is just one new record; what if I want to insert 100,000 new records? It would be far too slow.
Most of the other answers I have seen use a HAVING clause, which is already slow.


Answer by Kickstart

You can probably do it using a JOIN in a DELETE, joining against a subselect.


More details are required to give much help, but for a rough idea:-


DELETE result
FROM result
INNER JOIN (SELECT SomeField, MAX(DateAdded) AS MaxDateAdded
            FROM result
            GROUP BY SomeField) b
    ON result.SomeField = b.SomeField
    AND result.DateAdded != b.MaxDateAdded

This finds every occurrence of SomeField together with its corresponding maximum DateAdded, and deletes any row that does not match that maximum.


I assume that you want to keep the latest record.


Note that mass deletes like this are a bit worrying, given that if you get it wrong you potentially delete all your records.

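One way to reduce that risk is to run the same join as a SELECT first and inspect the rows that would go, only switching to DELETE once the preview looks right. A minimal sketch, assuming a hypothetical result(SomeField, DateAdded) table, using SQLite via Python (SQLite lacks MySQL's multi-table DELETE ... JOIN, but the SELECT preview works the same way):

```python
import sqlite3

# Hypothetical table: one row per SomeField value should survive,
# keeping the latest DateAdded.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE result (SomeField TEXT, DateAdded TEXT);
INSERT INTO result VALUES
  ('a','2020-01-01'), ('a','2020-02-01'),
  ('b','2020-01-15');
""")

# Preview the doomed rows with a SELECT before running the DELETE.
preview = conn.execute("""
SELECT r.SomeField, r.DateAdded
FROM result r
JOIN (SELECT SomeField, MAX(DateAdded) AS MaxDateAdded
      FROM result GROUP BY SomeField) b
  ON r.SomeField = b.SomeField AND r.DateAdded != b.MaxDateAdded
""").fetchall()
print(preview)  # only the older 'a' row would be deleted
```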

EDIT - a version to go with the table you have now given. This will delete the duplicates, leaving only the first occurrence of each (i.e. for google you are left with the row with id 1):


DELETE foo 
FROM foo
INNER JOIN (SELECT link, title, description, MIN(id) AS MinId FROM foo GROUP BY link, title, description ) b
ON foo.link = b.link
AND foo.title = b.title 
AND foo.description = b.description
AND foo.id != b.MinId
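To try the keep-the-lowest-id idea on sample data before touching the real table, here is a minimal sketch using SQLite via Python. SQLite does not support MySQL's DELETE ... JOIN syntax, so the equivalent NOT IN form is used; the rows mirror the question's foo table.

```python
import sqlite3

# In-memory sample mirroring the question's foo table (assumed schema).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE foo (id INTEGER PRIMARY KEY, link TEXT, title TEXT, description TEXT);
INSERT INTO foo VALUES
  (1,'google','search engine','search here free'),
  (2,'yahoo','also search engine','findout web easily'),
  (4,'google','search engine','search here free'),
  (6,'google','search engine','search here free'),
  (8,'yahoo','also search engine','findout web easily');
""")

# "Keep the lowest id per duplicate group": SQLite's equivalent of the
# MySQL DELETE ... JOIN above.
conn.execute("""
DELETE FROM foo
WHERE id NOT IN (SELECT MIN(id) FROM foo
                 GROUP BY link, title, description)
""")

remaining = [row[0] for row in conn.execute("SELECT id FROM foo ORDER BY id")]
print(remaining)  # ids 1 and 2 survive; duplicates 4, 6, 8 are gone
```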

Answer by Vorsprung

Using a scripting language such as Perl or Python:


1) do a query to load each row


2) calculate a checksum over all the fields that should be unique. For example, if each (name, cost) pair should appear only once, calculate the checksum over those two fields. A hashing algorithm such as md5 is ideal for this


3) store each checksum with the "id" or some way that the row can be identified for a later delete


4) have a system of rules for deciding which one of a set of non-unique records is kept


5) once the query processing is complete and all the duplicates have been found, use the rules and the set of id / checksums to generate a SQL delete command (or a set of delete commands)


6) execute the delete command(s)


7) alter your database structure so there is a unique constraint to prevent this happening again


The database only has 500,000 rows, so storing the ids and checksums is well within the capabilities of most scripting languages.

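The steps above can be sketched as follows, assuming the question's foo table and a hypothetical "keep the lowest id" rule for step 4. The checksums live in an ordinary Python dict rather than a separate store:

```python
import hashlib
import sqlite3

# Sketch of the checksum approach: hash the fields that should be unique,
# remember the first id seen per hash (the "keep" rule), delete the rest.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE foo (id INTEGER PRIMARY KEY, link TEXT, title TEXT, description TEXT);
INSERT INTO foo VALUES
  (1,'google','search engine','search here free'),
  (4,'google','search engine','search here free'),
  (5,'msn','Microsoft network','network by MS'),
  (7,'msn','Microsoft network','network by MS');
""")

seen, doomed = {}, []
for id_, link, title, desc in conn.execute(
        "SELECT id, link, title, description FROM foo ORDER BY id"):
    digest = hashlib.md5(f"{link}|{title}|{desc}".encode()).hexdigest()
    if digest in seen:       # duplicate: an earlier row already claimed this hash
        doomed.append(id_)
    else:
        seen[digest] = id_   # rule: keep the lowest id

conn.executemany("DELETE FROM foo WHERE id = ?", [(d,) for d in doomed])
kept = [row[0] for row in conn.execute("SELECT id FROM foo ORDER BY id")]
print(kept)  # [1, 5]
```

After the script runs, adding a unique index (step 7) prevents the duplicates from coming back.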