Deleting duplicate rows from Redshift

Original question: http://stackoverflow.com/questions/37582261/

Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverflow
Asked by Neil
I am trying to delete some duplicate data in my Redshift table.
Below is my query:
With duplicates
As
(Select *, ROW_NUMBER() Over (PARTITION by record_indicator Order by record_indicator) as Duplicate From table_name)
delete from duplicates
Where Duplicate > 1 ;
This query is giving me an error.
Amazon Invalid operation: syntax error at or near "delete";
Not sure what the issue is, as the syntax for the WITH clause seems to be correct. Has anybody faced this situation before?
Answered by systemHyman
Redshift being what it is (no enforced uniqueness for any column), Ziggy's third option is probably best. Once we decide to go the temp table route, it is more efficient to swap things out whole. Deletes and inserts are expensive in Redshift.
begin;
-- Deep-copy only the distinct rows, then swap the new table in for the old.
create table table_name_new as select distinct * from table_name;
alter table table_name rename to table_name_old;
alter table table_name_new rename to table_name;
drop table table_name_old;
commit;
If space isn't an issue, you can keep the old table around for a while and use the other methods described here to validate that the row count in the original, accounting for duplicates, matches the row count in the new.
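For example, a minimal sanity check, assuming you skipped the drop table step above and kept table_name_old around:

-- The distinct row count of the old table should equal the
-- total row count of the deduplicated new one.
SELECT
    (SELECT COUNT(*) FROM (SELECT DISTINCT * FROM table_name_old) d) AS old_distinct_rows,
    (SELECT COUNT(*) FROM table_name) AS new_total_rows;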
If you're doing constant loads to such a table, you'll want to pause that process while this is going on.
If the number of duplicates is a small percentage of a large table, you might want to try copying distinct records of the duplicates to a temp table, then delete all records from the original that join with the temp. Then append the temp table back to the original. Make sure you vacuum the original table after (which you should be doing for large tables on a schedule anyway).
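A rough sketch of that approach; the table and column names are placeholders, and record_indicator is assumed to mark a logical record (as in the question) with the duplicates being exact row copies:

-- 1. Copy one distinct version of each duplicated record to a temp table.
CREATE TEMP TABLE dup_rows AS
SELECT DISTINCT *
FROM table_name
WHERE record_indicator IN (
    SELECT record_indicator
    FROM table_name
    GROUP BY record_indicator
    HAVING COUNT(*) > 1
);

-- 2. Delete every copy of those records from the original.
DELETE FROM table_name
USING dup_rows
WHERE table_name.record_indicator = dup_rows.record_indicator;

-- 3. Append the single copies back.
INSERT INTO table_name
SELECT * FROM dup_rows;

-- 4. Reclaim space and restore sort order after the heavy delete.
VACUUM table_name;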
Answered by Elliot Chance
If you're dealing with a lot of data, it's not always possible or smart to recreate the whole table. It may be easier to locate and delete those rows:
-- Wrap the sequence in a transaction so the delete and re-insert are
-- atomic (this pairs with the COMMIT at the end).
BEGIN;

-- First identify all the rows that are duplicate
CREATE TEMP TABLE duplicate_saleids AS
SELECT saleid
FROM sales
WHERE saledateid BETWEEN 2224 AND 2231
GROUP BY saleid
HAVING COUNT(*) > 1;
-- Extract one copy of all the duplicate rows
CREATE TEMP TABLE new_sales(LIKE sales);
INSERT INTO new_sales
SELECT DISTINCT *
FROM sales
WHERE saledateid BETWEEN 2224 AND 2231
AND saleid IN(
SELECT saleid
FROM duplicate_saleids
);
-- Remove all rows that were duplicated (all copies).
DELETE FROM sales
WHERE saledateid BETWEEN 2224 AND 2231
AND saleid IN(
SELECT saleid
FROM duplicate_saleids
);
-- Insert back in the single copies
INSERT INTO sales
SELECT *
FROM new_sales;
-- Cleanup
DROP TABLE duplicate_saleids;
DROP TABLE new_sales;
COMMIT;
Full article: https://elliot.land/post/removing-duplicate-data-in-redshift
Answered by Ziggy Crueltyfree Zeitgeister
That should have worked. Alternatively, you can do:
With duplicates As (
    Select *, ROW_NUMBER() Over (PARTITION by record_indicator
                                 Order by record_indicator) as Duplicate
    From table_name)
delete from table_name
where id in (select id from duplicates Where Duplicate > 1);
or
delete from table_name
where id in (
    select id
    from (
        Select id, ROW_NUMBER() Over (PARTITION by record_indicator
                                      Order by record_indicator) as Duplicate
        From table_name) x
    Where Duplicate > 1);
If you have no primary key, you can do the following:
BEGIN;
CREATE TEMP TABLE mydups ON COMMIT DROP AS
SELECT DISTINCT ON (record_indicator) *
FROM table_name
ORDER BY record_indicator --, other_optional_priority_field DESC
;
DELETE FROM table_name
WHERE record_indicator IN (
SELECT record_indicator FROM mydups);
INSERT INTO table_name SELECT * FROM mydups;
COMMIT;
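One caution on this last snippet: as far as I know, Redshift supports neither SELECT DISTINCT ON nor ON COMMIT DROP (both are PostgreSQL features). A rough Redshift-compatible equivalent of the same idea, using ROW_NUMBER and the same placeholder names, might look like this:

-- Keep one row per record_indicator (add a priority field to the
-- ORDER BY to control which copy survives).
CREATE TEMP TABLE mydups AS
SELECT *
FROM (SELECT *, ROW_NUMBER() OVER (PARTITION BY record_indicator
                                   ORDER BY record_indicator) AS rn
      FROM table_name) t
WHERE rn = 1;

-- Drop the helper column so mydups matches table_name's column list.
ALTER TABLE mydups DROP COLUMN rn;

BEGIN;
DELETE FROM table_name
WHERE record_indicator IN (SELECT record_indicator FROM mydups);
INSERT INTO table_name SELECT * FROM mydups;
COMMIT;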
Answered by Shivam Tyagi
Simple answer to this question:

- First, create a temporary table from the main table keeping only the rows where row_number = 1.
- Second, delete from the main table all the rows on which we had duplicates.
- Then insert the values of the temporary table into the main table.
Queries:

Temporary table:

select id, date
into #temp_a
from (
    select *
    from (
        select a.*,
               row_number() over (partition by id order by etl_createdon desc) as rn
        from table a
        where a.id between 59 and 75
          and a.date = '2018-05-24'
    ) b
    where rn = 1
) a;

Deleting all the rows from the main table:

delete from table a
where a.id between 59 and 75
  and a.date = '2018-05-24';

Inserting all values from the temp table into the main table:

insert into table a
select * from #temp_a;
Answered by Phil Scalo
The following deletes all records in 'tablename' that have a duplicate; it will not deduplicate the table:
DELETE FROM tablename
WHERE id IN (
SELECT id
FROM (
SELECT id,
ROW_NUMBER() OVER (partition BY column1, column2, column3 ORDER BY id) AS rnum
FROM tablename
) t
WHERE t.rnum > 1);
Answered by Matthijs
Your query does not work because Redshift does not allow DELETE after the WITH clause. Only SELECT and UPDATE and a few others are allowed (see WITH clause).
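To make the contrast concrete, here is the rejected shape next to an accepted one; a sketch that mirrors Ziggy's second option above and assumes the table has a unique id column:

-- Rejected by Redshift: a CTE followed by DELETE.
-- WITH duplicates AS (SELECT ...) DELETE FROM table_name ...;

-- Accepted: the CTE body folded into a subquery.
DELETE FROM table_name
WHERE id IN (
    SELECT id
    FROM (SELECT id,
                 ROW_NUMBER() OVER (PARTITION BY record_indicator
                                    ORDER BY record_indicator) AS dup
          FROM table_name) t
    WHERE dup > 1);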
Solution (in my situation):
I did have an id column on my table events that contained duplicate rows and uniquely identifies the record. This column id is the same as your record_indicator.
Unfortunately I was unable to create a temporary table because I ran into the following error using SELECT DISTINCT:
ERROR: Intermediate result row exceeds database block size
But this worked like a charm:
CREATE TABLE temp as (
SELECT *,ROW_NUMBER() OVER (PARTITION BY id ORDER BY id) AS rownumber
FROM events
);
resulting in the temp table:
id | rownumber | ...
--------------------
 1 |         1 | ...
 1 |         2 | ...
 2 |         1 | ...
 2 |         2 | ...
Now the duplicates can be deleted by removing the rows having rownumber larger than 1:
DELETE FROM temp WHERE rownumber > 1
After that, rename the tables and you're done.
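The rename step might look like this; a sketch, assuming the original table is events, and remembering to drop the helper rownumber column first so the new events matches the old definition:

-- Drop the helper column added for deduplication.
ALTER TABLE temp DROP COLUMN rownumber;

-- Swap the tables.
ALTER TABLE events RENAME TO events_old;
ALTER TABLE temp RENAME TO events;

-- Once validated, drop the old copy.
DROP TABLE events_old;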
Answered by Jai
This method will preserve the permissions and table definition of the original_table.
- Create a table with unique rows
CREATE TABLE unique_table as
(
SELECT DISTINCT * FROM original_table
)
;
- Back up the original_table
CREATE TABLE backup_table as
(
SELECT * FROM original_table
)
;
- Truncate the original_table
TRUNCATE original_table
- Insert records from unique_table into original_table
INSERT INTO original_table
(
SELECT * FROM unique_table
)
;
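One caveat: in Redshift, TRUNCATE commits the current transaction implicitly, so the steps above cannot be wrapped into a single atomic transaction as written. If atomicity matters more than speed, a DELETE-based sketch keeps everything in one transaction (at the cost of a slower delete on large tables):

BEGIN;

-- Capture one copy of each row before clearing the table.
CREATE TEMP TABLE unique_rows AS
SELECT DISTINCT * FROM original_table;

-- DELETE instead of TRUNCATE so the transaction stays open.
DELETE FROM original_table;

INSERT INTO original_table
SELECT * FROM unique_rows;

COMMIT;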