SQL - Deleting duplicate rows from Redshift

Warning: this content comes from StackOverflow and is provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/37582261/

Deleting duplicate rows from Redshift

Tags: sql, amazon-redshift, sql-delete

Asked by Neil

I am trying to delete some duplicate data in my redshift table.

Below is my query:

With duplicates
As
(Select *, ROW_NUMBER() Over (PARTITION by record_indicator Order by record_indicator) as Duplicate From table_name)
delete from duplicates
Where Duplicate > 1 ;

This query is giving me an error.

Amazon Invalid operation: syntax error at or near "delete";

Not sure what the issue is, as the syntax for the WITH clause seems to be correct. Has anybody faced this situation before?

Answer by systemHyman

Redshift being what it is (no enforced uniqueness for any column), Ziggy's third option is probably best. Once we decide to go the temp-table route, it is more efficient to swap things out whole. Deletes and inserts are expensive in Redshift.

begin;
-- build a deduplicated copy, swap the names, then drop the old table
create table table_name_new as select distinct * from table_name;
alter table table_name rename to table_name_old;
alter table table_name_new rename to table_name;
drop table table_name_old;
commit;

If space isn't an issue, you can keep the old table around for a while and use the other methods described here to validate that the row count in the original, accounting for duplicates, matches the row count in the new table.

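For example, if the old table is kept instead of dropped, a check along these lines compares the two counts (a sketch; it assumes the old copy is still named table_name_old):

SELECT
  (SELECT COUNT(*) FROM (SELECT DISTINCT * FROM table_name_old) d) AS old_distinct_rows,
  (SELECT COUNT(*) FROM table_name) AS new_rows;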

If you're doing constant loads to such a table, you'll want to pause that process while this is going on.

If the number of duplicates is a small percentage of a large table, you might want to try copying distinct records of the duplicates to a temp table, then delete all records from the original that join with the temp. Then append the temp table back to the original. Make sure you vacuum the original table after (which you should be doing for large tables on a schedule anyway).

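A minimal sketch of that final maintenance step, assuming the table is named table_name:

-- reclaim space and re-sort rows after the heavy delete/insert churn
VACUUM table_name;
-- refreshing planner statistics afterwards is common practice as well
ANALYZE table_name;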

Answer by Elliot Chance

If you're dealing with a lot of data, it's not always possible or smart to recreate the whole table. It may be easier to locate and delete those rows:

-- Run the whole sequence inside one transaction so the table is never seen half-done
BEGIN;

-- First identify all the rows that are duplicate
CREATE TEMP TABLE duplicate_saleids AS
SELECT saleid
FROM sales
WHERE saledateid BETWEEN 2224 AND 2231
GROUP BY saleid
HAVING COUNT(*) > 1;

-- Extract one copy of all the duplicate rows
CREATE TEMP TABLE new_sales(LIKE sales);

INSERT INTO new_sales
SELECT DISTINCT *
FROM sales
WHERE saledateid BETWEEN 2224 AND 2231
AND saleid IN(
     SELECT saleid
     FROM duplicate_saleids
);

-- Remove all rows that were duplicated (all copies).
DELETE FROM sales
WHERE saledateid BETWEEN 2224 AND 2231
AND saleid IN(
     SELECT saleid
     FROM duplicate_saleids
);

-- Insert back in the single copies
INSERT INTO sales
SELECT *
FROM new_sales;

-- Cleanup
DROP TABLE duplicate_saleids;
DROP TABLE new_sales;

COMMIT;

Full article: https://elliot.land/post/removing-duplicate-data-in-redshift

Answer by Ziggy Crueltyfree Zeitgeister

That should have worked. Alternatively, you can do:

With 
  duplicates As (
    Select *, ROW_NUMBER() Over (PARTITION by record_indicator
                                 Order by record_indicator) as Duplicate
    From table_name)
delete from table_name
where id in (select id from duplicates Where Duplicate > 1);

or

delete from table_name
where id in (
  select id
  from (
    Select id, ROW_NUMBER() Over (PARTITION by record_indicator
                                 Order by record_indicator) as Duplicate
    From table_name) x
  Where Duplicate > 1);

If you have no primary key, you can do the following:

BEGIN;
CREATE TEMP TABLE mydups ON COMMIT DROP AS
  SELECT DISTINCT ON (record_indicator) *
  FROM table_name
  ORDER BY record_indicator --, other_optional_priority_field DESC
;

DELETE FROM table_name
WHERE record_indicator IN (
  SELECT record_indicator FROM mydups);

INSERT INTO table_name SELECT * FROM mydups;
COMMIT;
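
Note that SELECT DISTINCT ON is a PostgreSQL extension that Redshift may not accept. A hedged equivalent of the same idea using ROW_NUMBER(), keeping one arbitrary row per record_indicator:

BEGIN;
CREATE TEMP TABLE mydups AS
  SELECT * FROM (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY record_indicator
                                 ORDER BY record_indicator) AS rn
    FROM table_name) t
  WHERE rn = 1;
-- drop the helper column so mydups matches the shape of table_name again
ALTER TABLE mydups DROP COLUMN rn;

DELETE FROM table_name
WHERE record_indicator IN (SELECT record_indicator FROM mydups);

INSERT INTO table_name SELECT * FROM mydups;
DROP TABLE mydups;
COMMIT;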

Answer by Shivam Tyagi

Simple answer to this question:

  1. First, create a temporary table from the main table, keeping only the rows where row_number = 1.
  2. Second, delete from the main table all the rows on which we had duplicates.
  3. Then insert the values of the temporary table into the main table.

Queries:

  1. Temporary table:

    -- keep one row per id (the latest etl_createdon) in a temp table
    select * into #temp_a from (
      select *
      from (
        select a.*, row_number() over (partition by id order by etl_createdon desc) as rn
        from table_name a
        where a.id between 59 and 75 and a.date = '2018-05-24'
      ) b
      where rn = 1
    ) a;
    -- drop the helper column so the temp table matches the main table's shape
    alter table #temp_a drop column rn;

  2. Deleting all the affected rows from the main table:

    delete from table_name
    where id between 59 and 75 and date = '2018-05-24';

  3. Inserting all values from the temp table back into the main table:

    insert into table_name select * from #temp_a;

Answer by Phil Scalo

The following deletes all records in 'tablename' that have a duplicate; it will not deduplicate the table, because duplicate rows share the same id, so the IN list matches every copy:

DELETE FROM tablename
WHERE id IN (
    SELECT id
    FROM (
          SELECT id,
          ROW_NUMBER() OVER (partition BY column1, column2, column3 ORDER BY id) AS rnum
          FROM tablename
         ) t
     WHERE t.rnum > 1);

Postgres administrative snippets

Answer by Matthijs

Your query does not work because Redshift does not allow DELETE after the WITH clause. Only SELECT, UPDATE, and a few others are allowed (see the WITH clause documentation).

Solution (in my situation):

I did have an id column on my table events that contained duplicate rows and uniquely identifies the record. This column id is the same as your record_indicator.

Unfortunately I was unable to create a temporary table because I ran into the following error using SELECT DISTINCT:

ERROR: Intermediate result row exceeds database block size

But this worked like a charm:

CREATE TABLE temp as (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY id ORDER BY id) AS rownumber
    FROM events
);

resulting in the temp table:

id | rownumber | ...
----------------
1  | 1         | ...
1  | 2         | ...
2  | 1         | ...
2  | 2         | ...

Now the duplicates can be deleted by removing the rows having rownumber larger than 1:

DELETE FROM temp WHERE rownumber > 1;

After that, rename the tables and you're done.

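A hedged sketch of that final swap (assuming the original table is events, as above); drop the helper column first so the new table matches the old definition:

BEGIN;
-- remove the helper column added for deduplication
ALTER TABLE temp DROP COLUMN rownumber;
ALTER TABLE events RENAME TO events_old;
ALTER TABLE temp RENAME TO events;
DROP TABLE events_old;
COMMIT;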

Answer by Jai

This method will preserve the permissions and the table definition of the original_table.

  1. Create a table with the unique rows:

CREATE TABLE unique_table as
(
   SELECT DISTINCT * FROM original_table
)
;

  2. Back up the original_table:

CREATE TABLE backup_table as
(
   SELECT * FROM original_table
)
;

  3. Truncate the original_table:

TRUNCATE original_table;

  4. Insert the records from unique_table into original_table:

INSERT INTO original_table
(
SELECT * FROM unique_table
)
;
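
One caveat: in Redshift, TRUNCATE implicitly commits the transaction it runs in, so steps 3 and 4 cannot be rolled back as a unit. A hedged variant that keeps the reload atomic replaces the TRUNCATE with an unqualified DELETE:

BEGIN;
-- unlike TRUNCATE, DELETE can be rolled back if the INSERT fails
DELETE FROM original_table;
INSERT INTO original_table
(
SELECT * FROM unique_table
)
;
COMMIT;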