SQL: How to find duplicate rows in Hive?

Notice: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/46748186/

Time: 2020-09-01 05:20:21  Source: igfitidea

How to find duplicate rows in Hive?

sql hive

Asked by Shekhar

I want to find duplicate rows in one of my Hive tables, and I was given two approaches for doing so.

The first approach is to use the following two queries:

select count(*) from mytable; -- this will give the total row count

The second query, below, gives the count of distinct rows:

select count(distinct primary_key1, primary_key2) from mytable;

With this approach, for one of my tables the total row count from the first query is 3500 and the second query gives a row count of 2700. That tells us that 3500 - 2700 = 800 rows are duplicates. But this approach doesn't tell me which rows are duplicated.

My second approach to finding duplicates is:

select primary_key1, primary_key2, count(*)
from mytable
group by primary_key1, primary_key2
having count(*) > 1;

The query above should list the rows that are duplicated and how many times each one is duplicated. But it returns zero rows, which would mean there are no duplicate rows in that table.

So I would like to know:

  1. Is my first approach correct? If yes, how do I find which rows are duplicated?
  2. Why does the second approach not return the list of duplicated rows?
  3. Is there any other way to find the duplicates?

Answered by Alex

Hive does not validate primary and foreign key constraints.

Since these constraints are not validated, an upstream system needs to ensure data integrity before it is loaded into Hive.

That means that Hive allows duplicates in Primary Keys.

To solve your issue, you should do something like this:

select [every column], count(*)
from mytable
group by [every column]
having count(*) > 1;

This way you will get the list of duplicated rows.
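
As a concrete sketch of the template above, assume a hypothetical mytable whose only columns are primary_key1, primary_key2, col_a and col_b (col_a and col_b are invented here for illustration; substitute the real column list). Every column appears in both the select list and the group by, so only rows that are identical in all columns are reported:

```sql
-- Assumption: mytable has exactly these four columns; col_a and col_b are hypothetical.
SELECT primary_key1, primary_key2, col_a, col_b,
       count(*) AS copies            -- how many times this exact row occurs
FROM mytable
GROUP BY primary_key1, primary_key2, col_a, col_b
HAVING count(*) > 1;                 -- keep only rows that occur more than once
```

Grouping by the key columns alone would instead report keys that are shared by rows that differ elsewhere.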

Answered by rabkaman

The analytic window function row_number() is quite useful and can identify duplicates based on the columns specified in the partition by clause. A simple inline view and an exists clause will then pinpoint which sets of records in the original table contain these duplicates. In some databases (such as Teradata, TD), you can forgo the inline view by using the QUALIFY clause.

SQL1 and SQL2 can be combined. SQL2: if you want to account for NULLs rather than simply dismiss them, a coalesce and concatenation might be better, as in the following query:

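SELECT count(1),
       -- concat() replaces +, which is arithmetic in Hive; the '|' separator (an added assumption, for string keys) keeps ('ab','c') distinct from ('a','bc')
       count(DISTINCT concat(coalesce(keypart1, ''), '|', coalesce(keypart2, '')))
  FROM srcTable s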
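
One possible explanation for the mismatch in the question, offered as a sketch rather than a diagnosis: Hive's count(DISTINCT a, b) counts only rows where the expressions are non-NULL, while GROUP BY keeps NULL as an ordinary key value. If many rows have a NULL in one key part, the distinct count drops even though no key combination repeats. Assuming the same srcTable, keypart1 and keypart2 names used above, a quick check could look like this:

```sql
-- Sketch: how many rows have at least one NULL key part?
SELECT count(*) AS total_rows,
       sum(CASE WHEN keypart1 IS NULL OR keypart2 IS NULL
                THEN 1 ELSE 0 END) AS rows_with_null_key
FROM srcTable;
```

If rows_with_null_key is close to the gap between the two counts, the "missing" rows are more likely NULL keys than true duplicates.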

3) This finds all records for the duplicated keys, not just the extra (> 1) copies. It returns all the context data as well as the keys, which is useful when analyzing why you have dups, rather than seeing the keys alone.

select * from srcTable s
where exists
    ( select 1 from (
                SELECT
                      keypart1,
                      keypart2,
                      -- numbers the rows within each key; any seq > 1 means the key repeats
                      row_number() over( partition by keypart1, keypart2 ) seq
                FROM srcTable t
                -- WHERE (whatever additional filtering you want)
                ) d
                where d.seq > 1
                AND d.keypart1 = s.keypart1
                AND d.keypart2 = s.keypart2
    );
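
An alternative sketch, not part of the original answer: a windowed count(*) can attach the group size to every row, which avoids the exists clause and the second pass over the table. It assumes the same srcTable, keypart1 and keypart2 names; dup_count and the alias x are names introduced here for illustration.

```sql
-- Sketch: tag each row with the number of rows sharing its key, then filter.
SELECT *
FROM (
    SELECT s.*,
           count(*) OVER (PARTITION BY keypart1, keypart2) AS dup_count
    FROM srcTable s
) x
WHERE dup_count > 1;   -- every row whose key occurs more than once
```

Because PARTITION BY places NULL key values in the same partition, repeated NULL keys are reported here as well, whereas an equality comparison on the keys would not match them.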

Answered by Maneesh Bishnoi

Suppose you want to get duplicate rows based on a particular column, ID here. The query below will give you all the IDs that are duplicated in the Hive table.

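-- ID is left unquoted: in Hive, double quotes make a string literal, not an identifier
SELECT ID
FROM mytable  -- placeholder table name; TABLE itself is a reserved word
GROUP BY ID
HAVING count(ID) > 1;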
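
To see the full rows behind those IDs rather than only the ID values, one option is to join the table back to the list of duplicated IDs. This is a hedged sketch that assumes the table is called mytable and the column is ID; the aliases t and d are introduced here.

```sql
-- Sketch: return every row whose ID occurs more than once.
SELECT t.*
FROM mytable t
JOIN (
    SELECT ID
    FROM mytable
    GROUP BY ID
    HAVING count(*) > 1
) d
  ON t.ID = d.ID;
```

Rows with a NULL ID never satisfy the join condition, so they would need a separate check.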