SQL 比较 HIVE 中的两个表的相等性

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/31808672/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-09-01 03:53:09  来源:igfitidea点击:

Comparing two tables for equality in HIVE

sqljoinhiveleft-joinhiveql

提问by Danzo

I have two tables, table1 and table2. Each with the same columns:

我有两个表,table1 和 table2。每个都有相同的列:

key, c1, c2, c3

I want to check to see if these tables are equal to eachother (they have the same rows). So far I have these two queries (<> = not equal in HIVE):

我想检查这些表是否彼此相等(它们具有相同的行)。到目前为止,我有这两个查询(<> = 在 HIVE 中不相等):

select count(*) from table1 t1 
left outer join table2 t2
on t1.key=t2.key
where t2.key is null or t1.c1<>t2.c1 or t1.c2<>t2.c2 or t1.c3<>t2.c3

And

select count(*) from table1 t1
left outer join table2 t2
on t1.key=t2.key and t1.c1=t2.c1 and t1.c2=t2.c2 and t1.c3=t2.c3
where t2.key is null

So my idea is that, if a zero count is returned, the tables are the same. However, I'm getting a zero count for the first query, and a non-zero count for the second query. How exactly do they differ? If there is a better way to check this certainly let me know.

所以我的想法是,如果返回零计数,则表是相同的。但是,第一个查询的计数为零,第二个查询的计数为非零。它们究竟有何不同?如果有更好的方法来检查这个肯定让我知道。

回答by Klas Lindb?ck

The first one excludes rows where t1.c1, t1.c2, t1.c3, t2.c1, t2.c2, or t2.c3 is null. That means that you effectively doing an inner join.

第一个排除 t1.c1、t1.c2、t1.c3、t2.c1、t2.c2 或 t2.c3 为空的行。这意味着您有效地进行了内部联接。

The second one will find rows that exist in t1 but not in t2.

第二个将查找存在于 t1 但不在 t2 中的行。

To also find rows that exist in t2 but not in t1 you can do a full outer join. The following SQL assumes that all columns are NOT NULL:

要同时查找 t2 中存在但 t1 中不存在的行,您可以执行完整的外部联接。以下 SQL 假设所有列都是NOT NULL

select count(*) from table1 t1
full outer join table2 t2
on t1.key=t2.key and t1.c1=t2.c1 and t1.c2=t2.c2 and t1.c3=t2.c3
where t1.key is null /* this condition matches rows that only exist in t2 */
   or t2.key is null /* this condition matches rows that only exist in t1 */

回答by Gordon Linoff

If you want to check for duplicates andthe tables have exactly the same structure andthe tables do not have duplicates within them, then you can do:

如果要检查重复项并且表具有完全相同的结构并且表中没有重复项,则可以执行以下操作:

select t.key, t.c1, t.c2, t.c3, count(*) as cnt
from ((select t1.*, 1 as which from table1 t1) union all
      (select t2.*, 2 as which from table2 t2)
     ) t
group by t.key, t.c1, t.c2, t.c3
having cnt <> 2;

There are various ways that you can relax the conditions in the first paragraph, if necessary.

如有必要,您可以通过多种方式放松第一段中的条件。

Note that this version also works when the columns have NULLvalues. These might be causing the problem with your data.

请注意,当列有NULL值时,此版本也适用。这些可能会导致您的数据出现问题。

回答by Sourygna

I would recommend you not using any JOINs to try to compare tables:

我建议您不要使用任何 JOIN 来尝试比较表:

  • it is quite an expensive operations when tables are big (which is often the case in Hive)
  • it can give problems when some rows/"join keys" are repeated
  • 当表很大时,这是一项非常昂贵的操作(在 Hive 中通常是这种情况)
  • 当某些行/“连接键”重复时,它可能会出现问题

(and it can also be unpractical when data are in different clusters/datacenters/clouds).

(当数据位于不同的集群/数据中心/云中时,它也可能不切实际)。

Instead, I think using a checksum approach and comparing the checksums of both tables is best.

相反,我认为最好使用校验和方法并比较两个表的校验和。

I have developed a Python script that allows you to do easily such comparison, and see the differences in a webbrowser:

我开发了一个 Python 脚本,可以让您轻松进行此类比较,并查看网络浏览器中的差异:

https://github.com/bolcom/hive_compared_bq

https://github.com/bolcom/hive_compared_bq

I hope that can help you!

我希望能帮到你!

回答by Sandeep Kumar

One easy solution is to do inner join. Let's suppose we have two hive tables namely table1 and table2. Both the table has same column namely col1, col2 and col3. The number of rows should also be same. Then the command would be as follows

一种简单的解决方案是进行内连接。假设我们有两个 hive 表,即 table1 和 table2。两个表都有相同的列,即 col1、col2 和 col3。行数也应该相同。然后命令如下

**

**

select count(*) from table1 
inner join table2 
on  table1.col1 = table2.col1 
and table1.col2 = table2.col2
and table1.col3 = table2.col3 ;

**

**

If the output value is same as number of rows in table1 and table2 , then all the columns has same value, If however the output count is lesser than there are some data which are different.

如果输出值与 table1 和 table2 中的行数相同,则所有列都具有相同的值,但是如果输出计数小于某些数据不同。

回答by pkhabya

First get count for both the tables C1 and C2. C1 and C2 should be equal. C1 and C2 can be obtained from the following query

首先计算表 C1 和 C2 的计数。C1 和 C2 应该相等。C1 和 C2 可以从以下查询中获得

select count(*) from table1

if C1 and C2 are not equal, then the tables are not identical.

如果 C1 和 C2 不相等,则表不相同。

2: Find distinct count for both the tables DC1 and DC2. DC1 and DC2 should be equal. Number of distinct records can be found using the following query:

2:为表 DC1 和 DC2 找到不同的计数。DC1 和 DC2 应该相等。可以使用以下查询找到不同记录的数量:

select count(*) from (select distinct * from table1)

if DC1 and DC2 are not equal, the tables are not identical.

如果 DC1 和 DC2 不相等,则表不相同。

3: Now get the number of records obtained by performing a union on the 2 tables. Let it be U. Use the following query to get the number of records in a union of 2 tables:

3:现在获取通过对 2 个表执行联合获得的记录数。设为 U。 使用以下查询获取 2 个表的联合中的记录数:

SELECT count (*)
 FROM 
    (SELECT *
    FROM table1
    UNION
    SELECT *
    FROM table2)

You can say that the data in the 2 tables is identical if distinct count for the 2 tables is equal to the number of records obtained by performing union of the 2 tables. ie DC1 = U and DC2 = U

如果2个表的distinct count等于2个表合并得到的记录数,可以说2个表中的数据是相同的。即 DC1 = U 和 DC2 = U

回答by Randy

another variant

另一个变种

select c1-c2 "different row counts"
, c1-c3 "mismatched rows" 
from 
( select count(*) c1 from table1)
,( select count(*) c2 from table2 )
,(select count(*) c3 from table1 t1, table2 t2
    where t1.key= t2.key
    and T1.c1=T2.c1 )

回答by YLG

Try with WITH Clause:

尝试使用 WITH 子句:

With cnt as(
   select count(*) cn1 from table1
   )
   select 'X' from dual,cnt where cnt.cn1 = (select count(*) from table2); 

回答by A Hocevar

Use a MINUSoperator:

使用MINUS运算符:

SELECT count(*) FROM
  (SELECT t1.c1, t1.c2, t1.c3 from table1 t1
    MINUS
  SELECT t2.c1, t2.c2, t2.c3 from table2 t2)