SQL 为什么多表连接会产生重复的行?

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/23786401/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-09-01 01:55:37  来源:igfitidea点击:

Why do multiple-table joins produce duplicate rows?

sqljoin

提问by sirdank

Let's say I have three tables A, B, and C. Each has two columns: a primary key and some other piece of data. They each have the same number of rows. If I JOINA and B on the primary key, I should end up with the same number of rows as are in either of them (as opposed to A.rows * B.rows).

假设我有三个表 A、B 和 C。每个表都有两列:一个主键和一些其他数据。它们每个都有相同的行数。如果我JOINA 和 B 在主键上,我应该得到与它们中任何一个相同的行数(而不是 A.rows * B.rows)。

Now, if I JOINA JOIN Bwith C, why do I end up with duplicate rows? I have run into this problem on several occasions and I do not understand it. It seems like it should produce the same result as JOINing Aand Bsince it has the same number of rows but, instead, duplicates are produced.

现在,如果我JOINA JOIN B使用C,为什么我最终会得到重复的行?我曾多次遇到这个问题,但我不明白。似乎它应该产生与JOINing相同的结果,A并且B因为它具有相同的行数,但相反,会产生重复项。

Queries that produce results like this are of the format

产生这样的结果的查询的格式是

SELECT *
FROM M
    INNER JOIN S
        on M.mIndex = S.mIndex
    INNER JOIN D
        ON M.platformId LIKE '%' + D.version + '%'
    INNER JOIN H
        ON D.Name = H.Name
        AND D.revision = H.revision

Here are schemas for the tables. H contains is a historic table containing everything that was ever in D. There are many M rows for each D and one S for each M.

以下是表的模式。H contains 是一个历史表,其中包含 D 中曾经存在的所有内容。 每个 D 有许多 M 行,每个 M 有一个 S。

Table M

表 M

    [mIndex] [int] NOT NULL PRIMARY KEY,
    [platformId] [nvarchar](256) NULL,
    [ip] [nvarchar](64) NULL,
    [complete] [bit] NOT NULL,
    [date] [datetime] NOT NULL,
    [DeployId] [int] NOT NULL PRIMARY KEY REFERENCES D.DeployId,
    [source] [nvarchar](64) NOT NULL PRIMARY KEY

Table S

表 S

[order] [int] NOT NULL PRIMARY KEY,
[name] [nvarchar](64) NOT NULL,
[parameters] [nvarchar](256) NOT NULL,
[Finished] [bit] NOT NULL,
[mIndex] [int] NOT NULL PRIMARY KEY,
[mDeployId] [int] NOT NULL PRIMARY KEY,
[Date] [datetime] NULL,
[status] [nvarchar](10) NULL,
[output] [nvarchar](max) NULL,
[config] [nvarchar](64) NOT NULL PRIMARY KEY

Table D

表 D

[Id] [int] IDENTITY(1,1) NOT NULL PRIMARY KEY,
[branch] [nvarchar](64) NOT NULL,
[revision] [int] NOT NULL,
[version] [nvarchar](64) NOT NULL,
[path] [nvarchar](256) NOT NULL

Table H

表 H

[IdDeploy] [int] IDENTITY(1,1) NOT NULL,
[name] [nvarchar](64) NOT NULL,
[version] [nvarchar](64) NOT NULL,
[path] [nvarchar](max) NOT NULL,
[StartDate] [datetime] NOT NULL,
[EndDate] [datetime] NULL,
[Revision] [nvarchar](64) NULL,

I didn't post the tables and query initially because I am more interested in understanding this problem for myself and avoiding it in the future.

我最初没有发布表格和查询,因为我更感兴趣的是自己了解这个问题并在将来避免它。

采纳答案by Joseph B

If one of the tables M, S, D, or Hhas more than one row for a given Id(if just the Idcolumn is not the Primary Key), then the query would result in "duplicate" rows. If you have more than one row for an Idin a table, then the other columns, which would uniquely identify a row, also must be included in the JOIN condition(s).

如果表M, S, D, or 中的一个表H具有多于一行的给定Id(如果该Id列不是主键),则查询将导致“重复”行。如果表中有多个行Id,则唯一标识行的其他列也必须包含在 JOIN 条件中。

References:

参考资料

Related Question on MSDN Forum

MSDN论坛上的相关问题

回答by HLGEM

When you have related tables you often have one-to-many or many-to-many relationships. So when you join to TableB each record in TableA many have multiple records in TableB. This is normal and expected.

当您有相关表时,您通常会有一对多或多对多的关系。因此,当您加入 TableB 时,TableA 中的每条记录在 TableB 中都有多个记录。这是正常和预期的。

Now at times you only need certain columns and those are all the same for all the records, then you would need to do some sort of group by or distinct to remove the duplicates. Let's look at an example:

现在有时您只需要某些列并且所有记录的列都是相同的,那么您需要进行某种分组或不同以删除重复项。让我们看一个例子:

TableA
Id Field1
1  test
2  another test

TableB
ID Field2 field3
1  Test1  something
1  test1  More something
2  Test2  Anything

So when you join them and select all the files you get:

因此,当您加入他们并选择您获得的所有文件时:

select * 
from tableA a 
join tableb b on a.id = b.id

a.Id a.Field1        b.id   b.field2  b.field3
1    test            1      Test1     something
1    test            1      Test1     More something
2    another test 2  2      Test2     Anything

These are not duplicates because the values of Field3 are different even though there are repeated values in the earlier fields. Now when you only select certain columns the same number of records are being joined together but since the columns with the different information is not being displayed they look like duplicates.

这些不是重复的,因为 Field3 的值不同,即使在较早的字段中有重复的值。现在,当您只选择某些列时,相同数量的记录被连接在一起,但由于没有显示具有不同信息的列,它们看起来像是重复的。

select a.Id, a.Field1,  b.field2
from tableA a 
join tableb b on a.id = b.id

a.Id a.Field1       b.field2  
1    test           Test1     
1    test           Test1 
2    another test   Test2

This appears to be duplicates but it is not because of the multiple records in TableB.

这似乎是重复的,但这不是因为 TableB 中有多个记录。

You normally fix this by using aggregates and group by, by using distinct or by filtering in the where clause to remove duplicates. How you solve this depends on exactly what your business rule is and how your database is designed and what kind of data is in there.

您通常通过使用聚合和分组依据、通过使用 distinct 或在 where 子句中过滤以删除重复项来解决此问题。你如何解决这个问题取决于你的业务规则是什么,你的数据库是如何设计的,以及那里有什么样的数据。

回答by Jonathan Josh

This might sound like a really basic "DUH" answer, but make sure that the column you're using to Lookup from on the merging file is actually full of unique values!

这可能听起来像是一个非常基本的“DUH”答案,但请确保您用于从合并文件中查找的列实际上充满了唯一值!

I noticed earlier today that PowerQuery won't throw you an error (like in PowerPivot) and will happily allow you to run a Many-Many merge. This will result in multiple rows being produced for each record that matches with a non-unique value.

我今天早些时候注意到 PowerQuery 不会向您抛出错误(就像在 PowerPivot 中一样),并且会很高兴地允许您运行多对多合并。这将导致为与非唯一值匹配的每个记录生成多行。

回答by CathalMF

Ok in this example you are getting duplicates because you are joining both D and S onto M. I assume you should be joining D.id onto S.id like below:

好的,在这个例子中,你得到了重复,因为你将 D 和 S 都加入到 M 上。我假设你应该像下面一样将 D.id 加入到 S.id 上:

SELECT *
FROM M
INNER JOIN S
    on M.Id = S.Id
INNER JOIN D
    ON S.Id = D.Id
INNER JOIN H
    ON D.Id = H.Id