SQL: Inner joining two massive tables
Original question: http://stackoverflow.com/questions/1750001/
Disclaimer: this page is a translation of a popular StackOverFlow question and its answers, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same CC BY-SA license, attribute it to the original authors (not me), and link back to the original: StackOverFlow
Asked by João Pereira
I have two massive tables with about 100 million records each, and I'm afraid I need to perform an inner join between the two. Now, both tables are very simple; here's the description:
BioEntity table:
- BioEntityId (int)
- Name (nvarchar 4000, although this is overkill)
- TypeId (int)
EGM table (an auxiliary table, in fact, resulting from bulk import operations):
- EMGId (int)
- PId (int)
- Name (nvarchar 4000, although this is overkill)
- TypeId (int)
- LastModified (date)
I need to get a matching Name in order to associate BioEntityId with the PId residing in the EGM table. Originally, I tried to do everything with a single inner join, but the query appeared to be taking way too long, and the log file of the database (in simple recovery mode) chewed up all the available disk space (just over 200 GB, while the database itself occupies 18 GB); if I'm not mistaken, the query failed after two days of waiting. I managed to keep the log from growing (it's only 33 MB now), but the query has now been running non-stop for six days and it doesn't look like it's going to stop anytime soon.
I'm running it on a fairly decent machine (4 GB RAM, Core 2 Duo E8400 at 3 GHz, Windows Server 2008, SQL Server 2008), and I've noticed that the computer occasionally freezes for a couple of seconds, roughly every 30 seconds (give or take). This makes it quite hard to use the machine for anything else, which is really getting on my nerves.
Now, here's the query:
SELECT EGM.Name, BioEntity.BioEntityId INTO AUX
FROM EGM INNER JOIN BioEntity
ON EGM.name LIKE BioEntity.Name AND EGM.TypeId = BioEntity.TypeId
I had manually set up some indexes; both EGM and BioEntity had a non-clustered covering index containing TypeId and Name. However, the query ran for five days and did not end either, so I tried running the Database Tuning Advisor to get the thing to work. It suggested deleting my older indexes and instead creating statistics and two clustered indexes (one on each table, containing just the TypeId, which I find rather odd - or just plain dumb - but I gave it a go anyway).
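For reference, the covering indexes described above would have looked something like this (a sketch; the index names are hypothetical, and Name is placed in INCLUDE rather than in the key because an nvarchar(4000) column exceeds SQL Server's 900-byte index key limit):

-- Covering indexes: seek on TypeId, with Name carried along to cover the query
CREATE NONCLUSTERED INDEX IX_EGM_TypeId_Name ON EGM (TypeId) INCLUDE (Name);
CREATE NONCLUSTERED INDEX IX_BioEntity_TypeId_Name ON BioEntity (TypeId) INCLUDE (Name);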
It has been running for six days now and I'm still not sure what to do... Any ideas, guys? How can I make this faster (or, at least, finite)?
Update: OK, I've canceled the query and rebooted the server to get the OS up and running again. I'm rerunning the workflow with your proposed changes, specifically cropping the nvarchar field to a much smaller size and swapping "like" for "=". This is going to take at least two hours, so I'll be posting further updates later on.
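For clarity, the reworked query would look something like this (a sketch of the change described above, assuming the Name columns have already been narrowed):

SELECT EGM.Name, BioEntity.BioEntityId INTO AUX
FROM EGM INNER JOIN BioEntity
ON EGM.Name = BioEntity.Name AND EGM.TypeId = BioEntity.TypeId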
Update 2 (1 PM GMT, 18/11/09): The estimated execution plan shows 67% of the cost in table scans, followed by a 33% hash match. Next comes 0% parallelism (isn't this strange? This is the first time I'm using the estimated execution plan, but this particular fact just raised my eyebrow), 0% hash match, more 0% parallelism, 0% top, 0% table insert, and finally another 0% select into. The indexes seem to be crap, as expected, so I'll be creating indexes manually and discarding the crappy suggested ones.
Accepted answer by Andomar
For huge joins, sometimes explicitly choosing a loop join speeds things up:
SELECT EGM.Name, BioEntity.BioEntityId INTO AUX
FROM EGM
INNER LOOP JOIN BioEntity
ON EGM.name LIKE BioEntity.Name AND EGM.TypeId = BioEntity.TypeId
As always, posting your estimated execution plan could help us provide better answers.
EDIT: If both inputs are sorted (they should be, with the covering index), you can try a MERGE JOIN:
SELECT EGM.Name, BioEntity.BioEntityId INTO AUX
FROM EGM
INNER JOIN BioEntity
ON EGM.name LIKE BioEntity.Name AND EGM.TypeId = BioEntity.TypeId
OPTION (MERGE JOIN)
Answer by Larry Lustig
I'm not an SQL tuning expert, but joining hundreds of millions of rows on a VARCHAR field doesn't sound like a good idea in any database system I know.
You could try adding an integer column to each table and computing a hash of the NAME field; that should cut the possible matches down to a reasonable number before the engine has to look at the actual VARCHAR data.
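A minimal sketch of that idea in T-SQL, assuming CHECKSUM is an acceptable hash function (the NameHash column name is hypothetical; CHECKSUM can collide, so a final comparison on Name is still required for exact matches):

-- Persisted computed columns holding a hash of Name (CHECKSUM is deterministic, so it can be persisted and indexed)
ALTER TABLE EGM ADD NameHash AS CHECKSUM(Name) PERSISTED;
ALTER TABLE BioEntity ADD NameHash AS CHECKSUM(Name) PERSISTED;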
Answer by RickNZ
First, 100M-row joins are not at all unreasonable or uncommon.
However, I suspect the cause of the poor performance you're seeing may be related to the INTO clause. With that, you are not only doing a join, you are also writing the results to a new table. Your observation about the log file growing so huge is basically confirmation of this.
One thing to try: remove the INTO and see how it performs. If the performance is reasonable, then to address the slow write you should make sure that your DB log file is on a separate physical volume from the data. If it isn't, the disk heads will thrash (lots of seeks) as they read the data and write the log, and your perf will collapse (possibly to as little as 1/40th to 1/60th of what it could be otherwise).
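A hedged way to test the join without the INTO is to aggregate instead of materializing, for example:

-- Times the join alone; nothing is written, so log activity stays minimal
SELECT COUNT(*)
FROM EGM INNER JOIN BioEntity
ON EGM.name LIKE BioEntity.Name AND EGM.TypeId = BioEntity.TypeId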
Answer by Arvo
Maybe a bit off-topic, but: "I've noticed that the computer occasionally freezes for a couple of seconds, roughly every 30 seconds (give or take)."
This behavior is characteristic of a cheap RAID5 array (or maybe of a single disk) while copying gigabytes of information - and your query mostly copies data.
More about the problem itself: can't you partition your query into smaller blocks, like names starting with A, B, etc., or IDs in specific ranges? This could substantially decrease transactional/locking overhead.
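A sketch of such batching, assuming the AUX target table already exists and the join has been switched to plain equality (both assumptions, not stated in the answer):

-- Process one slice of names per run; repeat with N'B%', N'C%', and so on
INSERT INTO AUX (Name, BioEntityId)
SELECT EGM.Name, BioEntity.BioEntityId
FROM EGM INNER JOIN BioEntity
ON EGM.Name = BioEntity.Name AND EGM.TypeId = BioEntity.TypeId
WHERE EGM.Name LIKE N'A%';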
Answer by Jim B
I'd try removing the 'LIKE' operator, as you don't seem to be doing any wildcard matching.
Answer by Cade Roux
As recommended, I would hash the name to make the join more reasonable. If at all possible, I would also strongly consider investigating whether the id can be assigned during the batch import via a lookup, since this would eliminate the need to do the join later (and potentially having to repeatedly perform such an inefficient join).
I see you have this index on TypeId - this would help immensely if it is at all selective. In addition, add the column with the hash of the name to the same index:
SELECT EGM.Name
,BioEntity.BioEntityId
INTO AUX
FROM EGM
INNER JOIN BioEntity
ON EGM.TypeId = BioEntity.TypeId -- Hopefully a good index
AND EGM.NameHash = BioEntity.NameHash -- Should be a very selective index now
AND EGM.name LIKE BioEntity.Name
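A sketch of the supporting indexes this query assumes (NameHash as in the hashing suggestion above; the index names are hypothetical):

CREATE NONCLUSTERED INDEX IX_EGM_TypeId_NameHash ON EGM (TypeId, NameHash);
CREATE NONCLUSTERED INDEX IX_BioEntity_TypeId_NameHash ON BioEntity (TypeId, NameHash);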
Answer by Wil P
Another suggestion I might offer is to tune your query against a subset of the data instead of processing all 100M rows at once. That way you don't have to spend so much time waiting to see whether your query is going to finish. Then you could consider inspecting the query execution plan, which may also provide some insight into the problem at hand.
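One hedged way to carve out such a subset (the sample size and table name are arbitrary):

-- Tune against a small slice first; drop AUX_SAMPLE when done
SELECT TOP (100000) EGM.Name, BioEntity.BioEntityId INTO AUX_SAMPLE
FROM EGM INNER JOIN BioEntity
ON EGM.name LIKE BioEntity.Name AND EGM.TypeId = BioEntity.TypeId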
Answer by Dave Swersky
100 million records is HUGE. I'd say that to work with a database that large, you'd require a dedicated test server. Using the same machine for other work while performing queries like that is not practical.
Your hardware is fairly capable, but for joins that big to perform decently you'd need even more power. A quad-core system with 8 GB of RAM would be a good start. Beyond that, you have to make sure your indexes are set up just right.
Answer by DForck42
Do you have any primary keys or indexes? Can you select it in stages, i.e. WHERE name LIKE 'A%', then WHERE name LIKE 'B%', and so on?
Answer by karlgrz
"I had manually set up some indexes; both EGM and BioEntity had a non-clustered covering index containing TypeId and Name. However, the query ran for five days and did not end either, so I tried running the Database Tuning Advisor to get the thing to work. It suggested deleting my older indexes and instead creating statistics and two clustered indexes (one on each table, containing just the TypeId, which I find rather odd - or just plain dumb - but I gave it a go anyway)."
You said you made a clustered index on TypeId in both tables, although it appears you already have a primary key on each table (BioEntityId and EGMId, respectively). You do not want TypeId to be the clustered index on those tables. You want BioEntityId and EGMId to be clustered (that will physically sort your data on disk in the order of the clustered index). You want non-clustered indexes on the foreign keys you will be using for lookups, i.e. TypeId. Try making the primary keys clustered and adding a non-clustered index on both tables that contains ONLY TypeId.
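In T-SQL, that layout might look like the following sketch (constraint and index names are hypothetical, and it assumes no clustered primary keys exist yet):

ALTER TABLE BioEntity ADD CONSTRAINT PK_BioEntity PRIMARY KEY CLUSTERED (BioEntityId);
ALTER TABLE EGM ADD CONSTRAINT PK_EGM PRIMARY KEY CLUSTERED (EMGId);
CREATE NONCLUSTERED INDEX IX_BioEntity_TypeId ON BioEntity (TypeId);
CREATE NONCLUSTERED INDEX IX_EGM_TypeId ON EGM (TypeId);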
In our environment we have tables that are roughly 10-20 million records apiece. We do a lot of queries similar to yours, where we combine two datasets on one or two columns. Adding an index for each foreign key should help out a lot with your performance.
Please keep in mind that with 100 million records, those indexes are going to require a lot of disk space. However, it seems like performance is key here, so it should be worth it.
K. Scott has a pretty good article here which explains some of these issues in more depth.