MySQL 在用数据填充表之前还是在数据到位之后创建索引更好?

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/3688731/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-08-31 17:08:11  来源:igfitidea点击:

Is it better to create an index before filling a table with data, or after the data is in place?

mysqldatabaseindexing

提问by Drew Stephens

I have a table of about 100M rows that I am going to copy to alter, adding an index. I'm not so concerned with the time it takes to create the new table, but will the created index be more efficient if I alter the table before inserting any data or insert the data first and then add the index?

我有一个大约 100M 行的表,我将复制它以进行更改,并添加一个索引。我不太关心创建新表所需的时间,但是如果我在插入任何数据之前更改表或先插入数据然后添加索引,创建的索引是否会更有效?

回答by valodzka

Creating index after data insert is more efficient way (it even often recomended to drop index before batch import and after import recreate it).

在数据插入后创建索引是更有效的方式(甚至经常建议在批量导入之前删除索引,导入后重新创建它)。

Syntetic example (PostgreSQL 9.1, slow development machine, one million rows):

Syntetic 示例(PostgreSQL 9.1,缓慢的开发机器,一百万行):

CREATE TABLE test1(id serial, x integer);
INSERT INTO test1(id, x) SELECT x.id, x.id*100 FROM generate_series(1,1000000) AS x(id);
-- Time: 7816.561 ms
CREATE INDEX test1_x ON test1 (x);
-- Time: 4183.614 ms

Insert and then create index - about 12 sec

插入然后创建索引 - 大约 12 秒

CREATE TABLE test2(id serial, x integer);
CREATE INDEX test2_x ON test2 (x);
-- Time: 2.315 ms
INSERT INTO test2(id, x) SELECT x.id, x.id*100 FROM generate_series(1,1000000) AS x(id);
-- Time: 25399.460 ms

Create index and then insert - about 25.5 sec (more than two times slower)

创建索引然后插入 - 大约 25.5 秒(慢两倍以上)

回答by Mark Wilkins

It is probably better to create the index after the rows are added. Not only will it be faster, but the tree balancing will probably be better.

在添加行之后创建索引可能更好。它不仅会更快,而且树平衡可能会更好。

Edit"balancing" probably is not the best choice of terms here. In the case of a b-tree, it is balanced by definition. But that does not mean that the b-tree has the optimal layout. Child node distribution within parents can be uneven (leading to more cost in future updates) and the tree depth can end up being deeper than necessary if the balancing is not performed carefully during updates. If the index is created after the rows are added, it is will more likely have a better distribution. In addition, index pages on disk may have less fragmentation after the index is built. A bit more information here

在这里编辑“平衡”可能不是最好的选择。在 b 树的情况下,它根据定义是平衡的。但这并不意味着 b-tree 具有最佳布局。父节点中的子节点分布可能不均匀(导致未来更新的成本更高),如果在更新过程中不仔细执行平衡,树的深度最终可能会比必要的更深。如果在添加行之后创建索引,它更有可能具有更好的分布。此外,在建立索引后,磁盘上的索引页可能会有较少的碎片。 这里有更多信息

回答by Svisstack

This doesn't matter on this problem because:

这在这个问题上无关紧要,因为:

  1. If you add data first to the table and after it you add index. Your index generating time will be O(n*log(N))longer (where nis a rows added). Because tree gerating time is O(N*log(N))then if you split this into old data and new data you get O((X+n)*log(N))this can be simply converted to O(X*log(N) + n*log(N))and in this format you can simply see what you will wait additional.
  2. If you add index and after it put data. Every row (you have nnew rows) you get longer insert additional time O(log(N))needed to regenerate structure of the tree after adding new element into it (index column from new row, because index already exists and new row was added then index must be regenerated to balanced structure, this cost O(log(P))where Pis a index power [elements in index]). You have nnew rows then finally you have n * O(log(N))then O(n*log(N))summary additional time.
  1. 如果先将数据添加到表中,然后再添加索引。您的索引生成时间会O(n*log(N))更长(n添加行的位置)。因为树生成时间就是O(N*log(N))如果你把它分成旧数据和新数据,你O((X+n)*log(N))可以简单地将它转换成O(X*log(N) + n*log(N))这种格式,你可以简单地看到你将额外等待什么。
  2. 如果添加索引并在它之后放置数据。每一行(你有n新行)你会得到更长的插入额外的时间O(log(N)),在添加新元素后重新生成树的结构(来自新行的索引列,因为索引已经存在并且添加了新行然后索引必须重新生成到平衡结构,此成本O(log(P))其中P是索引幂[索引中的元素])。你有n新行那么最后你有n * O(log(N))那么O(n*log(N))总结更多的时间。

回答by Mike Cross

Indexes created after are much faster in most cases. Case in point: 20 million rows with full text on varchar(255) - (Business Name) Index in place whilst importing rows - a match against taking up to 20 seconds in worst cases. Drop index and re-create - match against taking less than 1 second every time

在大多数情况下,之后创建的索引要快得多。举个例子:在 varchar(255) 上有全文的 2000 万行 - (商业名称)索引在导入行时就位 - 在最坏的情况下最多需要 20 秒。删除索引并重新创建 - 与每次少于 1 秒的时间相匹配

回答by GrandmasterB

I'm not sure it'll really matter for index efficiency's sake, since in both cases you are inserting new data into the index. The server wouldnt know how unbalanced an index would be until after its built, basically. Speed wise, obviously, do the inserts without the index.

我不确定这对索引效率是否真的很重要,因为在这两种情况下,您都在向索引中插入新数据。基本上,服务器不会知道索引的不平衡程度,直到它建立之后。显然,速度明智,在没有索引的情况下进行插入。