MySQL - 创建索引需要多长时间?
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/2483205/
Warning: these are provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverFlow
MySQL - how long to create an index?
提问by xnx
Can anyone tell me how adding a key scales in MySQL? I have 500,000,000 rows in a database, trans, with columns i (INT UNSIGNED), j (INT UNSIGNED), nu (DOUBLE), A (DOUBLE). I try to index a column, e.g.
谁能告诉我在 MySQL 中添加键的耗时是如何随数据量扩展的?我的数据库表 trans 中有 500,000,000 行,列为 i (INT UNSIGNED)、j (INT UNSIGNED)、nu (DOUBLE)、A (DOUBLE)。我尝试为其中一列建索引,例如
ALTER TABLE trans ADD KEY idx_A (A);
and I wait. For a table of 14,000,000 rows it took about 2 minutes to execute on my MacBook Pro, but for the whole half a billion, it's taking 15hrs and counting. Am I doing something wrong, or am I just being naive about how indexing a database scales with the number of rows?
然后就是等待。对于一个 14,000,000 行的表,在我的 MacBook Pro 上执行大约需要 2 分钟,但对于整整 5 亿行,已经跑了 15 个小时还在继续。是我哪里做错了,还是我对数据库建索引的耗时如何随行数增长想得太天真了?
采纳答案by Jonathan Leffler
There are a couple of factors to consider:
有几个因素需要考虑:
- Sorting is an N.log(N) operation.
- The sort for 14M rows might well fit in main memory; the sort with 500M rows probably doesn't, so the sort spills to disk, which slows things up enormously.
- 排序是一个 N.log(N) 操作。
- 14M 行的排序很可能完全装得进主内存;500M 行的排序多半装不下,于是排序会溢出到磁盘,这会大大拖慢速度。
Since the factor is about 30 in size, the nominal sort time for the big data set would be of the order of 50 times as long - under two hours. However, you need 8 bytes per data value and about another 8 bytes of overhead (that's a guess - tune to mySQL if you know more about what it stores in an index). So, 14M × 16 ≈ 220 MB main memory. But 500M × 16 ≈ 8 GB main memory. Unless your machine has that much memory to spare (and MySQL is configured to use it), then the big sort is spilling to disk and that accounts for a lot of the rest of the time.
由于数据量相差大约 30 倍,大数据集的名义排序时间大约是前者的 50 倍,即不到两小时。但是,每个数据值需要 8 个字节,另外还有大约 8 个字节的开销(这只是猜测,如果您更清楚 MySQL 在索引中存储了什么,可以据此调整)。因此,14M × 16 ≈ 220 MB 主内存,而 500M × 16 ≈ 8 GB 主内存。除非您的机器有那么多空闲内存(并且 MySQL 被配置为使用它),否则这个大排序就会溢出到磁盘,其余的大部分时间就耗在这里。
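If the table uses MyISAM (the usual default at the time of this question), one practical follow-up to the memory estimate above is to check how much memory MySQL is actually allowed to use for the index sort, and raise it for the session before running the ALTER. A minimal sketch assuming a MyISAM table; the variable names are real MySQL settings, but the 2 GB value is only an illustration and should be sized to your free RAM:

-- How much memory the MyISAM index sort may use, and how large its temporary sort file may grow
SHOW VARIABLES LIKE 'myisam_sort_buffer_size';
SHOW VARIABLES LIKE 'myisam_max_sort_file_size';

-- Raise the sort buffer for this session only, then build the index
SET SESSION myisam_sort_buffer_size = 2 * 1024 * 1024 * 1024;
ALTER TABLE trans ADD KEY idx_A (A);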
回答by Andy
Firstly, your table definition could make a big difference here. If you don't need NULL values in your columns, define them NOT NULL. This will save space in the index, and presumably time while creating it.
首先,您的表定义在这里可能会产生很大的影响。如果列中不需要 NULL 值,请把它们定义为 NOT NULL。这会节省索引中的空间,大概也能节省创建索引的时间。
CREATE TABLE x (
i INTEGER UNSIGNED NOT NULL,
j INTEGER UNSIGNED NOT NULL,
nu DOUBLE NOT NULL,
A DOUBLE NOT NULL
);
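If the table already exists, the same NOT NULL advice can be applied with ALTER TABLE instead of recreating it. A minimal sketch for the trans table described in the question, assuming none of the columns actually contain NULLs; note that on a table this size this ALTER rewrites the whole table and is itself slow:

-- Tighten the existing column definitions in place
ALTER TABLE trans
    MODIFY i  INT UNSIGNED NOT NULL,
    MODIFY j  INT UNSIGNED NOT NULL,
    MODIFY nu DOUBLE NOT NULL,
    MODIFY A  DOUBLE NOT NULL;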
As for the time taken to create the indexes, this requires a table scan and will show up as REPAIR BY SORTING. It should be quicker in your case (i.e. massive data set) to create a new table with the required indexes and insert the data into it, as this will avoid the REPAIR BY SORTING operation, because the indexes are built sequentially on the insert. There is a similar concept explained in this article.
至于创建索引所花费的时间,这需要一次全表扫描,并会显示为 REPAIR BY SORTING。在您的情况下(即海量数据集),先创建一个带有所需索引的新表,再把数据插入进去应该会更快,因为索引会随着插入按顺序构建,从而避免了 REPAIR BY SORTING 操作。这篇文章中解释了一个类似的概念。
CREATE DATABASE trans_clone;
CREATE TABLE trans_clone.trans LIKE originalDB.trans;
ALTER TABLE trans_clone.trans ADD KEY idx_A (A);
Then script the insert into chunks (as per the article), or dump the data using MYSQLDUMP:
然后按文章所述把插入操作拆分成多块来编写脚本,或者用 MYSQLDUMP 转储数据:
mysqldump originalDB trans --extended-insert --skip-add-drop-table --no-create-db --no-create-info > originalDB.trans.sql
mysql trans_clone < originalDB.trans.sql
This will insert the data, but will not require an index rebuild (the index is built as each row is inserted) and should complete much faster.
这将插入数据,但不需要重建索引(索引是在插入每一行时构建的)并且应该更快地完成。
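If you would rather stay inside MySQL instead of going through a dump file, the "insert in chunks" idea can be sketched as a series of INSERT ... SELECT statements. The column i and the 10,000,000-wide ranges below are only an illustration; pick whatever column and ranges split your data into pieces of manageable size:

-- Each chunk is inserted separately, and idx_A grows incrementally as rows arrive
INSERT INTO trans_clone.trans (i, j, nu, A)
SELECT i, j, nu, A FROM originalDB.trans WHERE i >= 0 AND i < 10000000;

INSERT INTO trans_clone.trans (i, j, nu, A)
SELECT i, j, nu, A FROM originalDB.trans WHERE i >= 10000000 AND i < 20000000;

-- ... repeat until the full range of i is covered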
回答by Bjoern
From my experience: if the hardware can cope with it, indexing large tables with MySQL usually scales pretty linearly. I have tried it with tables of about 100,000,000 rows so far, but not on a notebook - mainly on strong servers.
根据我的经验:如果硬件可以应付,用 MySQL 索引大表通常可以线性扩展。到目前为止,我已经用大约 100,000,000 行的表尝试过它,但不是在笔记本上 - 主要是在强大的服务器上。
I guess it depends mainly on hardware factors, the kind of table engine you're using (MyISAM, InnoDB, or whatever) and, to a lesser extent, whether the table is otherwise in use at the same time. When I was doing it, disk usage usually jumped sky high, unlike CPU usage. Not sure about the hard disks of the MacBook, but I guess they aren't the fastest around.
我想这主要取决于硬件因素、您使用的存储引擎类型(MyISAM、InnoDB 等),其次还取决于在此期间表是否还有其他访问。在我做这件事时,通常是磁盘使用率飙升,而 CPU 使用率不高。不确定 MacBook 的硬盘如何,但我猜它们不算最快的。
If you're having MyISAM tables, maybe have a closer look at the index files in the table directory and see how it changes over the time.
如果您有 MyISAM 表,不妨仔细查看表目录中的索引文件,看看它是如何随时间变化的。
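An engine-agnostic way to watch the same thing from inside MySQL is to poll information_schema while the ALTER is running. A small sketch; mydb is a placeholder for whatever schema holds the trans table, and the reported figures are approximate for some engines:

-- Approximate on-disk size of data and indexes, in bytes; re-run periodically to watch the index grow
SELECT table_name, data_length, index_length
FROM information_schema.tables
WHERE table_schema = 'mydb' AND table_name = 'trans';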
回答by Laurent PELE
So theoretically, if the sorting step is an N.log(N) operation, partitioning your big table would save time on the operation.
所以理论上,如果排序步骤是 N.log(N) 操作,对大表进行分区会节省这一步操作的时间。
About 30 % gain for a table of 500 000 000 rows partitioned into 100 equal files: because 500 000 000 × log(500 000 000) = 4 349 485 002 and 100 × (500 000 000/100 × log(500 000 000/100)) = 3 349 485 002 (using base-10 logarithms).
对于一张 500 000 000 行、分成 100 个相等分区的表,大约可以获得 30% 的收益:因为 500 000 000 × log(500 000 000) = 4 349 485 002,而 100 × (500 000 000/100 × log(500 000 000/100)) = 3 349 485 002(使用以 10 为底的对数)。
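For what it's worth, a simple way to split the table into 100 roughly equal pieces in MySQL 5.1+ is HASH partitioning; with MyISAM each partition is stored and indexed separately, so each index sort works on a much smaller chunk. A minimal sketch only: the choice of i and of 100 partitions is illustrative, the repartitioning ALTER itself rewrites the whole table, and partitioned tables have their own restrictions (for example, any unique key must include the partitioning column):

-- Split the table into 100 hash partitions, then build the index; each partition gets its own (smaller) index sort
ALTER TABLE trans
    PARTITION BY HASH(i)
    PARTITIONS 100;

ALTER TABLE trans ADD KEY idx_A (A);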