MySQL count performance on very big tables
Original question: http://stackoverflow.com/questions/10976328/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Asked by hotips
I have an InnoDB table with more than 100 million rows.
I need to know whether there are more than 5000 rows where the foreign key = 1. I don't need the exact number.
I ran some tests:
SELECT COUNT(*) FROM table WHERE fk = 1
=> 16 seconds
SELECT COUNT(*) FROM table WHERE fk = 1 LIMIT 5000
=> 16 seconds
SELECT primary FROM table WHERE fk = 1
=> 0.6 seconds
Fetching the rows means more network and processing time on my side, but the COUNT queries add an overhead of 15.4 seconds!
Do you have a better idea?
Thanks
Edit: [Added OP's relevant comments]
I tried SELECT SQL_NO_CACHE COUNT(fk) FROM table WHERE fk = 1, but it took 25 seconds.
MySQL was tuned for InnoDB with MySQLTuner.
CREATE TABLE table ( pk bigint(20) NOT NULL AUTO_INCREMENT,
fk tinyint(3) unsigned DEFAULT '0',
PRIMARY KEY (pk), KEY idx_fk (fk) USING BTREE )
ENGINE=InnoDB AUTO_INCREMENT=100380914 DEFAULT CHARSET=latin1
DB settings:
'have_innodb', 'YES'
'ignore_builtin_innodb', 'OFF'
'innodb_adaptive_hash_index', 'ON'
'innodb_additional_mem_pool_size', '20971520'
'innodb_autoextend_increment', '8'
'innodb_autoinc_lock_mode', '1'
'innodb_buffer_pool_size', '25769803776'
'innodb_checksums', 'ON'
'innodb_commit_concurrency', '0'
'innodb_concurrency_tickets', '500'
'innodb_data_file_path', 'ibdata1:10M:autoextend'
'innodb_data_home_dir', ''
'innodb_doublewrite', 'ON'
'innodb_fast_shutdown', '1'
'innodb_file_io_threads', '4'
'innodb_file_per_table', 'OFF'
'innodb_flush_log_at_trx_commit', '1'
'innodb_flush_method', ''
'innodb_force_recovery', '0'
'innodb_lock_wait_timeout', '50'
'innodb_locks_unsafe_for_binlog', 'OFF'
'innodb_log_buffer_size', '8388608'
'innodb_log_file_size', '26214400'
'innodb_log_files_in_group', '2'
'innodb_log_group_home_dir', './'
'innodb_max_dirty_pages_pct', '90'
'innodb_max_purge_lag', '0'
'innodb_mirrored_log_groups', '1'
'innodb_open_files', '300'
'innodb_rollback_on_timeout', 'OFF'
'innodb_stats_on_metadata', 'ON'
'innodb_support_xa', 'ON'
'innodb_sync_spin_loops', '20'
'innodb_table_locks', 'ON'
'innodb_thread_concurrency', '8'
'innodb_thread_sleep_delay', '10000'
'innodb_use_legacy_cardinality_algorithm', 'ON'
Update '15: I have used the same method up to now, with 600 million rows and 640,000 new rows per day. It's still working fine.
Accepted answer by hotips
In the end, the fastest approach was to query the first X rows using C# and count the number of rows returned.
My application processes the data in batches. The amount of time between two batches depends on the number of rows that need to be processed.
SELECT pk FROM table WHERE fk = 1 LIMIT X
I got the result in 0.9 seconds.
Thanks all for your ideas!
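The answer used C#, which is not shown in the thread. As a rough sketch of the same idea, here is a Python version that fetches at most X primary keys and counts them on the client. It runs against an in-memory SQLite database as a stand-in for MySQL, so the timings quoted above do not carry over; the table name `t` and the helper `has_at_least` are made up for illustration.

```python
import sqlite3

def has_at_least(conn, fk_value, threshold):
    """Fetch at most `threshold` primary keys and count them client-side."""
    rows = conn.execute(
        "SELECT pk FROM t WHERE fk = ? LIMIT ?", (fk_value, threshold)
    ).fetchall()
    return len(rows) >= threshold

# Hypothetical test data: 6000 rows with fk = 1, 10 rows with fk = 2.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (pk INTEGER PRIMARY KEY, fk INTEGER)")
conn.execute("CREATE INDEX idx_fk ON t (fk)")
conn.executemany("INSERT INTO t (fk) VALUES (?)", [(1,)] * 6000 + [(2,)] * 10)

print(has_at_least(conn, 1, 5000))  # True: the LIMIT was reached
print(has_at_least(conn, 2, 5000))  # False: only 10 rows came back
```

To answer the original "more than 5000" question, pass 5001 as the threshold.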
Answer by Salman A
You don't seem interested in the actual count so give this a try:
SELECT 1 FROM table WHERE fk = 1 LIMIT 5000, 1
If a row is returned, you have more than 5000 records (the query skips 5000 matching rows and returns the next one). I presume the fk column is indexed.
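A small sketch of the offset trick, assuming an in-memory SQLite database as a stand-in for MySQL (SQLite accepts the same `LIMIT offset, count` form). The table name `t` and the helper `more_than` are hypothetical. It shows the boundary behaviour: exactly 5000 matching rows returns nothing, 5001 returns a row.

```python
import sqlite3

def more_than(conn, fk_value, n):
    """True if more than n rows match: skip n rows, then try to read one more."""
    row = conn.execute(
        "SELECT 1 FROM t WHERE fk = ? LIMIT ?, 1", (fk_value, n)
    ).fetchone()
    return row is not None

# Hypothetical test data: 5001 rows with fk = 1, exactly 5000 rows with fk = 2.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (pk INTEGER PRIMARY KEY, fk INTEGER)")
conn.execute("CREATE INDEX idx_fk ON t (fk)")
conn.executemany("INSERT INTO t (fk) VALUES (?)", [(1,)] * 5001 + [(2,)] * 5000)

print(more_than(conn, 1, 5000))  # True: a 5001st row exists
print(more_than(conn, 2, 5000))  # False: nothing left after skipping 5000
```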
Answer by scriptin
A counter table or another caching mechanism is the solution:
InnoDB does not keep an internal count of rows in a table because concurrent transactions might “see” different numbers of rows at the same time. To process a SELECT COUNT(*) FROM t statement, InnoDB scans an index of the table, which takes some time if the index is not entirely in the buffer pool. If your table does not change often, using the MySQL query cache is a good solution. To get a fast count, you have to use a counter table you create yourself and let your application update it according to the inserts and deletes it does. If an approximate row count is sufficient, SHOW TABLE STATUS can be used. See Section 14.3.14.1, “InnoDB Performance Tuning Tips”.
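One way the counter table described in the quote might look, sketched in Python against an in-memory SQLite database as a stand-in for MySQL. The table `fk_counts` and the helpers `insert_row`, `delete_row` and `fast_count` are hypothetical names; the point is only that the application changes the row and its counter inside the same transaction.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (pk INTEGER PRIMARY KEY, fk INTEGER)")
conn.execute("CREATE TABLE fk_counts (fk INTEGER PRIMARY KEY, n INTEGER NOT NULL)")

def insert_row(conn, fk_value):
    with conn:  # one transaction: the row and its counter change together
        conn.execute("INSERT INTO t (fk) VALUES (?)", (fk_value,))
        cur = conn.execute(
            "UPDATE fk_counts SET n = n + 1 WHERE fk = ?", (fk_value,)
        )
        if cur.rowcount == 0:  # first row for this fk value
            conn.execute(
                "INSERT INTO fk_counts (fk, n) VALUES (?, 1)", (fk_value,)
            )

def delete_row(conn, pk):
    with conn:
        row = conn.execute("SELECT fk FROM t WHERE pk = ?", (pk,)).fetchone()
        if row is None:
            return
        conn.execute("DELETE FROM t WHERE pk = ?", (pk,))
        conn.execute("UPDATE fk_counts SET n = n - 1 WHERE fk = ?", (row[0],))

def fast_count(conn, fk_value):
    """Read the maintained counter instead of scanning the index."""
    row = conn.execute(
        "SELECT n FROM fk_counts WHERE fk = ?", (fk_value,)
    ).fetchone()
    return row[0] if row else 0

for value in (1, 1, 1, 2):
    insert_row(conn, value)
delete_row(conn, 1)  # removes the first fk = 1 row

print(fast_count(conn, 1), fast_count(conn, 2))  # 2 1
```

The trade-off is extra write work on every insert and delete, and any code path that bypasses these helpers will let the counter drift.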
Answer by Rick James
I gotta add another Answer -- I have many corrections/additions to the comments and Answers so far.
For MyISAM, SELECT COUNT(*) without WHERE is dead-reckoned -- very fast. All other situations (including the InnoDB table in the Question) must count through either the data's BTree or an index's BTree to get the answer. So we need to see how much to count through.
InnoDB caches data and index blocks (16KB each). But when the table's data or index BTree is bigger than innodb_buffer_pool_size, you are guaranteed to hit the disk. Hitting the disk is almost always the slowest part of any SQL.
The Query Cache, when involved, usually results in query times of about 1 millisecond; this does not seem to be an issue with any of the timings quoted. So I won't dwell on it.
But... Running the same query twice in a row will often exhibit:
- First run: 10 seconds
- Second run: 1 second
This is symptomatic of the first run having to fetch most of the blocks from disk, while the second found it all in RAM (the buffer_pool). I suspect that some of the timings listed are bogus because of not realizing this caching issue. (16 sec vs 0.6 sec may be explained by this.)
I will harp on "disk hits" or "blocks needed to be touched" as the real metric of which SQL is faster.
COUNT(x) checks x for IS NOT NULL before tallying. This adds a tiny amount of processing, but does not change the number of disk hits.
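To make the NULL check concrete, here is a tiny sketch using an in-memory SQLite database as a stand-in for MySQL (the standard COUNT semantics are the same in both). The table name `t` and its three sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (pk INTEGER PRIMARY KEY, fk INTEGER)")
# Three rows, one of which has a NULL fk.
conn.executemany("INSERT INTO t (fk) VALUES (?)", [(1,), (1,), (None,)])

count_star, count_fk = conn.execute(
    "SELECT COUNT(*), COUNT(fk) FROM t"
).fetchone()

print(count_star, count_fk)  # 3 2: COUNT(fk) skips the NULL row
```

In the question's query the WHERE fk = 1 filter already excludes NULLs, so COUNT(fk) and COUNT(*) return the same number there.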
The proffered table has a PK and a second column. I wonder if that is the real table? It makes a difference --
- If the Optimizer decides to read the data -- that is, scan in PRIMARY KEY order -- it will be reading the data BTree, which is usually (but not in this lame example) much wider than secondary index BTrees.
- If the Optimizer decides to read a secondary index (and does not need to do a sort), there will be fewer blocks to touch. Hence, faster.
Comments on the original queries:
SELECT COUNT(*) FROM table WHERE fk = 1 => 16 seconds
-- INDEX(fk) is optimal, but see below
SELECT COUNT(*) FROM table WHERE fk = 1 LIMIT 5000 => 16 seconds
-- the LIMIT does nothing, since there is only one row in the result
SELECT primary FROM table WHERE fk = 1 => 0.6 seconds
-- Again INDEX(fk), but see below
WHERE fk = 1 begs for INDEX(fk, ...), preferably just INDEX(fk). Note that in InnoDB, each secondary index contains a copy of the pk. That is, INDEX(fk) is effectively INDEX(fk, primary). Hence, the 3rd query can use that as "covering" and not need to touch the data.
If the table is truly just the two columns, then the secondary index BTree will probably be fatter than the data BTree. But in realistic tables, the secondary index will be smaller. Hence an index scan will be faster (fewer blocks to touch) than a table scan.
The third query is also delivering a large resultset; this could cause the query to take a long time -- but it won't be included in the quoted "time"; it is network time, not query time.
innodb_buffer_pool_size = 25,769,803,776
I would guess that the table and its secondary index (from the FK) are each about 3-4GB. So, any timing might first have to load a lot of stuff. Then a second run would be entirely cached. (Of course, I don't know how many rows have fk=1; presumably fewer than all the rows?)
But... At 600M rows, the table and its index are each approaching the 25GB buffer_pool. So, the day may come soon that it becomes I/O bound -- this will make you wish to get back to 16 (or 25) seconds; yet you won't be able to. We can then talk about alternatives to doing the COUNT.
SELECT 1 FROM tbl WHERE fk = 1 LIMIT 5000,1
-- Let's analyze this. It will scan the index, but it will stop after 5000 rows. If all you need is "more than 5K", that is the best way to get it. It will be consistently fast (touching only a dozen blocks), regardless of total number of rows in the table. (It is still subject to buffer_pool_size and cache characteristics of the system. But a dozen blocks takes much less than a second, even with a cold cache.)
MariaDB's LIMIT ROWS_EXAMINED may be worth looking into. Without that, you could do
SELECT COUNT(*) AS count_if_less_than_5K
FROM ( SELECT 1 FROM tbl WHERE fk = 1 LIMIT 5000 ) AS t;  -- MySQL requires an alias on a derived table
It may be faster than delivering the rows to the client; it will have to collect the rows internally in a tmp table, but deliver only the COUNT.
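A minimal sketch of the capped count, run against an in-memory SQLite database as a stand-in for MySQL. The helper `capped_count` and the table name `t` are hypothetical. The inner LIMIT bounds how many index entries are read, and only the single COUNT value reaches the client.

```python
import sqlite3

def capped_count(conn, fk_value, cap):
    """Count matching rows, but never read more than `cap` of them."""
    row = conn.execute(
        "SELECT COUNT(*) FROM (SELECT 1 FROM t WHERE fk = ? LIMIT ?) AS sub",
        (fk_value, cap),
    ).fetchone()
    return row[0]

# Hypothetical test data: 6000 rows with fk = 1, 10 rows with fk = 2.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (pk INTEGER PRIMARY KEY, fk INTEGER)")
conn.execute("CREATE INDEX idx_fk ON t (fk)")
conn.executemany("INSERT INTO t (fk) VALUES (?)", [(1,)] * 6000 + [(2,)] * 10)

print(capped_count(conn, 1, 5001))  # 5001: the cap was hit, so more than 5000 rows exist
print(capped_count(conn, 2, 5001))  # 10: the exact count, because it is below the cap
```

A cap of 5001 rather than 5000 is what distinguishes "exactly 5000" from "more than 5000".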
A side note: 640K rows inserted per day -- this approaches the limit for single-row INSERTs in MySQL with your current settings on a HDD (not SSD). If you need to discuss the potential disaster, open another Question.
Bottom line:
- Be sure to avoid the Query cache (by using SQL_NO_CACHE or turning the QC off).
- Run any timing query twice; use the second time.
- Understand the structure and size of the BTree(s) involved.
- Don't use COUNT(x) unless you need the null check.
- Do not use PHP's mysql_* interface; switch to mysqli_* or PDO.
Answer by nischayn22
If you are using PHP, you could call mysql_num_rows on the result you got from SELECT primary FROM table WHERE fk = 1 (=> 0.6 seconds); I think that will be efficient.
But it depends on what server-side language you are using.
Answer by Gerardo Lima
If you're not interested in the exact number of rows and just want to test the COUNT against some value, you can use the standard script below:
SELECT 'X'
FROM mytable
WHERE myfield='A'
HAVING COUNT(*) >5
This will return a single row or no row at all, depending on whether the condition is met.
This script is ANSI compliant and can be fully run without evaluating the complete value of COUNT(*). If MySQL implemented optimization to stop evaluating rows after some condition is met (I really hope it does), then you'll get a performance improvement. Unfortunately I can't test this behavior myself because I don't have a big MySQL database available. If you do this test, please share the result here :)