How to speed up MySQL "select count(*)" with "group by" and "where"?

Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same CC BY-SA terms and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/1031312/

Date: 2020-08-31 13:34:10  Source: igfitidea

How to speed up "select count(*)" with "group by" and "where"?

Tags: mysql, performance, indexing, count, group-by

Asked by ZA.

How can I speed up select count(*) with group by?
It's too slow, and it is used very frequently.
I'm having big trouble using select count(*) and group by with a table that has more than 3,000,000 rows.

select object_title, count(*) as hot_num
from relations
where relation_title = 'XXXX'
group by object_title

relation_title and object_title are varchar columns. The condition where relation_title='XXXX' returns more than 1,000,000 rows, which means the index on object_title does not work well.

Answered by Justin Grant

Here are several things I'd try, in order of increasing difficulty:

(easier) - Make sure you have the right covering index

CREATE INDEX ix_temp ON relations (relation_title, object_title);

This should maximize performance given your existing schema, since (unless your version of MySQL's optimizer is really dumb!) it will minimize the number of I/Os needed to satisfy the query. Unlike an index in the reverse column order, which would have to be scanned in its entirety, this index also covers the query, so you won't have to touch the clustered index at all.
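To see the covering-index effect concretely, here is a runnable sketch using SQLite through Python's sqlite3 module (MySQL's EXPLAIN output differs, but the principle is identical; the table and column names are taken from the question, and the sample data is made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE relations (relation_title TEXT, object_title TEXT)")
conn.executemany(
    "INSERT INTO relations VALUES (?, ?)",
    [("XXXX", "apple"), ("XXXX", "apple"), ("XXXX", "pear"), ("YYYY", "plum")],
)
# Leading column matches the WHERE predicate; the second column is the
# GROUP BY key, so the index alone can answer the query ("covering").
conn.execute("CREATE INDEX ix_temp ON relations (relation_title, object_title)")

plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT object_title, COUNT(*) AS hot_num FROM relations "
    "WHERE relation_title = 'XXXX' GROUP BY object_title"
).fetchall()
# SQLite typically reports a "COVERING INDEX" search here, meaning the
# base table is never touched.
print(plan)
```

Run against a realistically sized table, the same query without the index has to scan every row; with it, only the matching slice of the index is read.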

(a little harder) - Make sure your varchar fields are as small as possible

One of the perf challenges with varchar indexes on MySQL is that, when processing a query, the full declared size of the field will be pulled into RAM. So if you have a varchar(256) but are only using 4 chars, you're still paying the 256-byte RAM usage while the query is being processed. Ouch! So if you can shrink your varchar limits easily, this should speed up your queries.

(harder) - Normalize

30% of your rows having a single string value is a clear cry for normalizing into another table so you're not duplicating strings millions of times. Consider normalizing into three tables and using integer IDs to join them.

In some cases, you can normalize under the covers and hide the normalization with views which match the name of the current table... then you only need to make your INSERT/UPDATE/DELETE queries aware of the normalization but can leave your SELECTs alone.
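The normalize-behind-a-view idea can be sketched like this (SQLite via Python's sqlite3; the lookup-table layout and names are illustrative assumptions, not the poster's actual schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Each distinct string is stored once, keyed by an integer ID.
CREATE TABLE titles (id INTEGER PRIMARY KEY, title TEXT UNIQUE);
CREATE TABLE relations_n (relation_title_id INTEGER, object_title_id INTEGER);
-- A view with the original table's name hides the normalization,
-- so existing SELECTs keep working unchanged.
CREATE VIEW relations AS
  SELECT r.title AS relation_title, o.title AS object_title
  FROM relations_n
  JOIN titles r ON r.id = relation_title_id
  JOIN titles o ON o.id = object_title_id;
""")
conn.executemany("INSERT INTO titles (title) VALUES (?)", [("XXXX",), ("apple",)])
conn.execute("INSERT INTO relations_n VALUES (1, 2)")
conn.execute("INSERT INTO relations_n VALUES (1, 2)")

# The original query text still works; only writes need to know about IDs.
rows = conn.execute(
    "SELECT object_title, COUNT(*) FROM relations "
    "WHERE relation_title = 'XXXX' GROUP BY object_title"
).fetchall()
print(rows)  # [('apple', 2)]
```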

(hardest) - Hash your string columns and index the hashes

If normalizing means changing too much code, but you can change your schema a little bit, you may want to consider creating 128-bit hashes for your string columns (using the MD5 function). In this case (unlike normalization) you don't have to change all your queries, only the INSERTs and some of the SELECTs. Anyway, you'll want to hash your string fields, and then create an index on the hashes, e.g.

CREATE INDEX ix_temp ON relations (relation_title_hash, object_title_hash);

Note that you'll need to play around with the SELECT to make sure you are doing the computation via the hash index and not pulling in the clustered index (required to resolve the actual text value of object_title in order to satisfy the query).
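A minimal sketch of the hash-column scheme, using Python's hashlib (SQLite has no built-in MD5, so the hash is computed in the application; in MySQL you would call the MD5() function in your INSERTs instead — the column names follow the answer's index example):

```python
import hashlib
import sqlite3

def md5_hash(s: str) -> bytes:
    # 128-bit digest, stored as a 16-byte BLOB
    return hashlib.md5(s.encode("utf-8")).digest()

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE relations (
    relation_title TEXT, object_title TEXT,
    relation_title_hash BLOB, object_title_hash BLOB)""")
conn.execute(
    "CREATE INDEX ix_temp ON relations (relation_title_hash, object_title_hash)")

def insert(rel, obj):
    # INSERTs must now also populate the hash columns.
    conn.execute("INSERT INTO relations VALUES (?, ?, ?, ?)",
                 (rel, obj, md5_hash(rel), md5_hash(obj)))

for obj in ("apple", "apple", "pear"):
    insert("XXXX", obj)

# Filter and group on the narrow fixed-size hashes so only the (small)
# hash index is read, never the long strings.
rows = conn.execute(
    "SELECT object_title_hash, COUNT(*) FROM relations "
    "WHERE relation_title_hash = ? GROUP BY object_title_hash",
    (md5_hash("XXXX"),),
).fetchall()
```

Resolving a hash group back to its text value requires one extra lookup per group, which is cheap since the number of groups is far smaller than the number of rows.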

Also, if relation_title has a small varchar size but object_title is long, then you can potentially hash only object_title and create the index on (relation_title, object_title_hash).

Note that this solution only helps if one or both of these fields is very long relative to the size of the hashes.

Also note that hashing has interesting case-sensitivity/collation implications, since the hash of a lowercase string is not the same as the hash of an uppercase one. So you'll need to make sure you canonicalize the strings before hashing them; in other words, hash only the lowercased value if you're in a case-insensitive DB. You may also want to trim leading or trailing spaces, depending on how your DB handles them.
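The canonicalize-then-hash step might look like this (a sketch; the exact rules — lowercase, trim — should mirror whatever your column's collation treats as equal):

```python
import hashlib

def canonical_hash(s: str) -> str:
    # Trim and case-fold first, so values that a case-insensitive
    # collation considers equal also hash to the same value.
    canon = s.strip().lower()
    return hashlib.md5(canon.encode("utf-8")).hexdigest()

# Raw MD5 of "Apple" and " apple " would differ;
# the canonical hashes match.
```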

Answered by cheduardo

Indexing the columns in the GROUP BY clause would be the first thing to try, using a composite index. A query such as this can potentially be answered using only the index data, avoiding the need to scan the table at all. Since the records in the index are sorted, the DBMS should not need to perform a separate sort as part of the group processing. However, the index will slow down updates to the table, so be cautious with this if your table experiences heavy updates.

If you use InnoDB for the table storage, the table's rows will be physically clustered by the primary key index. If that (or a leading portion of it) happens to match your GROUP BY key, that should speed up a query such as this because related records will be retrieved together. Again, this avoids having to perform a separate sort.

In general, bitmap indexes would be another effective alternative, but MySQL does not currently support these, as far as I know.

A materialized view would be another possible approach, but again this is not supported directly in MySQL. However, if you did not require the COUNT statistics to be completely up-to-date, you could periodically run a CREATE TABLE ... AS SELECT ... statement to manually cache the results. This is a bit ugly as it is not transparent, but may be acceptable in your case.
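A runnable sketch of that manual cache, using SQLite via Python's sqlite3 (in MySQL you would schedule the rebuild with cron or an EVENT; the cache-table name is an illustrative assumption):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE relations (relation_title TEXT, object_title TEXT)")
conn.executemany("INSERT INTO relations VALUES (?, ?)",
                 [("XXXX", "apple"), ("XXXX", "apple"), ("XXXX", "pear")])

def refresh_cache():
    # Drop and rebuild the summary. Readers see slightly stale counts
    # between refreshes, but each read is a cheap lookup.
    conn.executescript("""
    DROP TABLE IF EXISTS relations_counts;
    CREATE TABLE relations_counts AS
      SELECT relation_title, object_title, COUNT(*) AS hot_num
      FROM relations GROUP BY relation_title, object_title;
    """)

refresh_cache()
rows = conn.execute(
    "SELECT object_title, hot_num FROM relations_counts "
    "WHERE relation_title = 'XXXX' ORDER BY object_title").fetchall()
print(rows)  # [('apple', 2), ('pear', 1)]
```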

You could also maintain a logical-level cache table using triggers. This table would have a column for each column in your GROUP BY clause, with a Count column for storing the number of rows for that particular grouping key value. Every time a row is added to or updated in the base table, insert or increment/decrement the counter row in the summary table for that particular grouping key. This may be better than the fake materialized view approach, as the cached summary will always be up-to-date, and each update is done incrementally and should have less of a resource impact. I think you would have to watch out for lock contention on the cache table, however.
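Here is a sketch of such a trigger-maintained summary table, using SQLite triggers through Python's sqlite3 (MySQL's trigger syntax differs, and a complete solution would also need DELETE/UPDATE triggers; the table names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE relations (relation_title TEXT, object_title TEXT);
CREATE TABLE relation_counts (
  relation_title TEXT,
  object_title TEXT,
  cnt INTEGER NOT NULL,
  PRIMARY KEY (relation_title, object_title)
);
-- Seed a zero row if the grouping key is new, then increment:
-- an upsert that also works on older SQLite versions.
CREATE TRIGGER trg_relations_insert AFTER INSERT ON relations
BEGIN
  INSERT OR IGNORE INTO relation_counts
    VALUES (NEW.relation_title, NEW.object_title, 0);
  UPDATE relation_counts SET cnt = cnt + 1
   WHERE relation_title = NEW.relation_title
     AND object_title = NEW.object_title;
END;
""")

for obj in ("apple", "apple", "pear"):
    conn.execute("INSERT INTO relations VALUES (?, ?)", ("XXXX", obj))

counts = conn.execute(
    "SELECT object_title, cnt FROM relation_counts "
    "WHERE relation_title = 'XXXX' ORDER BY object_title").fetchall()
print(counts)  # [('apple', 2), ('pear', 1)]
```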

Answered by Sorin Mocanu

If you have InnoDB, count(*) and any other aggregate function will do a table scan. I see a few solutions here:

  1. Use triggers and store aggregates in a separate table. Pros: integrity. Cons: slow updates
  2. Use processing queues. Pros: fast updates. Cons: old state can persist until the queue is processed so the user may feel a lack of integrity.
  3. Fully separate the storage access layer and store aggregates in a separate table. The storage layer will be aware of the data structure and can apply deltas instead of doing full counts. For example, if you provide an "addObject" function there, you will know when an object has been added and thus which aggregate is affected. Then you only run an update table set count = count + 1. Pros: fast updates, integrity (you may want to use a lock, though, in case several clients can alter the same record). Cons: you couple a bit of business logic and storage.
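Option 3 might be sketched like this (SQLite via Python's sqlite3; "addObject" comes from the answer, everything else — class, table, and column names — is an illustrative assumption):

```python
import sqlite3

class RelationStore:
    def __init__(self):
        self.conn = sqlite3.connect(":memory:")
        self.conn.executescript("""
        CREATE TABLE relations (relation_title TEXT, object_title TEXT);
        CREATE TABLE hot_counts (
          relation_title TEXT, object_title TEXT, count INTEGER,
          PRIMARY KEY (relation_title, object_title));
        """)

    def add_object(self, relation_title, object_title):
        # One transaction: write the base row and bump the aggregate
        # together, so the summary cannot drift from the base table.
        with self.conn:
            self.conn.execute("INSERT INTO relations VALUES (?, ?)",
                              (relation_title, object_title))
            self.conn.execute(
                "INSERT OR IGNORE INTO hot_counts VALUES (?, ?, 0)",
                (relation_title, object_title))
            self.conn.execute(
                "UPDATE hot_counts SET count = count + 1 "
                "WHERE relation_title = ? AND object_title = ?",
                (relation_title, object_title))

    def hot_num(self, relation_title, object_title):
        # Reads never touch the big table, only the small summary.
        row = self.conn.execute(
            "SELECT count FROM hot_counts "
            "WHERE relation_title = ? AND object_title = ?",
            (relation_title, object_title)).fetchone()
        return row[0] if row else 0
```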
Answered by Corey Ballou

I see that a few individuals have asked what engine you were using for the query. I would highly recommend you use MyISAM for the following reasons:

InnoDB - @Sorin Mocanu properly identified that you will do a full table scan regardless of indexes.

MyISAM - always keeps the current row count handy.

Lastly, as @justin stated, make sure you have the proper covering index:

CREATE INDEX ix_temp ON relations (relation_title, object_title);

Answered by Mark Schultheiss

Test count(myprimaryindexcolumn) and compare its performance to your count(*).
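One caveat worth knowing before benchmarking the two: COUNT(column) skips NULLs, while COUNT(*) counts every row, so they are only interchangeable on a NOT NULL column such as the primary key. A quick sketch (SQLite via Python's sqlite3; table name is illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, val TEXT)")
conn.executemany("INSERT INTO t (val) VALUES (?)", [("a",), (None,), ("b",)])

count_star = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]
count_id = conn.execute("SELECT COUNT(id) FROM t").fetchone()[0]
count_val = conn.execute("SELECT COUNT(val) FROM t").fetchone()[0]
# count_star == count_id == 3, but count_val == 2: NULLs are not counted.
```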

Answered by Haim Evgi

There is a point at which you truly need more RAM/CPUs/IO. You may have hit that point for your hardware.

I will note that it usually isn't effective to use indexes (unless they are covering) for queries that hit more than 1-2% of the total rows in a table. If your large query is doing index seeks and bookmark lookups, it could be because of a cached plan that was built for a query covering just a single day's data. Try adding WITH (INDEX=0) to force a table scan and see if it is faster.

Taken from: http://www.microsoft.com/communities/newsgroups/en-us/default.aspx?dg=microsoft.public.sqlserver.programming&tid=4631bab4-0104-47aa-b548-e8428073b6e6&cat=&lang=&cr=&sloc=&p=1

Answered by Tim Büthe

If you want the size of the whole table, you should query the meta tables or the information schema (these exist in every DBMS I know of, but I'm not sure about MySQL). If your query is selective, you have to make sure there is an index for it.

AFAIK there is nothing more you can do.

Answered by SoftwareGeek

I would suggest archiving the data unless there is a specific reason to keep it in the database, or you could partition the data and run the queries separately.
