SQL SQLite:在大表上计数很慢

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/8988915/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-09-01 14:07:49  来源:igfitidea点击:

SQLite: COUNT slow on big tables

sqldatabaseperformancesqlite

提问by Marc

I'm having a performance problem in SQLite with a SELECT COUNT(*) on a large tables.

我在 SQLite 中遇到性能问题,在大型表上使用 SELECT COUNT(*)。

As I didn't yet receive a usable answer and I did some further testing, I edited my question to incorporate my new findings.

由于我还没有收到可用的答案并且我做了一些进一步的测试,我编辑了我的问题以纳入我的新发现。

I have 2 tables:

我有2张桌子:

CREATE TABLE Table1 (
Key INTEGER NOT NULL,
... several other fields ...,
Status CHAR(1) NOT NULL,
Selection VARCHAR NULL,
CONSTRAINT PK_Table1 PRIMARY KEY (Key ASC))

CREATE Table2 (
Key INTEGER NOT NULL,
Key2 INTEGER NOT NULL,
... a few other fields ...,
CONSTRAINT PK_Table2 PRIMARY KEY (Key ASC, Key2 ASC))

Table1 has around 8 million records and Table2 has around 51 million records, and the databasefile is over 5GB.

Table1大约有800万条记录,Table2大约有5100万条记录,数据库文件超过5GB。

Table1 has 2 more indexes:

表 1 还有 2 个索引:

CREATE INDEX IDX_Table1_Status ON Table1 (Status ASC, Key ASC)
CREATE INDEX IDX_Table1_Selection ON Table1 (Selection ASC, Key ASC)

"Status" is required field, but has only 6 distinct values, "Selection" is not required and has only around 1.5 million values different from null and only around 600k distinct values.

“状态”是必填字段,但只有 6 个不同的值,“选择”不是必需的,并且只有大约 150 万个与 null 不同的值,并且只有大约 60 万个不同的值。

I did some tests on both tables, you can see the timings below, and I added the "explain query plan" for each request (QP). I placed the database file on an USB-memorystick so i could remove it after each test and get reliable results without interference of the disk cache. Some requests are faster on USB (I suppose due to lack of seektime), but some are slower (table scans).

我对两个表都做了一些测试,你可以看到下面的时间安排,我为每个请求(QP)添加了“解释查询计划”。我将数据库文件放在 USB 记忆棒上,这样我就可以在每次测试后将其删除,并在不干扰磁盘缓存的情况下获得可靠的结果。USB 上的某些请求速度更快(我想是由于缺少寻道时间),但有些请求速度较慢(表扫描)。

SELECT COUNT(*) FROM Table1
    Time: 105 sec
    QP: SCAN TABLE Table1 USING COVERING INDEX IDX_Table1_Selection(~1000000 rows)
SELECT COUNT(Key) FROM Table1
    Time: 153 sec
    QP: SCAN TABLE Table1 (~1000000 rows)
SELECT * FROM Table1 WHERE Key = 5123456
    Time: 5 ms
    QP: SEARCH TABLE Table1 USING INTEGER PRIMARY KEY (rowid=?) (~1 rows)
SELECT * FROM Table1 WHERE Status = 73 AND Key > 5123456 LIMIT 1
    Time: 16 sec
    QP: SEARCH TABLE Table1 USING INDEX IDX_Table1_Status (Status=?) (~3 rows)
SELECT * FROM Table1 WHERE Selection = 'SomeValue' AND Key > 5123456 LIMIT 1
    Time: 9 ms
    QP: SEARCH TABLE Table1 USING INDEX IDX_Table1_Selection (Selection=?) (~3 rows)

As you can see the counts are very slow, but normal selects are fast (except for the 2nd one, which took 16 seconds).

如您所见,计数非常慢,但正常选择很快(除了第二个,它用了 16 秒)。

The same goes for Table2:

表 2 也是如此:

SELECT COUNT(*) FROM Table2
    Time: 528 sec
    QP: SCAN TABLE Table2 USING COVERING INDEX sqlite_autoindex_Table2_1(~1000000 rows)
SELECT COUNT(Key) FROM Table2
    Time: 249 sec
    QP: SCAN TABLE Table2 (~1000000 rows)
SELECT * FROM Table2 WHERE Key = 5123456 AND Key2 = 0
    Time: 7 ms
    QP: SEARCH TABLE Table2 USING INDEX sqlite_autoindex_Table2_1 (Key=? AND Key2=?) (~1 rows)

Why is SQLite not using the automatically created index on the primary key on Table1 ? And why, when he uses the auto-index on Table2, it still takes a lot of time ?

为什么 SQLite 不在 Table1 的主键上使用自动创建的索引?为什么,当他在 Table2 上使用自动索引时,仍然需要很多时间?

I created the same tables with the same content and indexes on SQL Server 2008 R2 and there the counts are nearly instantaneous.

我在 SQL Server 2008 R2 上创建了具有相同内容和索引的相同表,并且计数几乎是即时的。

One of the comments below suggested executing ANALYZE on the database. I did and it took 11 minutes to complete. After that, I ran some of the tests again:

下面的评论之一建议对数据库执行 ANALYZE。我做了,花了 11 分钟才完成。之后,我再次运行了一些测试:

SELECT COUNT(*) FROM Table1
    Time: 104 sec
    QP: SCAN TABLE Table1 USING COVERING INDEX IDX_Table1_Selection(~7848023 rows)
SELECT COUNT(Key) FROM Table1
    Time: 151 sec
    QP: SCAN TABLE Table1 (~7848023 rows)
SELECT * FROM Table1 WHERE Status = 73 AND Key > 5123456 LIMIT 1
    Time: 5 ms
    QP: SEARCH TABLE Table1 USING INTEGER PRIMARY KEY (rowid>?) (~196200 rows)
SELECT COUNT(*) FROM Table2
    Time: 529 sec
    QP: SCAN TABLE Table2 USING COVERING INDEX sqlite_autoindex_Table2_1(~51152542 rows)
SELECT COUNT(Key) FROM Table2
    Time: 249 sec
    QP: SCAN TABLE Table2 (~51152542 rows)

As you can see, the queries took the same time (except the query plan is now showing the real number of rows), only the slower select is now also fast.

如您所见,查询花费了相同的时间(除了查询计划现在显示实际行数),只有较慢的选择现在也很快。

Next, I create dan extra index on the Key field of Table1, which should correspond to the auto-index. I did this on the original database, without the ANALYZE data. It took over 23 minutes to create this index (remember, this is on an USB-stick).

接下来,我在 Table1 的 Key 字段上创建 dan extra 索引,它应该对应于自动索引。我在原始数据库上做了这个,没有 ANALYZE 数据。创建这个索引花了超过 23 分钟(记住,这是在 U 盘上)。

CREATE INDEX IDX_Table1_Key ON Table1 (Key ASC)

Then I ran the tests again:

然后我再次运行测试:

SELECT COUNT(*) FROM Table1
    Time: 4 sec
    QP: SCAN TABLE Table1 USING COVERING INDEX IDX_Table1_Key(~1000000 rows)
SELECT COUNT(Key) FROM Table1
    Time: 167 sec
    QP: SCAN TABLE Table2 (~1000000 rows)
SELECT * FROM Table1 WHERE Status = 73 AND Key > 5123456 LIMIT 1
    Time: 17 sec
    QP: SEARCH TABLE Table1 USING INDEX IDX_Table1_Status (Status=?) (~3 rows)

As you can see, the index helped with the count(*), but not with the count(Key).

如您所见,索引对 count(*) 有帮助,但对 count(Key) 没有帮助。

Finaly, I created the table using a column constraint instead of a table constraint:

最后,我使用列约束而不是表约束创建了表:

CREATE TABLE Table1 (
Key INTEGER PRIMARY KEY ASC NOT NULL,
... several other fields ...,
Status CHAR(1) NOT NULL,
Selection VARCHAR NULL)

Then I ran the tests again:

然后我再次运行测试:

SELECT COUNT(*) FROM Table1
    Time: 6 sec
    QP: SCAN TABLE Table1 USING COVERING INDEX IDX_Table1_Selection(~1000000 rows)
SELECT COUNT(Key) FROM Table1
    Time: 28 sec
    QP: SCAN TABLE Table1 (~1000000 rows)
SELECT * FROM Table1 WHERE Status = 73 AND Key > 5123456 LIMIT 1
    Time: 10 sec
    QP: SEARCH TABLE Table1 USING INDEX IDX_Table1_Status (Status=?) (~3 rows)

Although the query plans are the same, the times are a lot better. Why is this ?

虽然查询计划相同,但时间要好很多。为什么是这样 ?

The problem is that ALTER TABLE does not permit to convert an existing table and I have a lot of existing databases which i can not convert to this form. Besides, using a column contraint instead of table constraint won't work for Table2.

问题是 ALTER TABLE 不允许转换现有表,而且我有很多现有数据库,我无法转换为这种形式。此外,使用列约束代替表约束对 Table2 不起作用。

Has anyone any idea what I am doing wrong and how to solve this problem ?

有谁知道我做错了什么以及如何解决这个问题?

I used System.Data.SQLite version 1.0.74.0 to create the tables and to run the tests I used SQLiteSpy 1.9.1.

我使用 System.Data.SQLite 1.0.74.0 版来创建表并运行我使用 SQLiteSpy 1.9.1 的测试。

Thanks,

谢谢,

Marc

马克

回答by Alix Axel

If you haven't DELETEd any records, doing:

如果您没有DELETE任何记录,请执行以下操作:

SELECT MAX(_ROWID_) FROM "table" LIMIT 1;

Will avoid the full-table scan. Note that _ROWID_is a SQLite identifier.

将避免全表扫描。请注意,这_ROWID_是一个 SQLite 标识符

回答by Alix Axel

From http://old.nabble.com/count(*)-slow-td869876.html

来自http://old.nabble.com/count(*)-slow-td869876.html

SQLite always does a full table scan for count(*). It does not keep meta information on tables to speed this process up.

SQLite 总是对 count(*) 进行全表扫描。它不会在表上保留元信息以加快此过程。

Not keeping meta information is a deliberate design decision. If each table stored a count (or better, each node of the B-tree stored a count) then much more updating would have to occur on every INSERT or DELETE. This would slow down INSERT and DELETE, even in the common case where count(*) speed is unimportant.

不保留元信息是一个深思熟虑的设计决定。如果每个表都存储了一个计数(或者更好的是,B 树的每个节点都存储了一个计数),那么每次插入或删除时都必须进行更多的更新。即使在 count(*) 速度不重要的常见情况下,这也会减慢 INSERT 和 DELETE 的速度。

If you really need a fast COUNT, then you can create a trigger on INSERT and DELETE that updates a running count in a separate table then query that separate table to find the latest count.

如果您确实需要快速计数,那么您可以在 INSERT 和 DELETE 上创建一个触发器,更新单独表中的运行计数,然后查询该单独表以查找最新计数。

Of course, it's not worth keeping a FULL row count if you need COUNTs dependent on WHERE clauses (i.e. WHERE field1 > 0 and field2 < 1000000000).

当然,如果您需要依赖于 WHERE 子句的 COUNT(即 WHERE field1 > 0 和 field2 < 1000000000),保持完整的行数是不值得的。

回答by Thinkeye

Do not count the stars, count the records! Or in other language, never issue

不要数星星,数记录!或用其他语言,从不发出

SELECT COUNT(*) FROM tablename;

SELECT COUNT(*) FROM tablename;

use

SELECT COUNT(ROWID) FROM tablename;

SELECT COUNT(ROWID) FROM tablename;

Call EXPLAIN QUERY PLAN for both to see the difference. Make sure you have an index in place containing all columns mentioned in the WHERE clause.

为两者调用 EXPLAIN QUERY PLAN 以查看差异。确保您有一个包含 WHERE 子句中提到的所有列的索引。

回答by Stephen Jennings

This may not help much, but you can run the ANALYZEcommand to rebuild statistics about your database. Try running "ANALYZE;" to rebuild statistics about the entire database, then run your query again and see if it is any faster.

这可能没有多大帮助,但您可以运行ANALYZE命令来重建有关您的数据库的统计信息。尝试运行“ ANALYZE;”来重建关于整个数据库的统计信息,然后再次运行您的查询,看看它是否更快。

回答by Donal Fellows

On the matter of the column constraint, SQLite maps columns that are declared to be INTEGER PRIMARY KEYto the internal row id (which in turn admits a number of internal optimizations). Theoretically, it could do the same for a separately-declared primary key constraint, but it appears not to do so in practice, at least with the version of SQLite in use. (System.Data.SQLite 1.0.74.0 corresponds to core SQLite 3.7.7.1. You might want to try re-checking your figures with 1.0.79.0; you shouldn't need to change your database to do that, just the library.)

关于列约束,SQLite 将声明为INTEGER PRIMARY KEY内部行 id的列映射到内部行 id(这又允许许多内部优化)。从理论上讲,它可以对单独声明的主键约束做同样的事情,但在实践中似乎不会这样做,至少在使用的 SQLite 版本中是这样。(System.Data.SQLite 1.0.74.0 对应于核心 SQLite 3.7.7.1。您可能想尝试使用 1.0.79.0 重新检查您的数字;您不需要更改数据库来执行此操作,只需更改库即可。)

回答by Matt T

The output for the fast queries all start with the text "QP: SEARCH". Whilst those for the slow queries start with text "QP: SCAN", which suggests that sqlite is performing a scan of the entire table in order to generate the count.

快速查询的输出都以文本“QP:SEARCH”开头。而那些慢查询以文本“QP:SCAN”开头,这表明sqlite正在执行整个表的扫描以生成计数。

Googling for "sqlite table scan count" finds the following, which suggests that using a full table scan to retrieve a count is just the way sqlite works, and is therefore probably unavoidable.

谷歌搜索“sqlite table scan count”发现以下内容,这表明使用全表扫描来检索计数正是sqlite的工作方式,因此可能是不可避免的。

As a workaround, and given that status has only eight values, I wondered if you could get a count quickly using a query like the following?

作为一种解决方法,并且鉴于该状态只有八个值,我想知道您是否可以使用如下查询快速获取计数?

select 1 where status=1 union select 1 where status=2 ...

select 1 where status=1 union select 1 where status=2 ...

then count the rows in the result. This is clearly ugly, but it might work if it persuades sqlite to run the query as a search rather than a scan. The idea of returning "1" each time is to avoid the overhead of returning real data.

然后计算结果中的行数。这显然很难看,但如果它说服 sqlite 将查询作为搜索而不是扫描运行,它可能会起作用。每次返回“1”的想法是为了避免返回真实数据的开销。

回答by user1214836

Here's a potential workaround to improve the query performance. From the context, it sounds like your query takes about a minute and a half to run.

这是提高查询性能的潜在解决方法。从上下文来看,听起来您的查询需要大约一分半钟才能运行。

Assuming you have a date_created column (or can add one), run a query in the background each day at midnight (say at 00:05am) and persist the value somewhere along with the last_updated date it was calculated (I'll come back to that in a bit).

假设你有一个 date_created 列(或者可以添加一个),每天午夜(比如在 00:05am)在后台运行一个查询并将值与计算的 last_updated 日期一起保存在某个地方(我会回来一点点)。

Then, running against your date_created column (with an index), you can avoid a full table scan by doing a query like SELECT COUNT(*) FROM TABLE WHERE date_updated > "[TODAY] 00:00:05".

然后,针对您的 date_created 列(带有索引)运行,您可以通过执行诸如 SELECT COUNT(*) FROM TABLE WHERE date_updated > "[TODAY] 00:00:05" 之类的查询来避免全表扫描。

Add the count value from that query to your persisted value, and you have a reasonably fast count that's generally accurate.

将来自该查询的计数值添加到您的持久值中,您将获得一个相当快且通常准确的计数。

The only catch is that from 12:05am to 12:07am (the duration during which your total count query is running) you have a race condition which you can check the last_updated value of your full table scan count(). If it's > 24 hours old, then your incremental count query needs to pull a full day's count plus time elapsed today. If it's < 24 hours old, then your incremental count query needs to pull a partial day's count (just time elapsed today).

唯一的问题是,从上午 12:05 到上午 12:07(运行总计数查询的持续时间),您有一个竞争条件,您可以检查全表扫描 count() 的 last_updated 值。如果它已经超过 24 小时,那么您的增量计数查询需要提取一整天的计数加上今天经过的时间。如果它小于 24 小时,那么您的增量计数查询需要提取不完整的一天的计数(只是今天过去的时间)。

回答by Vitalii

I had the same problem, in my situation VACUUM command helped. After its execution on database COUNT(*) speed increased near 100 times. However, command itself needs some minutes in my database (20 millions records). I solved this problem by running VACUUM when my software exits after main window destruction, so the delay doesn't make problems to user.

我遇到了同样的问题,在我的情况下 VACUUM 命令有帮助。在数据库 COUNT(*) 上执行后,速度提高了近 100 倍。但是,命令本身在我的数据库中需要几分钟(2000 万条记录)。当我的软件在主窗口破坏后退出时,我通过运行 VACUUM 解决了这个问题,因此延迟不会给用户带来问题。