database: SQLite insert speed slows as number of records increases due to an index
Disclaimer: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license, link back to the original question, and attribute it to the original authors (not me): Stack Overflow
Original URL: http://stackoverflow.com/questions/15778716/
SQLite insert speed slows as number of records increases due to an index
Asked by
Original question
Background
It is well known that SQLite needs to be fine-tuned to achieve insert speeds on the order of 50k inserts/s. There are many questions here regarding slow insert speeds, and a wealth of advice and benchmarks.
There are also claims that SQLite can handle large amounts of data, with reports of 50+ GB not causing any problems with the right settings.
I have followed the advice here and elsewhere to achieve these speeds and I'm happy with 35k-45k inserts/s. The problem I have is that all of the benchmarks only demonstrate fast insert speeds with < 1m records. What I am seeing is that insert speed seems to be inversely proportional to table size.
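For context, the tuning in question is just the standard advice applied from Perl via DBI and DBD::SQLite. A minimal sketch follows; the file name is a placeholder and the pragma values are the ones used in the benchmark further down (see the UPDATE):

use strict;
use warnings;
use DBI;

my $dbh = DBI->connect("dbi:SQLite:dbname=links.db", "", "",
                       { RaiseError => 1, AutoCommit => 1 });

# The usual bulk-insert tuning: bigger cache, in-memory temp storage,
# and relaxed durability while loading.
$dbh->do("PRAGMA cache_size   = 10000");
$dbh->do("PRAGMA page_size    = 4096");    # only takes effect before the database holds data (or after VACUUM)
$dbh->do("PRAGMA temp_store   = MEMORY");
$dbh->do("PRAGMA journal_mode = DELETE");
$dbh->do("PRAGMA synchronous  = OFF");     # trade crash safety for speed during bulk loads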
Issue
My use case requires storing 500m to 1b tuples ([x_id, y_id, z_id]) over a few years (1m rows / day) in a link table. The values are all integer IDs between 1 and 2,000,000. There is a single index on z_id.
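The schema itself isn't shown above; a minimal version consistent with the description, together with the batched insert loop, would look roughly like this (the table name links and the next_tuple() data source are placeholders; $dbh is the connected handle from the previous sketch):

# Assumed link-table schema: three integer IDs per row, single index on z_id.
$dbh->do(<<'SQL');
CREATE TABLE IF NOT EXISTS links (
    x_id INTEGER NOT NULL,
    y_id INTEGER NOT NULL,
    z_id INTEGER NOT NULL
)
SQL
$dbh->do("CREATE INDEX IF NOT EXISTS idx_links_z ON links (z_id)");

# Batched inserts: one transaction per 50,000 rows rather than one per row.
my $sth = $dbh->prepare("INSERT INTO links (x_id, y_id, z_id) VALUES (?, ?, ?)");
my $n   = 0;
$dbh->begin_work;
while (my $row = next_tuple()) {      # next_tuple() stands in for the real data source
    $sth->execute(@$row);
    if (++$n % 50_000 == 0) {
        $dbh->commit;
        $dbh->begin_work;
    }
}
$dbh->commit;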
Performance is great for the first 10m rows, ~35k inserts/s, but by the time the table has ~20m rows, performance starts to suffer. I'm now seeing about 100 inserts/s.
The size of the table is not particularly large. With 20m rows, the size on disk is around 500MB.
The project is written in Perl.
Question
Is this the reality of large tables in SQLite, or are there any secrets to maintaining high insert rates for tables with > 10m rows?
Known workarounds which I'd like to avoid if possible
- Drop the index, add the records, and re-index (sketched after this list): this is fine as a workaround, but it doesn't work when the DB still needs to be usable during updates; making the database completely inaccessible for x minutes per day won't work.
- Break the table into smaller subtables / files: this will work in the short term and I have already experimented with it. The problem is that I need to be able to retrieve data from the entire history when querying, which means that eventually I'll hit the 62-table attachment limit. Attaching, collecting results in a temp table, and detaching hundreds of times per request seems to be a lot of work and overhead, but I'll try it if there are no other alternatives.
- Set SQLITE_FCNTL_CHUNK_SIZE: I don't know C (?!), so I'd prefer not to learn it just to get this done. I can't see any way to set this parameter from Perl, though.
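For reference, the first workaround is trivial to express; what rules it out is the downtime while the index is missing, not the code. A sketch, reusing the table and index names assumed earlier (load_todays_rows() is a stand-in for the real batched inserts):

# Drop the index, bulk-load, then rebuild. Queries on z_id are effectively
# unusable while the index is missing, which is exactly the problem.
$dbh->do("DROP INDEX IF EXISTS idx_links_z");

load_todays_rows($dbh);   # stand-in for the real batched inserts

$dbh->do("CREATE INDEX idx_links_z ON links (z_id)");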
UPDATE
Following Tim's suggestion that an index was causing increasingly slow insert times despite SQLite's claims that it is capable of handling large data sets, I performed a benchmark comparison with the following settings:
- inserted rows: 14 million
- commit batch size: 50,000 records
- cache_size pragma: 10,000
- page_size pragma: 4,096
- temp_store pragma: memory
- journal_mode pragma: delete
- synchronous pragma: off
In my project, as in the benchmark results below, a file-based temporary table is created and SQLite's built-in support for importing CSV data is used. The temporary table is then attached to the receiving database and sets of 50,000 rows are inserted with an insert-select statement. Therefore, the insert times do not reflect file-to-database insert times, but rather table-to-table insert speed. Taking the CSV import time into account would reduce the speeds by 25-50% (a very rough estimate; it doesn't take long to import the CSV data).
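A stripped-down sketch of that attach-and-copy step (the staging file name, table name, and rowid-ordered batching are placeholders for what the real code does):

# Attach the file-based staging database that already holds the CSV-imported rows.
$dbh->do("ATTACH DATABASE 'staging.db' AS staging");

# Copy across in 50,000-row slices with INSERT ... SELECT.
my $offset = 0;
while (1) {
    my $copied = $dbh->do(
        "INSERT INTO links (x_id, y_id, z_id)
         SELECT x_id, y_id, z_id FROM staging.import_batch
         ORDER BY rowid LIMIT 50000 OFFSET $offset"
    );
    last if $copied == 0;   # DBI returns "0E0" (numerically zero) when nothing was copied
    $offset += 50_000;
}

$dbh->do("DETACH DATABASE staging");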
Clearly having an index causes the slowdown in insert speed as table size increases.


It's quite clear from the data above that the correct answer can be assigned to Tim's answer rather than the assertions that SQLite just can't handle it. Clearly it can handle large datasets if indexing that dataset is not part of your use case. I have been using SQLite for just that for a while now, as a backend for a logging system that does not need to be indexed, so I was quite surprised at the slowdown I experienced.
Conclusion
If anyone finds themselves wanting to store a large amount of data using SQLite and have it indexed, using shards may be the answer. I eventually settled on using the first three characters of an MD5 hash of a unique column in z to determine assignment to one of 4,096 databases. Since my use case is primarily archival in nature, the schema will not change and queries will never require shard walking. There is a limit to database size since extremely old data will be reduced and eventually discarded, so this combination of sharding, pragma settings, and even some denormalisation gives me a nice balance that will, based on the benchmarking above, maintain an insert speed of at least 10k inserts / second.
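A sketch of the shard-routing idea (the key column and file naming here are illustrative, not my exact scheme): three hex characters of the MD5 give 16^3 = 4,096 possible prefixes, each mapped to its own database file, so a given key always lands in, and is read from, the same shard.

use strict;
use warnings;
use DBI;
use Digest::MD5 qw(md5_hex);

# Pick one of 4,096 shard databases from the first three hex characters of
# the MD5 of the sharding key (a unique column related to z; placeholder here).
sub shard_db_for {
    my ($key) = @_;
    my $prefix = substr(md5_hex($key), 0, 3);   # e.g. "a3f" -> one of 16^3 = 4,096 values
    return "links_$prefix.db";
}

my $key = "some-unique-z-value";                # placeholder key
my $dbh = DBI->connect("dbi:SQLite:dbname=" . shard_db_for($key), "", "",
                       { RaiseError => 1 });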
Accepted answer by Tim
If your requirement is to find a particular z_id and the x_ids and y_ids linked to it (as distinct from quickly selecting a range of z_ids), you could look into a non-indexed hash-table nested-relational db that would allow you to instantly find your way to a particular z_id in order to get its y_ids and x_ids -- without the indexing overhead and the concomitant degraded performance during inserts as the index grows. In order to avoid clumping (aka bucket collisions), choose a key hashing algorithm that puts greatest weight on the digits of z_id with greatest variation (right-weighted).
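For example, with mostly sequential integer IDs the low-order (rightmost) digits vary the most, so even a trivial bucket function keyed off that part of the value spreads records fairly evenly. A toy sketch, not a production hashing scheme:

# Toy right-weighted bucketing: a plain modulo is driven by the low-order
# (rightmost, most variable) part of the key.
sub bucket_for {
    my ($z_id, $n_buckets) = @_;
    return $z_id % $n_buckets;
}

my $bucket = bucket_for(1_234_567, 4096);   # e.g. 4,096 buckets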
P.S. A database that uses a b-tree may at first appear faster than a db that uses linear hashing, say, but the insert performance will remain level with the linear hash as performance on the b-tree begins to degrade.
P.P.S. To answer @kawing-chiu's question: the core feature relevant here is that such a database relies on so-called "sparse" tables in which the physical location of a record is determined by a hashing algorithm which takes the record key as input. This approach permits a seek directly to the record's location in the table without the intermediary of an index. As there is no need to traverse indexes or to re-balance indexes, insert times remain constant as the table becomes more densely populated. With a b-tree, by contrast, insert times degrade as the index tree grows. OLTP applications with large numbers of concurrent inserts can benefit from such a sparse-table approach. The records are scattered throughout the table. The downside of records being scattered across the "tundra" of the sparse table is that gathering large sets of records which have a value in common, such as a postal code, can be slower. The hashed sparse-table approach is optimized to insert and retrieve individual records, and to retrieve networks of related records, not large sets of records that have some field value in common.
A nested relational database is one that permits tuples within a column of a row.
Answer by Alix Axel
Great question and very interesting follow-up!
I would just like to make a quick remark: you mentioned that breaking the table into smaller subtables / files and attaching them later is not an option because you'll quickly reach the hard limit of 62 attached databases. While this is completely true, I don't think you have considered a mid-way option: sharding the data into several tables but keep using the same, single database (file).
I did a very crude benchmark just to make sure my suggestion really has an impact on performance.
Schema:
CREATE TABLE IF NOT EXISTS "test_$i"
(
"i" integer NOT NULL,
"md5" text(32) NOT NULL
);
Data - 2 Million Rows:
- i = 1..2,000,000
- md5 = MD5 hex digest of i
Each transaction = 50,000 INSERTs.
Databases: 1; Tables: 1; Indexes: 0
0..50000 records inserted in 1.87 seconds
50000..100000 records inserted in 1.92 seconds
100000..150000 records inserted in 1.97 seconds
150000..200000 records inserted in 1.99 seconds
200000..250000 records inserted in 2.19 seconds
250000..300000 records inserted in 1.94 seconds
300000..350000 records inserted in 1.94 seconds
350000..400000 records inserted in 1.94 seconds
400000..450000 records inserted in 1.94 seconds
450000..500000 records inserted in 2.50 seconds
500000..550000 records inserted in 1.94 seconds
550000..600000 records inserted in 1.94 seconds
600000..650000 records inserted in 1.93 seconds
650000..700000 records inserted in 1.94 seconds
700000..750000 records inserted in 1.94 seconds
750000..800000 records inserted in 1.94 seconds
800000..850000 records inserted in 1.93 seconds
850000..900000 records inserted in 1.95 seconds
900000..950000 records inserted in 1.94 seconds
950000..1000000 records inserted in 1.94 seconds
1000000..1050000 records inserted in 1.95 seconds
1050000..1100000 records inserted in 1.95 seconds
1100000..1150000 records inserted in 1.95 seconds
1150000..1200000 records inserted in 1.95 seconds
1200000..1250000 records inserted in 1.96 seconds
1250000..1300000 records inserted in 1.98 seconds
1300000..1350000 records inserted in 1.95 seconds
1350000..1400000 records inserted in 1.95 seconds
1400000..1450000 records inserted in 1.95 seconds
1450000..1500000 records inserted in 1.95 seconds
1500000..1550000 records inserted in 1.95 seconds
1550000..1600000 records inserted in 1.95 seconds
1600000..1650000 records inserted in 1.95 seconds
1650000..1700000 records inserted in 1.96 seconds
1700000..1750000 records inserted in 1.95 seconds
1750000..1800000 records inserted in 1.95 seconds
1800000..1850000 records inserted in 1.94 seconds
1850000..1900000 records inserted in 1.95 seconds
1900000..1950000 records inserted in 1.95 seconds
1950000..2000000 records inserted in 1.95 seconds
Database file size: 89.2 MiB.
Databases: 1; Tables: 1; Indexes: 1 (md5)
0..50000 records inserted in 2.90 seconds
50000..100000 records inserted in 11.64 seconds
100000..150000 records inserted in 10.85 seconds
150000..200000 records inserted in 10.62 seconds
200000..250000 records inserted in 11.28 seconds
250000..300000 records inserted in 12.09 seconds
300000..350000 records inserted in 10.60 seconds
350000..400000 records inserted in 12.25 seconds
400000..450000 records inserted in 13.83 seconds
450000..500000 records inserted in 14.48 seconds
500000..550000 records inserted in 11.08 seconds
550000..600000 records inserted in 10.72 seconds
600000..650000 records inserted in 14.99 seconds
650000..700000 records inserted in 10.85 seconds
700000..750000 records inserted in 11.25 seconds
750000..800000 records inserted in 17.68 seconds
800000..850000 records inserted in 14.44 seconds
850000..900000 records inserted in 19.46 seconds
900000..950000 records inserted in 16.41 seconds
950000..1000000 records inserted in 22.41 seconds
1000000..1050000 records inserted in 24.68 seconds
1050000..1100000 records inserted in 28.12 seconds
1100000..1150000 records inserted in 26.85 seconds
1150000..1200000 records inserted in 28.57 seconds
1200000..1250000 records inserted in 29.17 seconds
1250000..1300000 records inserted in 36.99 seconds
1300000..1350000 records inserted in 30.66 seconds
1350000..1400000 records inserted in 32.06 seconds
1400000..1450000 records inserted in 33.14 seconds
1450000..1500000 records inserted in 47.74 seconds
1500000..1550000 records inserted in 34.51 seconds
1550000..1600000 records inserted in 39.16 seconds
1600000..1650000 records inserted in 37.69 seconds
1650000..1700000 records inserted in 37.82 seconds
1700000..1750000 records inserted in 41.43 seconds
1750000..1800000 records inserted in 49.58 seconds
1800000..1850000 records inserted in 44.08 seconds
1850000..1900000 records inserted in 57.17 seconds
1900000..1950000 records inserted in 50.04 seconds
1950000..2000000 records inserted in 42.15 seconds
Database file size: 181.1 MiB.
Databases: 1; Tables: 20 (one per 100,000 records); Indexes: 1 (md5)
0..50000 records inserted in 2.91 seconds
50000..100000 records inserted in 10.30 seconds
100000..150000 records inserted in 10.85 seconds
150000..200000 records inserted in 10.45 seconds
200000..250000 records inserted in 10.11 seconds
250000..300000 records inserted in 11.04 seconds
300000..350000 records inserted in 10.25 seconds
350000..400000 records inserted in 10.36 seconds
400000..450000 records inserted in 11.48 seconds
450000..500000 records inserted in 10.97 seconds
500000..550000 records inserted in 10.86 seconds
550000..600000 records inserted in 10.35 seconds
600000..650000 records inserted in 10.77 seconds
650000..700000 records inserted in 10.62 seconds
700000..750000 records inserted in 10.57 seconds
750000..800000 records inserted in 11.13 seconds
800000..850000 records inserted in 10.44 seconds
850000..900000 records inserted in 10.40 seconds
900000..950000 records inserted in 10.70 seconds
950000..1000000 records inserted in 10.53 seconds
1000000..1050000 records inserted in 10.98 seconds
1050000..1100000 records inserted in 11.56 seconds
1100000..1150000 records inserted in 10.66 seconds
1150000..1200000 records inserted in 10.38 seconds
1200000..1250000 records inserted in 10.24 seconds
1250000..1300000 records inserted in 10.80 seconds
1300000..1350000 records inserted in 10.85 seconds
1350000..1400000 records inserted in 10.46 seconds
1400000..1450000 records inserted in 10.25 seconds
1450000..1500000 records inserted in 10.98 seconds
1500000..1550000 records inserted in 10.15 seconds
1550000..1600000 records inserted in 11.81 seconds
1600000..1650000 records inserted in 10.80 seconds
1650000..1700000 records inserted in 11.06 seconds
1700000..1750000 records inserted in 10.24 seconds
1750000..1800000 records inserted in 10.57 seconds
1800000..1850000 records inserted in 11.54 seconds
1850000..1900000 records inserted in 10.80 seconds
1900000..1950000 records inserted in 11.07 seconds
1950000..2000000 records inserted in 13.27 seconds
Database file size: 180.1 MiB.
As you can see, the insert speed remains pretty much constant if you shard the data into several tables.
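A rough sketch of how the insert side of that third configuration can be driven (the Perl below is illustrative, not the benchmark script itself): each row is routed to the test_$i sub-table that owns its 100,000-row slice, all within one database file.

use strict;
use warnings;
use DBI;
use Digest::MD5 qw(md5_hex);

my $dbh = DBI->connect("dbi:SQLite:dbname=bench.db", "", "", { RaiseError => 1 });

my $rows_per_table = 100_000;
my %created;                                        # sub-tables already created

for (my $start = 0; $start < 2_000_000; $start += 50_000) {
    $dbh->begin_work;                               # one transaction per 50,000 INSERTs
    for my $i ($start + 1 .. $start + 50_000) {
        my $t = int(($i - 1) / $rows_per_table);    # which 100,000-row sub-table owns row $i
        unless ($created{$t}++) {
            $dbh->do(qq{CREATE TABLE IF NOT EXISTS "test_$t" ("i" integer NOT NULL, "md5" text(32) NOT NULL)});
            $dbh->do(qq{CREATE INDEX IF NOT EXISTS "md5_$t" ON "test_$t" ("md5")});
        }
        $dbh->do(qq{INSERT INTO "test_$t" ("i", "md5") VALUES (?, ?)}, undef, $i, md5_hex($i));
    }
    $dbh->commit;
}

Re-preparing the INSERT for every row keeps the sketch short; a per-table prepared statement would be faster in practice.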
Answer by Michael Pratt
Unfortunately I'd say this is a limitation of large tables in SQLite. It's not designed to operate on large-scale or large-volume datasets. While I understand it may drastically increase project complexity, you're probably better off researching more sophisticated database solutions appropriate to your needs.
From everything you linked, it looks like table size to access speed is a direct tradeoff. Can't have both.
Answer by Max
In my project, I couldn't shard the database, as it's indexed on different columns. To speed up the inserts, I put the database on /dev/shm (= Linux ramdisk) during creation and then copy it over to local disk. That obviously only works well for a write-once, read-many database.
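A minimal sketch of that approach (paths are placeholders):

use strict;
use warnings;
use DBI;
use File::Copy qw(move);

my $ram_path  = "/dev/shm/build.db";   # ramdisk: fast, but gone on reboot
my $disk_path = "/data/final.db";      # durable destination

my $dbh = DBI->connect("dbi:SQLite:dbname=$ram_path", "", "", { RaiseError => 1 });
# ... create tables, build indexes, and run all the inserts against $dbh here ...
$dbh->disconnect;

# Once the write-once build is finished, move the file onto durable storage.
move($ram_path, $disk_path) or die "move failed: $!";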
Answer by Clock ZHONG
I suspect that hash value collisions in the index cause the insert speed to slow down.
When we have many, many rows in one table, hash value collisions on the indexed column will happen more frequently. It means the SQLite engine needs to calculate the hash value two, three, or maybe even four times in order to get a different hash value.
So I guess this is the root cause of the SQLite insert slowness when the table has many rows.
This point could explain why using shards avoids the problem. Is there a real expert in the SQLite domain who can confirm or deny my point here?

