SQL SQLite 可以处理 9000 万条记录吗?
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/3160987/
Warning: these are provided under the CC BY-SA 4.0 license. You are free to use/share them, but you must attribute them to the original authors (not me):
StackOverFlow
Can SQLite handle 90 million records?
提问by Brian O'Kennedy
Or should I use a different hammer to fix this problem.
或者我应该使用不同的锤子来解决这个问题。
I've got a very simple use-case for storing data, effectively a sparse matrix, which I've attempted to store in a SQLite database. I've created a table:
我有一个非常简单的用于存储数据的用例,实际上是一个稀疏矩阵,我试图将其存储在 SQLite 数据库中。我创建了一个表:
create TABLE data ( id1 INTEGER KEY, timet INTEGER KEY, value REAL )
into which I insert a lot of data (800 elements every 10 minutes, 45 times a day), most days of the year. The tuple of (id1, timet) will always be unique.
一年中的大部分日子,我都会往里面插入大量数据(每 10 分钟 800 个元素,每天 45 次)。(id1, timet) 这个元组将始终是唯一的。
The timet value is seconds since the epoch, and will always be increasing. The id1 is, for all practical purposes, a random integer. There are probably only 20,000 unique ids, though.
timet 值是自纪元以来的秒数,并且会一直增加。就实际用途而言,id1 是一个随机整数,不过大概只有 20000 个不同的 ID。
I'd then like to access all values where id1==someid or access all elements where timet==sometime. On my tests using the latest SQLite via the C interface on Linux, a lookup for one of these (or any variant of this lookup) takes approximately 30 seconds, which is not fast enough for my use case.
然后我想访问 id1==someid 的所有值或访问 timet==sometime 的所有元素。在我通过 Linux 上的 C 接口使用最新 SQLite 的测试中,查找其中一个(或此查找的任何变体)大约需要 30 秒,这对于我的用例来说不够快。
I tried defining an index for the database, but this slowed down insertion to completely unworkable speeds (I might have done this incorrectly though...)
我尝试为数据库定义索引,但这让插入速度慢到完全无法使用(不过也可能是我做错了……)
The table above leads to very slow access for any data. My question is:
上表导致对任何数据的访问都非常缓慢。我的问题是:
- Is SQLite completely the wrong tool for this?
- Can I define indices to speed things up significantly?
- Should I be using something like HDF5 instead of SQL for this?
- SQLite 是完全错误的工具吗?
- 我可以定义索引以显着加快速度吗?
- 我应该为此使用 HDF5 之类的东西而不是 SQL 吗?
Please excuse my very basic understanding of SQL!
请原谅我对 SQL 只有非常基础的了解!
Thanks
谢谢
I include a code sample that shows how the insertion speed slows to a crawl when using indices. With the 'create index' statements in place, the code takes 19 minutes to complete. Without that, it runs in 18 seconds.
我附上了一个代码示例,展示了使用索引时插入速度会变得多么缓慢。加上“create index”语句后,这段代码需要 19 分钟才能跑完;去掉这些语句,只需 18 秒。
#include <iostream>
#include <string>
#include <cstdio>
#include <cstdlib>
#include <sqlite3.h>

void checkdbres( int res, int expected, const std::string msg )
{
  if (res != expected) { std::cerr << msg << std::endl; exit(1); }
}

int main(int argc, char **argv)
{
  const size_t nRecords = 800*45*30;

  sqlite3 *dbhandle = NULL;
  sqlite3_stmt *pStmt = NULL;
  char statement[512];

  checkdbres( sqlite3_open("/tmp/junk.db", &dbhandle ), SQLITE_OK, "Failed to open db");

  // Create the table and the two indices up front.
  checkdbres( sqlite3_prepare_v2( dbhandle, "create table if not exists data ( issueid INTEGER KEY, time INTEGER KEY, value REAL);", -1, &pStmt, NULL ), SQLITE_OK, "Failed to build create table statement");
  checkdbres( sqlite3_step( pStmt ), SQLITE_DONE, "Failed to execute create table statement" );
  checkdbres( sqlite3_finalize( pStmt ), SQLITE_OK, "Failed to finalize create table");

  checkdbres( sqlite3_prepare_v2( dbhandle, "create index issueidindex on data (issueid);", -1, &pStmt, NULL ), SQLITE_OK, "Failed to build create index statement");
  checkdbres( sqlite3_step( pStmt ), SQLITE_DONE, "Failed to execute create index statement" );
  checkdbres( sqlite3_finalize( pStmt ), SQLITE_OK, "Failed to finalize create index");

  checkdbres( sqlite3_prepare_v2( dbhandle, "create index timeindex on data (time);", -1, &pStmt, NULL ), SQLITE_OK, "Failed to build create index statement");
  checkdbres( sqlite3_step( pStmt ), SQLITE_DONE, "Failed to execute create index statement" );
  checkdbres( sqlite3_finalize( pStmt ), SQLITE_OK, "Failed to finalize create index");

  // Insert the records in batches of 800, one transaction per batch.
  for ( size_t idx=0; idx < nRecords; ++idx)
  {
    if (idx%800==0)
    {
      checkdbres( sqlite3_prepare_v2( dbhandle, "BEGIN TRANSACTION", -1, &pStmt, NULL ), SQLITE_OK, "Failed to begin transaction");
      checkdbres( sqlite3_step( pStmt ), SQLITE_DONE, "Failed to execute begin transaction" );
      checkdbres( sqlite3_finalize( pStmt ), SQLITE_OK, "Failed to finalize begin transaction");
      std::cout << "idx " << idx << " of " << nRecords << std::endl;
    }

    const size_t time = idx/800;
    const size_t issueid = idx % 800;
    const float value = static_cast<float>(rand()) / RAND_MAX;
    sprintf( statement, "insert into data values (%d,%d,%f);", (int)issueid, (int)time, value );
    checkdbres( sqlite3_prepare_v2( dbhandle, statement, -1, &pStmt, NULL ), SQLITE_OK, "Failed to build insert statement");
    checkdbres( sqlite3_step( pStmt ), SQLITE_DONE, "Failed to execute insert statement" );
    checkdbres( sqlite3_finalize( pStmt ), SQLITE_OK, "Failed to finalize insert");

    if (idx%800==799)
    {
      checkdbres( sqlite3_prepare_v2( dbhandle, "END TRANSACTION", -1, &pStmt, NULL ), SQLITE_OK, "Failed to end transaction");
      checkdbres( sqlite3_step( pStmt ), SQLITE_DONE, "Failed to execute end transaction" );
      checkdbres( sqlite3_finalize( pStmt ), SQLITE_OK, "Failed to finalize end transaction");
    }
  }

  checkdbres( sqlite3_close( dbhandle ), SQLITE_OK, "Failed to close db" );
  return 0;
}
采纳答案by Robert Harvey
Are you inserting all of the 800 elements at once? If you are, doing the inserts within a transaction will speed up the process dramatically.
您是否一次插入所有 800 个元素?如果是,在事务中进行插入将大大加快进程。
See http://www.sqlite.org/faq.html#q19
见http://www.sqlite.org/faq.html#q19
SQLite can handle very large databases. See http://www.sqlite.org/limits.html
SQLite 可以处理非常大的数据库。见http://www.sqlite.org/limits.html
回答by Robert Harvey
I've looked at your code, and I think you might be overdoing it with the prepare and finalize statements. I am by no means an SQLite expert, but there's got to be significant overhead in preparing a statement each and every time through the loop.
我看过你的代码,我认为你在 prepare 和 finalize 语句上可能做得有些过头了。我绝不是 SQLite 专家,但每次循环都重新准备一条语句,开销肯定相当可观。
Quoting from the SQLite website:
引自 SQLite 网站:
After a prepared statement has been evaluated by one or more calls to sqlite3_step(), it can be reset in order to be evaluated again by a call to sqlite3_reset(). Using sqlite3_reset() on an existing prepared statement rather than creating a new prepared statement avoids unnecessary calls to sqlite3_prepare(). In many SQL statements, the time needed to run sqlite3_prepare() equals or exceeds the time needed by sqlite3_step(). So avoiding calls to sqlite3_prepare() can result in a significant performance improvement.
在通过一次或多次调用 sqlite3_step() 对预编译语句求值之后,可以调用 sqlite3_reset() 将其重置,以便再次求值。在现有的预编译语句上使用 sqlite3_reset(),而不是创建新的预编译语句,可以避免对 sqlite3_prepare() 的不必要调用。对于许多 SQL 语句,运行 sqlite3_prepare() 所需的时间等于或超过 sqlite3_step() 所需的时间。因此,避免调用 sqlite3_prepare() 可以显著提高性能。
http://www.sqlite.org/cintro.html
In your case, rather than preparing a new statement each time, you could try binding new values to your existing statement.
在您的情况下,您可以尝试将新值绑定到现有语句,而不是每次都准备一个新语句。
All this said, I think the indexes might be the actual culprit, since the time keeps increasing as you add more data. I am curious enough about this where I plan to do some testing over the weekend.
综上所述,我认为索引可能是真正的罪魁祸首,因为随着您添加更多数据,时间会不断增加。我对此很好奇,我计划在周末进行一些测试。
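For reference, here is a minimal sketch (not part of the original thread) of the bind/step/reset pattern described above. It assumes the same kind of data table as in the question and a small check() helper modelled on the asker's checkdbres(); the hard-coded timestamp is purely illustrative. The INSERT is prepared once, then reused for a batch of 800 rows inside a single transaction.

#include <iostream>
#include <cstdlib>
#include <sqlite3.h>

// Illustrative helper, modelled on checkdbres() from the question.
static void check( int res, int expected, const char *msg )
{
  if (res != expected) { std::cerr << msg << std::endl; exit(1); }
}

int main()
{
  sqlite3 *db = NULL;
  check( sqlite3_open("/tmp/junk.db", &db), SQLITE_OK, "Failed to open db" );
  check( sqlite3_exec( db, "CREATE TABLE IF NOT EXISTS data (issueid INTEGER, time INTEGER, value REAL);", NULL, NULL, NULL ), SQLITE_OK, "Failed to create table" );

  // Prepare the INSERT once; reuse it for every row via bind/step/reset.
  sqlite3_stmt *ins = NULL;
  check( sqlite3_prepare_v2( db, "INSERT INTO data (issueid, time, value) VALUES (?,?,?);", -1, &ins, NULL ), SQLITE_OK, "Failed to prepare insert" );

  check( sqlite3_exec( db, "BEGIN TRANSACTION", NULL, NULL, NULL ), SQLITE_OK, "Failed to begin transaction" );
  for (int issueid = 0; issueid < 800; ++issueid)
  {
    check( sqlite3_bind_int( ins, 1, issueid ), SQLITE_OK, "Failed to bind issueid" );
    check( sqlite3_bind_int( ins, 2, 1234567890 ), SQLITE_OK, "Failed to bind time" );  // illustrative timestamp
    check( sqlite3_bind_double( ins, 3, static_cast<double>(rand()) / RAND_MAX ), SQLITE_OK, "Failed to bind value" );
    check( sqlite3_step( ins ), SQLITE_DONE, "Failed to execute insert" );
    check( sqlite3_reset( ins ), SQLITE_OK, "Failed to reset statement" );  // statement is ready for the next row
  }
  check( sqlite3_exec( db, "COMMIT", NULL, NULL, NULL ), SQLITE_OK, "Failed to commit" );

  check( sqlite3_finalize( ins ), SQLITE_OK, "Failed to finalize insert" );
  check( sqlite3_close( db ), SQLITE_OK, "Failed to close db" );
  return 0;
}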
回答by Brian O'Kennedy
Answering my own question just as a place to put some details:
回答我自己的问题只是作为放置一些细节的地方:
It turns out (as correctly suggested above) that the index creation is the slow step, and every time I do another transaction of inserts, the index is updated, which takes some time. My solution is to: (A) create the data table, (B) insert all my historical data (several years' worth), (C) create the indexes.
事实证明(正如上面正确指出的那样),创建索引才是慢的那一步:每执行一批插入事务,索引都会被更新,而这需要一些时间。我的解决方案是:(A)创建数据表;(B)插入我所有的历史数据(好几年的数据);(C)创建索引。
Now all lookups etc. are really fast and sqlite does a great job. Subsequent daily updates now take a few seconds to insert only 800 records, but that is no problem since it only runs every 10 minutes or so.
现在所有查找等操作都非常快,sqlite 表现得很好。后续的日常更新插入区区 800 条记录也要花上几秒钟,但这不成问题,因为它每 10 分钟左右才运行一次。
Thanks to Robert Harvey and maxwellb for the help/suggestions/answers above.
感谢 Robert Harvey 和 maxwellb 提供的帮助/建议/答案。
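A minimal sketch (not part of the original answer) of the load-then-index order described in this answer: (A) create the bare table, (B) bulk-insert the historical rows, (C) build the indexes only at the end. The index names here are placeholders.

#include <iostream>
#include <sqlite3.h>

int main()
{
  sqlite3 *db = NULL;
  if (sqlite3_open("/tmp/junk.db", &db) != SQLITE_OK) { std::cerr << "Failed to open db" << std::endl; return 1; }

  // (A) Create the bare table, with no indexes yet.
  sqlite3_exec( db, "CREATE TABLE IF NOT EXISTS data (id1 INTEGER, timet INTEGER, value REAL);", NULL, NULL, NULL );

  // (B) Bulk-insert all of the historical rows here, batched in transactions
  //     (see the prepared-statement sketch in the answer above).

  // (C) Only now build the indexes, once, over the fully loaded table.
  sqlite3_exec( db, "CREATE INDEX IF NOT EXISTS idx_data_id1 ON data (id1);", NULL, NULL, NULL );
  sqlite3_exec( db, "CREATE INDEX IF NOT EXISTS idx_data_timet ON data (timet);", NULL, NULL, NULL );

  sqlite3_close( db );
  return 0;
}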
回答by Robert Harvey
Since we know that capturing your data is fast when there is no index on the table, what might actually work is this:
由于我们知道当表上没有索引时捕获数据的速度很快,因此实际可行的是:
1. Capture the 800 values in a temporary table with no index.
2. Copy the records to the master table (containing indexes) using the form of INSERT INTO that takes a SELECT statement.
3. Delete the records from the temporary table.
1. 在没有索引的临时表中捕获这 800 个值。
2. 使用带 SELECT 语句的 INSERT INTO 形式将记录复制到主表(包含索引)。
3. 从临时表中删除这些记录。
This technique is based on the theory that the INSERT INTO that takes a SELECT statement is faster than executing individual INSERTs.
此技术基于以下理论:采用 SELECT 语句的 INSERT INTO 比执行单个 INSERT 更快。
If it still proves to be a bit slow, Step 2 can be executed in the background using the Asynchronous Module. This takes advantage of the bits of downtime between captures.
如果第 2 步仍然有点慢,可以利用异步模块让它在后台执行。这样就能利用两次捕获之间的空闲时间。
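A rough sketch of that staging-table idea, under the assumption of a staging table named capture with the same columns as data; the table and function names are illustrative, not from the answer. The whole batch moves to the indexed master table in a single INSERT INTO ... SELECT.

#include <iostream>
#include <sqlite3.h>

// Illustrative end-of-batch routine: the 800 captured rows sit in the unindexed
// staging table "capture"; copy them into the indexed master table "data" in one
// statement, then clear the staging table for the next capture.
static void flush_capture( sqlite3 *db )
{
  sqlite3_exec( db, "BEGIN TRANSACTION", NULL, NULL, NULL );
  sqlite3_exec( db, "INSERT INTO data (id1, timet, value) SELECT id1, timet, value FROM capture;", NULL, NULL, NULL );
  sqlite3_exec( db, "DELETE FROM capture;", NULL, NULL, NULL );
  sqlite3_exec( db, "COMMIT", NULL, NULL, NULL );
}

int main()
{
  sqlite3 *db = NULL;
  if (sqlite3_open("/tmp/junk.db", &db) != SQLITE_OK) { std::cerr << "Failed to open db" << std::endl; return 1; }
  sqlite3_exec( db, "CREATE TABLE IF NOT EXISTS data (id1 INTEGER, timet INTEGER, value REAL);", NULL, NULL, NULL );
  sqlite3_exec( db, "CREATE TABLE IF NOT EXISTS capture (id1 INTEGER, timet INTEGER, value REAL);", NULL, NULL, NULL );
  // ... insert the latest 800 rows into capture here ...
  flush_capture( db );
  sqlite3_close( db );
  return 0;
}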
回答by maxwellb
Consider using a table for new inserts of the given day, without an index. Then, at the end of each day, run a script which will:
考虑在没有索引的情况下为给定日期的新插入使用一个表。然后,在每天结束时,运行一个脚本,该脚本将:
- Insert new values from new_table into master_table
- Clear the new_table for next day of processing
- 将 new_table 中的新值插入到 master_table 中
- 清除 new_table 以便第二天处理
If you can do lookups on historical data in O(log n), and lookups on today's data in O(n), this should provide a nice compromise.
如果您可以在 O(log n) 中查找历史数据,并在 O(n) 中查找今天的数据,这应该是一个很好的折衷方案。
回答by joe snyder
I can't tell from your specs, but if the ID field is always increasing, and the time field includes YYYYMMDD for uniqueness and is also always increasing, and you're doing either ID searches or time searches, then the simplest non-database solution would be to simply append all records to a fixed-field text or binary file (since they're being generated in "sorted" order) and use code to do a binary search for the desired records (e.g., find the first record with the ID or time of interest, then sequentially step through the desired range).
我无法从您的描述中确定,但如果 ID 字段总是递增,时间字段为保证唯一性而包含 YYYYMMDD 并且也总是递增,而您做的又是按 ID 或按时间的查找,那么最简单的非数据库方案就是把所有记录追加到一个定长字段的文本或二进制文件中(因为它们本来就是按“排序”的顺序生成的),然后用代码对所需记录做二分查找(例如,先找到具有目标 ID 或时间的第一条记录,再顺序遍历所需的范围)。
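A rough sketch (not from the answer) of that fixed-width-record approach, assuming records are appended to a binary file in increasing timet order; the struct layout and file name are purely illustrative.

#include <cstdio>
#include <iostream>

// Illustrative fixed-size record, appended to the file in "sorted" (timet) order.
struct Record
{
  int   id1;
  int   timet;
  float value;
};

// Binary-search the file for the first record with timet >= target.
// Returns its index, or the record count if no record qualifies.
static long lower_bound_by_time( std::FILE *f, int target )
{
  std::fseek( f, 0, SEEK_END );
  long count = std::ftell( f ) / static_cast<long>(sizeof(Record));

  long lo = 0, hi = count;
  while (lo < hi)
  {
    long mid = lo + (hi - lo) / 2;
    Record rec;
    std::fseek( f, mid * static_cast<long>(sizeof(Record)), SEEK_SET );
    if (std::fread( &rec, sizeof(Record), 1, f ) != 1) break;
    if (rec.timet < target) lo = mid + 1; else hi = mid;
  }
  return lo;
}

int main()
{
  std::FILE *f = std::fopen( "/tmp/records.bin", "rb" );
  if (!f) { std::cerr << "no data file yet" << std::endl; return 1; }

  // Find the first record at or after some time of interest, then step
  // forward sequentially through the desired range.
  long idx = lower_bound_by_time( f, 1234567890 );
  std::cout << "first matching record index: " << idx << std::endl;

  std::fclose( f );
  return 0;
}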
回答by Phil Goetz
When building large SQLite databases, always insert as much of the data as you can before creating the indices. That will run many times faster than if you create the indices before inserting the data.
在构建大型 SQLite 数据库时,始终在创建索引之前插入尽可能多的数据。这将比在插入数据之前创建索引快很多倍。
回答by Dwivedi Ji
The theoretical maximum number of rows in a table is 2^64 (18446744073709551616, or about 1.8e+19). This limit is unreachable since the maximum database size of 140 terabytes will be reached first. A 140-terabyte database can hold no more than approximately 1e+13 rows, and then only if there are no indices and if each row contains very little data.
表中的理论最大行数为 2^64(18446744073709551616,约 1.8e+19)。这个上限实际上达不到,因为会先达到 140 TB 的最大数据库大小。一个 140 TB 的数据库最多只能容纳大约 1e+13 行,而且这还要求没有索引、每行只包含极少的数据。