SQL 提高查询速度:大postgres表中的简单SELECT
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/13234812/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
Improving query speed: simple SELECT in big postgres table
提问by alexdemartos
I'm having trouble regarding speed in a SELECT query on a Postgres database.
我在 Postgres 数据库上的 SELECT 查询速度方面遇到问题。
I have a table with two integer columns as key: (int1,int2) This table has around 70 million rows.
我有一个以两个整数列作为键的表:(int1,int2) 这个表有大约 7000 万行。
I need to make two kinds of simple SELECT queries in this environment:
我需要在这种环境中进行两种简单的 SELECT 查询:
SELECT * FROM table WHERE int1=X;
SELECT * FROM table WHERE int2=X;
These two selects each return around 10,000 rows out of the 70 million. For this to work as fast as possible I thought of using two HASH indexes, one for each column. Unfortunately the results are not that good:
这两个查询各自从这 7000 万行中返回大约 10,000 行。为了让查询尽可能快,我考虑使用两个 HASH 索引,每列一个。不幸的是,结果并不理想:
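For context, a minimal sketch of the setup described (the table and index names are taken from the EXPLAIN output below; the float payload columns are an assumption based on the answers further down):
作为背景,下面是所述场景的一个最小示意(表名和索引名取自下文的执行计划;两个浮点列是根据后面的回答推测的):

```sql
-- Hypothetical schema matching the question. lec_sim / lec2_id appear
-- in the EXPLAIN output; float1/float2 come up in the answers below.
CREATE TABLE lec_sim (
    lec1_id integer,
    lec2_id integer,
    float1  real,
    float2  real
);

-- One hash index per lookup column, as attempted here.
CREATE INDEX lec_sim_lec1_hash_ind ON lec_sim USING hash (lec1_id);
CREATE INDEX lec_sim_lec2_hash_ind ON lec_sim USING hash (lec2_id);
```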
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on lec_sim (cost=232.21..25054.38 rows=6565 width=36) (actual time=14.759..23339.545 rows=7871 loops=1)
Recheck Cond: (lec2_id = 11782)
-> Bitmap Index Scan on lec_sim_lec2_hash_ind (cost=0.00..230.56 rows=6565 width=0) (actual time=13.495..13.495 rows=7871 loops=1)
Index Cond: (lec2_id = 11782)
Total runtime: 23342.534 ms
(5 rows)
This is an EXPLAIN ANALYZE example of one of these queries. It is taking around 23 seconds. My expectations are to get this information in less than a second.
这是这些查询之一的 EXPLAIN ANALYZE 示例。大约需要 23 秒。我的期望是在不到一秒钟的时间内获得这些信息。
These are some parameters of the postgres db config:
这些是 postgres db 配置的一些参数:
work_mem = 128MB
shared_buffers = 2GB
maintenance_work_mem = 512MB
fsync = off
synchronous_commit = off
effective_cache_size = 4GB
Any help, comment or thought would be really appreciated.
任何帮助、评论或想法将不胜感激。
Thank you in advance.
先感谢您。
回答by willglynn
Extracting my comments into an answer: the index lookup here was very fast -- all the time was spent retrieving the actual rows. 23 seconds / 7871 rows = 2.9 milliseconds per row, which is reasonable for retrieving data that's scattered across the disk subsystem. Seeks are slow; you can a) fit your dataset in RAM, b) buy SSDs, or c) organize your data ahead of time to minimize seeks.
将我的评论提取成一个答案:这里的索引查找非常快——所有时间都花在了检索实际行上。23 秒 / 7871 行 = 每行约 2.9 毫秒,对于检索分散在磁盘子系统各处的数据来说,这是合理的。磁盘寻道很慢;您可以 a) 把数据集放进 RAM,b) 购买 SSD,或 c) 提前组织数据以尽量减少寻道。
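The per-row figure above is straightforward arithmetic; a quick check in Python (using the rounded 23-second runtime quoted in the answer):
上面的每行耗时只是简单的算术;用 Python 快速验证一下(使用回答中取整的 23 秒运行时间):

```python
# Back-of-the-envelope check of the per-row retrieval cost quoted above.
total_runtime_ms = 23_000  # ~23 seconds, as quoted in the answer
rows_returned = 7871       # actual rows from the EXPLAIN ANALYZE output

ms_per_row = total_runtime_ms / rows_returned
print(f"{ms_per_row:.1f} ms per row")  # -> 2.9 ms per row
```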
PostgreSQL 9.2 has a feature called index-only scans that allows it to (usually) answer queries without accessing the table. You can combine this with the btree index property of automatically maintaining order to make this query fast. You mention int1, int2, and two floats:
PostgreSQL 9.2 有一个称为仅索引扫描(index-only scans)的功能,允许它(通常)在不访问表的情况下回答查询。您可以将其与 btree 索引自动维护顺序的特性结合起来,加快此查询的速度。您提到了 int1、int2 和两个浮点数:
CREATE INDEX sometable_int1_floats_key ON sometable (int1, float1, float2);
CREATE INDEX sometable_int2_floats_key ON sometable (int2, float1, float2);
SELECT float1,float2 FROM sometable WHERE int1=<value>; -- uses int1 index
SELECT float1,float2 FROM sometable WHERE int2=<value>; -- uses int2 index
Note also that this doesn't magically erase the disk seeks, it just moves them from query time to insert time. It also costs you storage space, since you're duplicating the data. Still, this is probably the trade-off you want.
还要注意,这并不会神奇地消除磁盘寻道,只是把它们从查询时移到了插入时。由于数据被复制了一份,它还会占用额外的存储空间。不过,这可能正是您想要的权衡。
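To confirm the covering indexes behave as described, one can inspect the plan (a sketch; note that index-only scans also depend on the visibility map, so a recently VACUUMed table helps):
要确认覆盖索引确实如上所述生效,可以查看执行计划(示意;注意仅索引扫描还依赖可见性映射,因此最好先对表执行 VACUUM):

```sql
VACUUM sometable;  -- refreshes the visibility map so the heap can be skipped

EXPLAIN (ANALYZE, BUFFERS)
SELECT float1, float2 FROM sometable WHERE int1 = 42;
-- Look for "Index Only Scan using sometable_int1_floats_key" in the output.
```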
回答by alexdemartos
Thank you willglynn. As you noticed, the problem was the disk seeks, not the index lookups. You proposed many solutions, like loading the dataset into RAM or buying an SSD. But setting those two aside, since they involve managing things outside the database itself, you proposed two ideas:
谢谢 willglynn。正如您所指出的,问题在于磁盘寻道,而不是索引查找。您提出了许多解决方案,例如将数据集加载到 RAM 中或购买 SSD。但撇开这两个涉及数据库之外事务的方案不谈,您提出了两个想法:
- Reorganize the data to reduce the seeking of the data.
- Use PostgreSQL 9.2 feature "index-only scans"
- 重新组织数据以减少对数据的查找。
- 使用 PostgreSQL 9.2 特性“仅索引扫描”
Since I am on a PostgreSQL 9.1 server, I decided to take option 1.
由于我用的是 PostgreSQL 9.1 服务器,我决定选择方案 1。
I made a copy of the table. So now I have the same table with the same data twice. I created an index for each one, the first one being indexed by (int1) and the second one by (int2). Then I clustered them both (CLUSTER table USING ind_intX) by its respective indexes.
我复制了这张表。所以现在我有两份数据相同的表。我为每个表各创建了一个索引,第一个按 (int1) 索引,第二个按 (int2) 索引。然后我分别按各自的索引对它们进行了聚簇(CLUSTER table USING ind_intX)。
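The steps just described can be sketched like this (table and index names are assumptions based on the plan below):
刚才描述的步骤大致如下(表名和索引名是根据下面的执行计划推测的):

```sql
-- Duplicate the table, one copy per lookup column.
CREATE TABLE lec_sim_lec1id AS TABLE lec_sim;
CREATE TABLE lec_sim_lec2id AS TABLE lec_sim;

CREATE INDEX lec_sim_lec1id_ind ON lec_sim_lec1id (lec1_id);
CREATE INDEX lec_sim_lec2id_ind ON lec_sim_lec2id (lec2_id);

-- Physically rewrite each copy in index order so matching rows sit together.
CLUSTER lec_sim_lec1id USING lec_sim_lec1id_ind;
CLUSTER lec_sim_lec2id USING lec_sim_lec2id_ind;

-- Refresh planner statistics after the rewrite.
ANALYZE lec_sim_lec1id;
ANALYZE lec_sim_lec2id;
```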
I'm posting now an EXPLAIN ANALYZE of the same query, done in one of these clustered tables:
我现在发布相同查询的 EXPLAIN ANALYZE,在这些聚簇表之一中完成:
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------
Index Scan using lec_sim_lec2id_ind on lec_sim_lec2id (cost=0.00..21626.82 rows=6604 width=36) (actual time=0.051..1.500 rows=8119 loops=1)
Index Cond: (lec2_id = 12300)
Total runtime: 1.822 ms
(3 rows)
Now the seeking is really fast. I went down from 23 seconds to ~2 milliseconds, which is an impressive improvement. I think this problem is solved for me, I hope this might be useful also for others experiencing the same problem.
现在寻道真的很快。查询从 23 秒降到了约 2 毫秒,这是一个令人印象深刻的改进。我认为这个问题对我来说已经解决了,希望这对遇到同样问题的其他人也有用。
Thank you so much willglynn.
非常感谢willglynn。
回答by Robert Casey
I had a case of super slow queries where simple one to many joins (in PG v9.1) were performed between a table that was 33 million rows to a child table that was 2.4 billion rows in size. I performed a CLUSTER on the foreign key index for the child table, but found that this didn't solve my problem with query timeouts, for even the simplest of queries. Running ANALYZE also did not solve the issue.
我有一个超慢查询的案例,其中简单的一对多连接(在 PG v9.1 中)在一个 3300 万行的表和一个 24 亿行的子表之间执行。我对子表的外键索引执行了 CLUSTER,但发现这并没有解决我的查询超时问题,即使是最简单的查询。运行 ANALYZE 也没有解决问题。
What made a huge difference was performing a manual VACUUM on both the parent table and the child table. Even as the parent table was completing its VACUUM process, I went from 10 minute timeouts to results coming back in one second.
产生巨大差异的是在父表和子表上手动执行 VACUUM。甚至在父表的 VACUUM 还在进行时,查询就从 10 分钟超时变成了一秒内返回结果。
What I am taking away from this is that regular VACUUM operations are still critical, even for v9.1. The reason I did this was that I noticed autovacuum hadn't run on either of the tables for at least two weeks, and lots of upserts and inserts had occurred since then. It may be that I need to improve the autovacuum trigger to take care of this issue going forward, but what I can say is that a 640GB table with a couple of billion rows does perform well if everything is cleaned up. I haven't yet had to partition the table to get good performance.
我从中得出的结论是,即使在 v9.1 上,定期的 VACUUM 操作仍然至关重要。我这样做的原因是,我注意到 autovacuum 至少有两周没有在这两个表的任何一个上运行过,而在此期间发生了大量 upsert 和插入。今后我可能需要调整 autovacuum 的触发条件来解决这个问题,但我可以说的是,只要清理得当,一个 640GB、二十多亿行的表确实可以表现良好。我还没有为了获得良好性能而对表进行分区。
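A sketch of the manual cleanup plus one way to make autovacuum fire sooner on large tables (the table names are placeholders; the scale factors are illustrative, not tested values from this answer):
手动清理以及让 autovacuum 在大表上更早触发的一种做法的示意(表名为占位符,比例系数仅作说明,并非本回答中实测的数值):

```sql
-- Manual cleanup, as described above.
VACUUM ANALYZE parent_table;
VACUUM ANALYZE child_table;

-- By default autovacuum waits for roughly 20% of rows to change; on a
-- billion-row table that is far too long. Lower the per-table threshold:
ALTER TABLE child_table SET (
    autovacuum_vacuum_scale_factor  = 0.01,
    autovacuum_analyze_scale_factor = 0.005
);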
回答by Nick Woodhams
For a very simple and effective one liner, if you have fast solid-state storage on your postgres machine, try setting:
对于一个非常简单而有效的一行改动:如果您的 postgres 机器使用快速的固态存储,请尝试设置:
random_page_cost=1.0
in your postgresql.conf.
写在你的 postgresql.conf 中。
The default is random_page_cost = 4.0, which is optimized for storage with high seek times like old spinning disks. This change alters the cost calculation for seeks and relies less on your memory (which could ultimately be going to swap anyway).
默认值是 random_page_cost = 4.0,它是针对旧机械磁盘这类寻道时间较长的存储优化的。这个改动会改变寻道的成本估算,并减少对内存的依赖(反正内存最终也可能被换出)。
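One way to try the setting per-session before persisting it (ALTER SYSTEM requires PostgreSQL 9.4+; hand-editing postgresql.conf works on any version):
在持久化之前,可以先在会话级别试用该设置(ALTER SYSTEM 需要 PostgreSQL 9.4+;直接编辑 postgresql.conf 适用于任何版本):

```sql
-- Session-level trial: re-run the slow query and compare EXPLAIN plans.
SET random_page_cost = 1.0;

-- Persist it without hand-editing postgresql.conf (9.4+):
ALTER SYSTEM SET random_page_cost = 1.0;
SELECT pg_reload_conf();
```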
This setting alone improved my filtering query from 8 seconds down to 2 seconds on a long table with a couple million records.
仅此一项设置,就让我在一个几百万条记录的长表上的过滤查询从 8 秒缩短到了 2 秒。
The other major improvement came from creating indexes on all of the boolean columns on my table. This reduced the 2 second query to about 1 second. Check @willglynn's answer for that.
另一个主要改进来自为我表上所有的 boolean 列创建索引。这将 2 秒的查询进一步减少到大约 1 秒。详见 @willglynn 的回答。
Hope this helps!
希望这可以帮助!