PostgreSQL: speed up SELECT query in table with millions of rows

Disclaimer: this page is a translation of a popular StackOverflow question. It is provided under the CC BY-SA 4.0 license; you are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/45320455/
Asked by J0ANMM
I have a table with more than 4.5 million rows, and my SELECT query is far too slow for my needs.
The table is created with:
CREATE TABLE all_legs (
carrier TEXT,
dep_hub TEXT,
arr_hub TEXT,
dep_dt TIMESTAMP WITH TIME ZONE,
arr_dt TIMESTAMP WITH TIME ZONE,
price_ct INTEGER,
... 5 more cols ...,
PRIMARY KEY (carrier, dep_hub, arr_hub, dep_dt, arr_dt, ...3 other cols...)
)
When I want to SELECT all rows for a certain date, the query is too slow: it takes between 12 and 20 seconds. My aim is for it to take at most 1 second. I expect the query to return between 0.1% and 1% of the rows contained in the table.
The query is quite simple:
SELECT * FROM all_legs WHERE dep_dt::date = '2017-08-15' ORDER BY price_ct ASC
EXPLAIN ANALYZE returns:
Sort (cost=197154.69..197212.14 rows=22982 width=696) (actual time=14857.300..14890.565 rows=31074 loops=1)
Sort Key: price_ct
Sort Method: external merge Disk: 5256kB
-> Seq Scan on all_legs (cost=0.00..188419.85 rows=22982 width=696) (actual time=196.738..14581.143 rows=31074 loops=1)
Filter: ((dep_dt)::date = '2017-08-15'::date)
Rows Removed by Filter: 4565249
Planning time: 0.572 ms
Execution time: 14908.274 ms
Note: I only learned about this command yesterday, so I am not yet able to fully understand everything it returns.
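As a quick cross-check, the row counts in the plan above line up with the stated expectation of returning between 0.1% and 1% of the table:

```python
# Selectivity implied by the EXPLAIN ANALYZE numbers above.
matched = 31074          # rows returned by the Seq Scan
removed = 4565249        # "Rows Removed by Filter"
total = matched + removed
selectivity = matched / total
print(f"{selectivity:.2%} of {total:,} rows match")  # about 0.68%, within 0.1%-1%
```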
I have tried using index-only scans, as suggested here, by running the command CREATE INDEX idx_all_legs ON all_legs(dep_dt); but I did not notice any difference in running time. I also tried creating an index on all columns, as I want all columns returned.
Another thought was to sort all rows by dep_dt, so that the search for all rows fulfilling the condition would be much faster, as they would not be scattered. Unfortunately, I don't know how to implement this.
Is there a way to make it as fast as I am aiming for?
Solution
As suggested in Laurenz's answer, adding the index CREATE INDEX IF NOT EXISTS idx_dep_dt_price ON all_legs(dep_dt, price_ct); and changing the condition in the SELECT to WHERE dep_dt >= '2017-08-15 00:00:00' AND dep_dt < '2017-08-16 00:00:00' has reduced the running time to a quarter. Even though that is a very good improvement, it still means running times between 2 and 6 seconds.
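This fix can be illustrated outside PostgreSQL. The following is a minimal sketch using SQLite (a stand-in chosen only because it is easy to script; the table is stripped down to three columns and timestamps are plain text, since SQLite has no TIMESTAMP WITH TIME ZONE). It shows the half-open range predicate being answered from the index:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE all_legs (carrier TEXT, dep_dt TEXT, price_ct INTEGER);
    CREATE INDEX idx_dep_dt_price ON all_legs(dep_dt, price_ct);
""")
# The rewritten, index-friendly predicate: a half-open range on the raw column.
plan_rows = conn.execute("""
    EXPLAIN QUERY PLAN
    SELECT * FROM all_legs
    WHERE dep_dt >= '2017-08-15 00:00:00' AND dep_dt < '2017-08-16 00:00:00'
    ORDER BY price_ct
""").fetchall()
detail = " ".join(row[-1] for row in plan_rows)
print(detail)  # an index SEARCH, not a full table SCAN
```

Note that the plan will typically still show a separate sort step for ORDER BY price_ct, since a range over the first index column cannot also deliver price_ct order; that mirrors why the remaining 2-6 seconds are largely sort cost.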
Any additional ideas to reduce the running time even further would be appreciated.
Answered by Laurenz Albe
The index won't help.
Two solutions:
You could either change the query to:
WHERE dep_dt >= '2017-08-15 00:00:00' AND dep_dt < '2017-08-16 00:00:00'
Then the index can be used.
Or create an index on an expression:
CREATE INDEX ON all_legs(((dep_dt AT TIME ZONE 'UTC')::date));
(or a different time zone) and change the query to
WHERE (dep_dt AT TIME ZONE 'UTC')::date = '2017-08-16'
The AT TIME ZONE is necessary because otherwise the result of the cast would depend on your current TimeZone setting.
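SQLite also supports indexes on expressions, so the second option can be sketched there as well (date() stands in for the (dep_dt AT TIME ZONE 'UTC')::date expression, as SQLite has no time zones; the names are illustrative, and this is not PostgreSQL syntax):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE all_legs (carrier TEXT, dep_dt TEXT, price_ct INTEGER);
    -- expression index: the analogue of indexing (dep_dt AT TIME ZONE 'UTC')::date
    CREATE INDEX idx_dep_date ON all_legs(date(dep_dt));
""")
# The WHERE clause repeats the indexed expression verbatim, so the index applies.
plan_rows = conn.execute("""
    EXPLAIN QUERY PLAN
    SELECT * FROM all_legs WHERE date(dep_dt) = '2017-08-16'
""").fetchall()
detail = " ".join(row[-1] for row in plan_rows)
print(detail)
```

As in PostgreSQL, the expression in the query has to match the indexed expression for the index to be considered.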
The first solution is simpler, but the second has the advantage that you can add price_ct to the index like this:
CREATE INDEX ON all_legs(((dep_dt AT TIME ZONE 'UTC')::date), price_ct);
Then you don't need a sort any more, and your query will be as fast as it can theoretically get.
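A minimal SQLite sketch of this combined index (again with date() standing in for the ::date expression; illustrative only) shows the equality term and the sort column working together:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE all_legs (carrier TEXT, dep_dt TEXT, price_ct INTEGER);
    -- analogue of: CREATE INDEX ON all_legs(((dep_dt AT TIME ZONE 'UTC')::date), price_ct)
    CREATE INDEX idx_date_price ON all_legs(date(dep_dt), price_ct);
""")
plan_rows = conn.execute("""
    EXPLAIN QUERY PLAN
    SELECT * FROM all_legs
    WHERE date(dep_dt) = '2017-08-15'
    ORDER BY price_ct
""").fetchall()
detail = " ".join(row[-1] for row in plan_rows)
# With equality on the first index term, rows should already come out in
# price_ct order, so no separate sort step should be needed.
print(detail)
```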
Answered by peufeu
The index does not help because you use
WHERE dep_dt::date = constant
This seems fine to a beginner, but to the database, it looks like:
WHERE convert_timestamp_to_date(dep_dt) = constant
With convert_timestamp_to_date() being an arbitrary function (I just made up the name; don't look it up in the docs). In order to use the index on dep_dt, the database would have to invert convert_timestamp_to_date into something like convert_date_to_timestamp_range (because a date corresponds to a range of timestamps, not just one timestamp), and then rewrite the WHERE clause as Laurenz did.
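This point is easy to reproduce. In the SQLite sketch below (a stand-in for PostgreSQL, with text timestamps), the same predicate is written both ways against a plain index on dep_dt:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE all_legs (carrier TEXT, dep_dt TEXT, price_ct INTEGER);
    CREATE INDEX idx_dep_dt ON all_legs(dep_dt);  -- plain index on the raw column
""")

def plan(sql):
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " ".join(r[-1] for r in rows)

# A function wrapped around the column hides it from the index: full table scan.
wrapped = plan("SELECT * FROM all_legs WHERE date(dep_dt) = '2017-08-15'")
# The equivalent half-open range is sargable: index search.
ranged = plan("SELECT * FROM all_legs "
              "WHERE dep_dt >= '2017-08-15' AND dep_dt < '2017-08-16'")
print(wrapped)
print(ranged)
```

The planner does not invert date() for us; rewriting the predicate by hand is exactly the "it's your job" part.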
Since there are many such functions, the database developers didn't bother to maintain a huge table of how to invert them. Also it would only help for special cases. For example, if you specified a date range in your WHERE instead of a "=constant" then it would be yet another special case. It's your job to handle this ;)
Also, an index on (dep_dt, price_ct) won't speed up the sort, as the first column is a timestamp, so the rows are not ordered in the index the way you want. You'd need an index on (dep_dt::date, price_ct) to eliminate the sort.
Now, which index to create? This depends...
If you also use timestamp range queries like "WHERE dep_dt BETWEEN ... AND ...", then the index on dep_dt needs to be on the original timestamp type. In that case, creating another index on the same column converted to a date would be unnecessary (all indexes have to be updated on writes, so unnecessary indexes slow down inserts and updates). However, if you use the index on (dep_dt::date, price_ct) lots and lots of times and eliminating the sort is really important for you, then it may make sense. It's a tradeoff.
Answered by Adam Tokarski
- The first thing you should change here is to remove the composite primary key and use a plain single-column one instead. If you are going to use a column index, it works best with something like a single integer column that acts as a spine and lets the index fetch the rows you need quickly. With an index as big as the one in your example, the planner may decide that scanning the whole table is faster.
- Even if your index were good enough to be used by the planner, it may be dropped because of the ordering. I say "may be" because, as with many things in SQL, it depends on the actual data in the table, the statistics, and so on. I'm not sure about Postgres specifics, but you may want to try building another index on the column used in ORDER BY, or even a composite index on (dep_dt, price_ct). You could also try putting dep_dt in the ORDER BY list to give the planner a hint.
- Do you need all columns from this table? Using * vs id (for example) can also have an impact here.
- How many unique values do you have in the dep_dt column? Sometimes the planner decides that scanning the whole table is more effective than using an index because there are many non-unique values.
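The third point (SELECT * vs a narrow column list) can be sketched in SQLite as an illustrative stand-in: with an index on (dep_dt, price_ct), a query that only needs indexed columns can be answered from the index alone, without touching the table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE all_legs (carrier TEXT, dep_dt TEXT, price_ct INTEGER);
    CREATE INDEX idx_dep_dt_price ON all_legs(dep_dt, price_ct);
""")

def plan(sql):
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " ".join(r[-1] for r in rows)

where = "WHERE dep_dt >= '2017-08-15' AND dep_dt < '2017-08-16'"
star = plan("SELECT * FROM all_legs " + where)           # must fetch table rows too
narrow = plan("SELECT price_ct FROM all_legs " + where)  # answered from the index
print(star)
print(narrow)
```

PostgreSQL's index-only scans work on the same principle, with the extra caveat that the visibility map must be reasonably up to date (i.e. the table has been vacuumed) for them to pay off.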
In summary, SQL querying is an art of experimenting, as everything depends on the current data (the planner uses statistics built by the analyzer to guess the optimal query plan). It may even happen that a query you have tuned on a table with thousands of rows stops performing well once you reach millions.