PostgreSQL Simple JOIN 很慢
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/14793708/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
PostgreSQL Simple JOIN very slow
提问by MichaelM
I have a simple query, and two tables:
我有一个简单的查询和两个表:
drilldown
drilldown
CREATE SEQUENCE drilldown_id_seq;
CREATE TABLE drilldown (
transactionid bigint NOT NULL DEFAULT nextval('drilldown_id_seq'),
userid bigint NOT NULL default 0 REFERENCES users(id),
pathid bigint NOT NULL default 0,
reqms bigint NOT NULL default 0,
quems bigint NOT NULL default 0,
clicktime timestamp default current_timestamp,
PRIMARY KEY(transactionid)
);
ALTER SEQUENCE drilldown_id_seq OWNED BY drilldown.transactionid;
CREATE INDEX drilldown_idx1 ON drilldown (clicktime);
querystats
querystats
CREATE SEQUENCE querystats_id_seq;
CREATE TABLE querystats (
id bigint NOT NULL DEFAULT nextval('querystats_id_seq'),
transactionid bigint NOT NULL default 0 REFERENCES drilldown(transactionid),
querynameid bigint NOT NULL default 0 REFERENCES queryname(id),
queryms bigint NOT NULL default 0,
PRIMARY KEY(id)
);
ALTER SEQUENCE querystats_id_seq OWNED BY querystats.id;
CREATE INDEX querystats_idx1 ON querystats (transactionid);
CREATE INDEX querystats_idx2 ON querystats (querynameid);
drilldown
has 1.5 million records, and querystats
has 10 million records; the problem happens when I to a join between the two.
drilldown
有150万条记录,querystats
有1000万条记录;当我在两者之间加入时会出现问题。
QUERY
询问
explain analyse
select avg(qs.queryms)
from querystats qs
join drilldown d on (qs.transactionid=d.transactionid)
where querynameid=1;
QUERY PLAN
查询计划
Aggregate (cost=528596.96..528596.97 rows=1 width=8) (actual time=5213.154..5213.154 rows=1 loops=1) -> Hash Join (cost=274072.53..518367.59 rows=4091746 width=8) (actual time=844.087..3528.788 rows=4117717 loops=1) Hash Cond: (qs.transactionid = d.transactionid) -> Bitmap Heap Scan on querystats qs (cost=88732.62..210990.44 rows=4091746 width=16) (actual time=309.502..1321.029 rows=4117717 loops=1) Recheck Cond: (querynameid = 1) -> Bitmap Index Scan on querystats_idx2 (cost=0.00..87709.68 rows=4091746 width=0) (actual time=307.916..307.916 rows=4117718 loops=1) Index Cond: (querynameid = 1) -> Hash (cost=162842.29..162842.29 rows=1371250 width=8) (actual time=534.065..534.065 rows=1372574 loops=1) Buckets: 4096 Batches: 64 Memory Usage: 850kB -> Index Scan using drilldown_pkey on drilldown d (cost=0.00..162842.29 rows=1371250 width=8) (actual time=0.015..364.657 rows=1372574 loops=1) Total runtime: 5213.205 ms (11 rows)
I know there are some tuning parameters I can adjust for PostgreSQL, but what I want to know is the query I am doing the most optimal way of joing the two tables?
我知道我可以为 PostgreSQL 调整一些调整参数,但我想知道的是我正在做的查询是连接两个表的最佳方式吗?
Or maybe some sort of INNER JOIN? I'm just not sure.
或者也许是某种内部连接?我只是不确定。
Any pointers are appreciated!
任何指针表示赞赏!
EDIT
编辑
database#\d drilldown
Table "public.drilldown"
Column | Type | Modifiers
---------------+-----------------------------+--------------------------------------------------------
transactionid | bigint | not null default nextval('drilldown_id_seq'::regclass)
userid | bigint | not null default 0
pathid | bigint | not null default 0
reqms | bigint | not null default 0
quems | bigint | not null default 0
clicktime | timestamp without time zone | default now()
Indexes:
"drilldown_pkey" PRIMARY KEY, btree (transactionid)
"drilldown_idx1" btree (clicktime)
Foreign-key constraints:
"drilldown_userid_fkey" FOREIGN KEY (userid) REFERENCES users(id)
Referenced by:
TABLE "querystats" CONSTRAINT "querystats_transactionid_fkey" FOREIGN KEY (transactionid) REFERENCES drilldown(transactionid)
database=# \d querystats
Table "public.querystats"
Column | Type | Modifiers
---------------+--------+---------------------------------------------------------
id | bigint | not null default nextval('querystats_id_seq'::regclass)
transactionid | bigint | not null default 0
querynameid | bigint | not null default 0
queryms | bigint | not null default 0
Indexes:
"querystats_pkey" PRIMARY KEY, btree (id)
"querystats_idx1" btree (transactionid)
"querystats_idx2" btree (querynameid)
Foreign-key constraints:
"querystats_querynameid_fkey" FOREIGN KEY (querynameid) REFERENCES queryname(id)
"querystats_transactionid_fkey" FOREIGN KEY (transactionid) REFERENCES drilldown(transactionid)
So here are the two tables requested and version
所以这里是请求的两个表和版本
PostgreSQL 9.1.7 on x86_64-unknown-linux-gnu, compiled by gcc (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3, 64-bit
So what this query is doing is getting the average from all the rows values of queryms for each query type (querynameid)
所以这个查询正在做的是从每个查询类型(querynameid)的 queryms 的所有行值中获取平均值
name | current_setting | source
----------------------------+----------------------------------+----------------------
application_name | psql | client
client_encoding | UTF8 | client
DateStyle | ISO, MDY | configuration file
default_text_search_config | pg_catalog.english | configuration file
enable_seqscan | off | session
external_pid_file | /var/run/postgresql/9.1-main.pid | configuration file
lc_messages | en_US.UTF-8 | configuration file
lc_monetary | en_US.UTF-8 | configuration file
lc_numeric | en_US.UTF-8 | configuration file
lc_time | en_US.UTF-8 | configuration file
log_line_prefix | %t | configuration file
log_timezone | localtime | environment variable
max_connections | 100 | configuration file
max_stack_depth | 2MB | environment variable
port | 5432 | configuration file
shared_buffers | 24MB | configuration file
ssl | on | configuration file
TimeZone | localtime | environment variable
unix_socket_directory | /var/run/postgresql | configuration file
(19 rows)
I see that enable_seqscan=off, I have not touched any settings, this is a completely default install.
我看到enable_seqscan=off,我没有碰任何设置,这是一个完全默认的安装。
UPDATE
更新
I made some changes from the below comments and here is the results.
我对下面的评论做了一些更改,这是结果。
explain analyse SELECT (SELECT avg(queryms) AS total FROM querystats WHERE querynameid=3) as total FROM querystats qs JOIN drilldown d ON (qs.transactionid=d.transactionid) WHERE qs.querynameid=3 limit 1;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=196775.99..196776.37 rows=1 width=0) (actual time=2320.876..2320.876 rows=1 loops=1)
InitPlan 1 (returns SELECT version();
SELECT name, current_setting(name), source
FROM pg_settings
WHERE source NOT IN ('default', 'override');
)
-> Aggregate (cost=196775.94..196775.99 rows=1 width=8) (actual time=2320.815..2320.815 rows=1 loops=1)
-> Bitmap Heap Scan on querystats (cost=24354.25..189291.69 rows=2993698 width=8) (actual time=226.516..1144.690 rows=2999798 loops=1)
Recheck Cond: (querynameid = 3)
-> Bitmap Index Scan on querystats_idx (cost=0.00..23605.83 rows=2993698 width=0) (actual time=225.119..225.119 rows=2999798 loops=1)
Index Cond: (querynameid = 3)
-> Nested Loop (cost=0.00..1127817.12 rows=2993698 width=0) (actual time=2320.876..2320.876 rows=1 loops=1)
-> Seq Scan on drilldown d (cost=0.00..76745.10 rows=1498798 width=8) (actual time=0.009..0.009 rows=1 loops=1)
-> Index Scan using querystats_idx on querystats qs (cost=0.00..0.60 rows=2 width=8) (actual time=0.045..0.045 rows=1 loops=1)
Index Cond: ((querynameid = 3) AND (transactionid = d.transactionid))
Total runtime: 2320.940 ms
(12 rows)
回答by kgrittn
It's behaving as though you have set enable_seqscan = off
, because it is using an index scan to populate a hash table. Never set any of the planner options off except as a diagnostic step, and if you are showing a plan, please show any options used. This can be run to show a lot of the useful information:
它的行为就像您设置了一样enable_seqscan = off
,因为它使用索引扫描来填充哈希表。除了作为诊断步骤外,切勿关闭任何计划器选项,如果您正在显示计划,请显示使用的任何选项。这可以运行以显示许多有用的信息:
SET seq_page_cost = 0.1;
SET random_page_cost = 0.1;
SET cpu_tuple_cost = 0.05;
SET effective_cache_size = '3GB'; -- actually use shared_buffers plus OS cache
It also helps if you tell us about the runtime environment, especially the amount of RAM on the machine, what your storage system looks like, and the size of the database (or even better, the active data setof frequently referenced data in the database).
如果您告诉我们有关运行时环境的信息,尤其是机器上的 RAM 量、您的存储系统的外观以及数据库的大小(或者甚至更好,数据库中经常引用的数据的活动数据集)也会有所帮助)。
As a rough breakdown, the 5.2 seconds breaks down to:
作为粗略的细分,5.2 秒细分为:
- 1.3 seconds to find the 4,117,717
querystats
rows that match your selection criterion. - 2.3 seconds to randomly match those against
drilldown
records. - 1.6 seconds to pass the 4,117,717 rows and calculate an average.
- 1.3 秒即可找到
querystats
符合您选择标准的 4,117,717行。 - 2.3 秒随机匹配
drilldown
记录。 - 通过 4,117,717 行并计算平均值需要 1.6 秒。
So, even though you seem to have crippled its ability to use the fastest plan, it is taking only 1.26 microseconds (millionths of a second) to locate each row, join it to another, and work it into a calculation of an average. That's not too bad on an absolute basis, but you can almost certainly get a slightly faster plan.
因此,即使您似乎削弱了它使用最快计划的能力,定位每一行、将其连接到另一行并将其用于计算平均值只需要 1.26 微秒(百万分之一秒)。这绝对不是太糟糕,但你几乎可以肯定会得到一个稍微快一点的计划。
First off, if you are using 9.2.x where x is less than 3, upgrade to 9.2.3 immediately. There was a performance regression for some types of plans which was fixed in the recent release which might affect this query. In general, try to stay up-to-date on minor releases(where version number changes past the second dot).
首先,如果您使用的是 9.2.x,其中 x 小于 3,请立即升级到 9.2.3。在最近的版本中修复了某些类型的计划的性能回归,这可能会影响此查询。一般来说,尽量保持最新的次要版本(版本号更改超过第二个点)。
You can test different plans in a single session by setting planning factors on just that connection and running your query (or an EXPLAIN
on it). Try something like this:
您可以在单个会话中测试不同的计划,方法是在该连接上设置计划因素并运行您的查询(或EXPLAIN
在该连接上运行)。尝试这样的事情:
SET enable_seqscan = on;
Make sure that all enable_
settings are on
.
确保所有enable_
设置都是on
.
回答by Erwin Brandstetter
You claim in your question:
您在问题中声称:
I see that enable_seqscan=off, I have not touched any settings, this is a completely default install.
我看到enable_seqscan=off,我没有碰任何设置,这是一个完全默认的安装。
In contrast, the output from pg_settings
tells us:
相反,从的输出pg_settings
告诉我们:
enable_seqscan | off | session
enable_seqscan | 关闭 | 会议
Meaning, that you set enable_seqscan = off
in your session. Something is not adding up here.
意思是,您enable_seqscan = off
在 session 中设置。有什么东西没有在这里加起来。
Run
跑
RESET enable_seqscan;
or
或者
SHOW enable_seqscan;
Assert:
断言:
select avg(qs.queryms)
from querystats qs
join drilldown d
on (qs.transactionid=d.transactionid)
where querynameid=1;
Also, your setting for shared_buffers
is way too lowfor a db with millions of records. 24MB
seems to be the conservative setting of Ubuntu out-of-the-box. You need to edit your configuration files for serious use! I quote the manual:
此外,你的设置shared_buffers
是这样太低了数以百万计的记录的数据库。24MB
似乎是 Ubuntu 开箱即用的保守设置。您需要编辑配置文件才能认真使用!我引用手册:
If you have a dedicated database server with 1GB or more of RAM, a reasonable starting value for shared_buffers is 25% of the memory in your system.
如果您有一个具有 1GB 或更多 RAM 的专用数据库服务器,shared_buffers 的合理起始值是系统内存的 25%。
So edit your postgresql.conf
file to increase the value and reload.
Then try your query again and find out how enable_seqscan
was turned off.
因此,编辑您的postgresql.conf
文件以增加值并重新加载。
然后再次尝试您的查询并找出是如何enable_seqscan
关闭的。
回答by Mike Sherrill 'Cat Recall'
In this query
在这个查询中
select avg(qs.queryms)
from querystats qs
where querynameid=1;
you're not using any of the columns from the table "drilldown". Since the foreign key constraint guarantees there's a row in "drilldown" for every "transactionid" in "querystats", I don't think the join will do anything useful. Unless I've missed something, your query is equivalent to
您没有使用“向下钻取”表中的任何列。由于外键约束保证“querystats”中的每个“transactionid”在“drilldown”中都有一行,我认为连接不会做任何有用的事情。除非我遗漏了什么,否则您的查询相当于
SELECT
(SELECT avg(queryms) FROM querystats WHERE querynameid=1)
FROM querystats qs
JOIN drilldown d ON (qs.transactionid=d.transactionid)
WHERE qs.querynameid=1;
No join at all. As long as there's an index on "querynameid" you should get decent performance.
根本没有加入。只要“querynameid”上有索引,您就应该获得不错的性能。
回答by Brane
When you don't join, avg(qs.queryms)
executes once.
不加入时,avg(qs.queryms)
执行一次。
When you do the join, you are executing avg(qs.queryms)
as many times as there are rows generated by the join.
执行联接时,执行的avg(qs.queryms)
次数与联接生成的行数一样多。
If you're always interested in a single querynameid, try putting avg(qs.queryms)
in a subselect:
如果您总是对单个 querynameid 感兴趣,请尝试放入子avg(qs.queryms)
选择:
-- CREATE SEQUENCE querystats_id_seq;
CREATE TABLE querystats (
-- id bigint NOT NULL DEFAULT nextval('querystats_id_seq'),
transactionid bigint NOT NULL default 0 REFERENCES drilldown(transactionid),
querynameid bigint NOT NULL default 0 REFERENCES queryname(id),
queryms bigint NOT NULL default 0,
PRIMARY KEY(transactionid,querynameid )
);
-- ALTER SEQUENCE querystats_id_seq OWNED BY querystats.id;
--CREATE INDEX querystats_idx1 ON querystats (transactionid);
-- CREATE INDEX querystats_idx2 ON querystats (querynameid);
CREATE UNIQUE INDEX querystats_alt ON querystats (querynameid, transactionid);
回答by wildplasser
The querystats table looks like a fatjunction table to me. In that case: omit the surrogate key, and live on the natural (composite) key (both components already are not NULLable) and add a reversedcomposite index. (the separate indices are useless, the FK constraint generates them automatically for you anyway)
querystats 表对我来说看起来像一个胖联结表。在这种情况下:省略代理键,并使用自然(复合)键(两个组件都已经不是可空的)并添加反向复合索引。(单独的索引没有用,无论如何 FK 约束会自动为您生成它们)
##代码##