PostgreSQL Simple JOIN 很慢

Question

提问by MichaelM

I have a simple query, and two tables:

我有一个简单的查询和两个表：

drilldown

CREATE SEQUENCE drilldown_id_seq;

CREATE TABLE drilldown (
    transactionid bigint NOT NULL DEFAULT nextval('drilldown_id_seq'),
    userid bigint NOT NULL default 0 REFERENCES users(id),
    pathid bigint NOT NULL default 0,
    reqms bigint NOT NULL default 0,
    quems bigint NOT NULL default 0,
    clicktime timestamp default current_timestamp,
    PRIMARY KEY(transactionid)
);

ALTER SEQUENCE drilldown_id_seq OWNED BY drilldown.transactionid;

CREATE INDEX drilldown_idx1 ON drilldown (clicktime);

querystats

CREATE SEQUENCE querystats_id_seq;

CREATE TABLE querystats (
    id bigint NOT NULL DEFAULT nextval('querystats_id_seq'),
    transactionid bigint NOT NULL default 0 REFERENCES drilldown(transactionid),
    querynameid bigint NOT NULL default 0 REFERENCES queryname(id),
    queryms bigint NOT NULL default 0,
    PRIMARY KEY(id)
);

ALTER SEQUENCE querystats_id_seq OWNED BY querystats.id;

CREATE INDEX querystats_idx1 ON querystats (transactionid);
CREATE INDEX querystats_idx2 ON querystats (querynameid);

drilldownhas 1.5 million records, and querystatshas 10 million records; the problem happens when I to a join between the two.

drilldown有150万条记录，querystats有1000万条记录；当我在两者之间加入时会出现问题。

QUERY

询问

explain analyse
select avg(qs.queryms)
  from querystats qs
  join drilldown d on (qs.transactionid=d.transactionid)
  where querynameid=1;

QUERY PLAN

查询计划

Aggregate  (cost=528596.96..528596.97 rows=1 width=8) (actual time=5213.154..5213.154 rows=1 loops=1)
   ->  Hash Join  (cost=274072.53..518367.59 rows=4091746 width=8) (actual time=844.087..3528.788 rows=4117717 loops=1)
         Hash Cond: (qs.transactionid = d.transactionid)
         ->  Bitmap Heap Scan on querystats qs  (cost=88732.62..210990.44 rows=4091746 width=16) (actual time=309.502..1321.029 rows=4117717 loops=1)
               Recheck Cond: (querynameid = 1)
               ->  Bitmap Index Scan on querystats_idx2  (cost=0.00..87709.68 rows=4091746 width=0) (actual time=307.916..307.916 rows=4117718 loops=1)
                     Index Cond: (querynameid = 1)
         ->  Hash  (cost=162842.29..162842.29 rows=1371250 width=8) (actual time=534.065..534.065 rows=1372574 loops=1)
               Buckets: 4096  Batches: 64  Memory Usage: 850kB
               ->  Index Scan using drilldown_pkey on drilldown d  (cost=0.00..162842.29 rows=1371250 width=8) (actual time=0.015..364.657 rows=1372574 loops=1)
 Total runtime: 5213.205 ms
(11 rows)

I know there are some tuning parameters I can adjust for PostgreSQL, but what I want to know is the query I am doing the most optimal way of joing the two tables?

我知道我可以为 PostgreSQL 调整一些调整参数，但我想知道的是我正在做的查询是连接两个表的最佳方式吗？

Or maybe some sort of INNER JOIN? I'm just not sure.

或者也许是某种内部连接？我只是不确定。

Any pointers are appreciated!

任何指针表示赞赏！

EDIT

编辑

database#\d drilldown
                                       Table "public.drilldown"
    Column     |            Type             |                       Modifiers                        
---------------+-----------------------------+--------------------------------------------------------
 transactionid | bigint                      | not null default nextval('drilldown_id_seq'::regclass)
 userid        | bigint                      | not null default 0
 pathid        | bigint                      | not null default 0
 reqms         | bigint                      | not null default 0
 quems         | bigint                      | not null default 0
 clicktime     | timestamp without time zone | default now()
Indexes:
    "drilldown_pkey" PRIMARY KEY, btree (transactionid)
    "drilldown_idx1" btree (clicktime)
Foreign-key constraints:
    "drilldown_userid_fkey" FOREIGN KEY (userid) REFERENCES users(id)
Referenced by:
    TABLE "querystats" CONSTRAINT "querystats_transactionid_fkey" FOREIGN KEY (transactionid) REFERENCES drilldown(transactionid)

database=# \d querystats
                            Table "public.querystats"
    Column     |  Type  |                        Modifiers                        
---------------+--------+---------------------------------------------------------
 id            | bigint | not null default nextval('querystats_id_seq'::regclass)
 transactionid | bigint | not null default 0
 querynameid   | bigint | not null default 0
 queryms       | bigint | not null default 0
Indexes:
    "querystats_pkey" PRIMARY KEY, btree (id)
    "querystats_idx1" btree (transactionid)
    "querystats_idx2" btree (querynameid)
Foreign-key constraints:
    "querystats_querynameid_fkey" FOREIGN KEY (querynameid) REFERENCES queryname(id)
    "querystats_transactionid_fkey" FOREIGN KEY (transactionid) REFERENCES drilldown(transactionid)

So here are the two tables requested and version

所以这里是请求的两个表和版本

PostgreSQL 9.1.7 on x86_64-unknown-linux-gnu, compiled by gcc (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3, 64-bit

So what this query is doing is getting the average from all the rows values of queryms for each query type (querynameid)

所以这个查询正在做的是从每个查询类型（querynameid）的 queryms 的所有行值中获取平均值

            name            |         current_setting          |        source        
----------------------------+----------------------------------+----------------------
 application_name           | psql                             | client
 client_encoding            | UTF8                             | client
 DateStyle                  | ISO, MDY                         | configuration file
 default_text_search_config | pg_catalog.english               | configuration file
 enable_seqscan             | off                              | session
 external_pid_file          | /var/run/postgresql/9.1-main.pid | configuration file
 lc_messages                | en_US.UTF-8                      | configuration file
 lc_monetary                | en_US.UTF-8                      | configuration file
 lc_numeric                 | en_US.UTF-8                      | configuration file
 lc_time                    | en_US.UTF-8                      | configuration file
 log_line_prefix            | %t                               | configuration file
 log_timezone               | localtime                        | environment variable
 max_connections            | 100                              | configuration file
 max_stack_depth            | 2MB                              | environment variable
 port                       | 5432                             | configuration file
 shared_buffers             | 24MB                             | configuration file
 ssl                        | on                               | configuration file
 TimeZone                   | localtime                        | environment variable
 unix_socket_directory      | /var/run/postgresql              | configuration file
(19 rows)

I see that enable_seqscan=off, I have not touched any settings, this is a completely default install.

我看到enable_seqscan=off，我没有碰任何设置，这是一个完全默认的安装。

UPDATE

更新

I made some changes from the below comments and here is the results.

我对下面的评论做了一些更改，这是结果。

explain analyse SELECT (SELECT avg(queryms) AS total FROM querystats WHERE querynameid=3) as total FROM querystats qs JOIN drilldown d ON (qs.transactionid=d.transactionid) WHERE qs.querynameid=3 limit 1;
                                                                       QUERY PLAN                                                                        
---------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=196775.99..196776.37 rows=1 width=0) (actual time=2320.876..2320.876 rows=1 loops=1)
   InitPlan 1 (returns SELECT version();
SELECT name, current_setting(name), source
  FROM pg_settings
  WHERE source NOT IN ('default', 'override');
)
     ->  Aggregate  (cost=196775.94..196775.99 rows=1 width=8) (actual time=2320.815..2320.815 rows=1 loops=1)
           ->  Bitmap Heap Scan on querystats  (cost=24354.25..189291.69 rows=2993698 width=8) (actual time=226.516..1144.690 rows=2999798 loops=1)
                 Recheck Cond: (querynameid = 3)
                 ->  Bitmap Index Scan on querystats_idx  (cost=0.00..23605.83 rows=2993698 width=0) (actual time=225.119..225.119 rows=2999798 loops=1)
                       Index Cond: (querynameid = 3)
   ->  Nested Loop  (cost=0.00..1127817.12 rows=2993698 width=0) (actual time=2320.876..2320.876 rows=1 loops=1)
         ->  Seq Scan on drilldown d  (cost=0.00..76745.10 rows=1498798 width=8) (actual time=0.009..0.009 rows=1 loops=1)
         ->  Index Scan using querystats_idx on querystats qs  (cost=0.00..0.60 rows=2 width=8) (actual time=0.045..0.045 rows=1 loops=1)
               Index Cond: ((querynameid = 3) AND (transactionid = d.transactionid))
 Total runtime: 2320.940 ms
(12 rows)

Answer 1

回答by kgrittn

It's behaving as though you have set enable_seqscan = off, because it is using an index scan to populate a hash table. Never set any of the planner options off except as a diagnostic step, and if you are showing a plan, please show any options used. This can be run to show a lot of the useful information:

它的行为就像您设置了一样enable_seqscan = off，因为它使用索引扫描来填充哈希表。除了作为诊断步骤外，切勿关闭任何计划器选项，如果您正在显示计划，请显示使用的任何选项。这可以运行以显示许多有用的信息：

SET seq_page_cost = 0.1;
SET random_page_cost = 0.1;
SET cpu_tuple_cost = 0.05;
SET effective_cache_size = '3GB'; -- actually use shared_buffers plus OS cache

It also helps if you tell us about the runtime environment, especially the amount of RAM on the machine, what your storage system looks like, and the size of the database (or even better, the active data setof frequently referenced data in the database).

如果您告诉我们有关运行时环境的信息，尤其是机器上的 RAM 量、您的存储系统的外观以及数据库的大小（或者甚至更好，数据库中经常引用的数据的活动数据集）也会有所帮助）。

As a rough breakdown, the 5.2 seconds breaks down to:

作为粗略的细分，5.2 秒细分为：

1.3 seconds to find the 4,117,717 querystatsrows that match your selection criterion.
2.3 seconds to randomly match those against drilldownrecords.
1.6 seconds to pass the 4,117,717 rows and calculate an average.

1.3 秒即可找到querystats符合您选择标准的 4,117,717行。
2.3 秒随机匹配drilldown记录。
通过 4,117,717 行并计算平均值需要 1.6 秒。

So, even though you seem to have crippled its ability to use the fastest plan, it is taking only 1.26 microseconds (millionths of a second) to locate each row, join it to another, and work it into a calculation of an average. That's not too bad on an absolute basis, but you can almost certainly get a slightly faster plan.

因此，即使您似乎削弱了它使用最快计划的能力，定位每一行、将其连接到另一行并将其用于计算平均值只需要 1.26 微秒（百万分之一秒）。这绝对不是太糟糕，但你几乎可以肯定会得到一个稍微快一点的计划。

First off, if you are using 9.2.x where x is less than 3, upgrade to 9.2.3 immediately. There was a performance regression for some types of plans which was fixed in the recent release which might affect this query. In general, try to stay up-to-date on minor releases(where version number changes past the second dot).

首先，如果您使用的是 9.2.x，其中 x 小于 3，请立即升级到 9.2.3。在最近的版本中修复了某些类型的计划的性能回归，这可能会影响此查询。一般来说，尽量保持最新的次要版本（版本号更改超过第二个点）。

You can test different plans in a single session by setting planning factors on just that connection and running your query (or an EXPLAINon it). Try something like this:

您可以在单个会话中测试不同的计划，方法是在该连接上设置计划因素并运行您的查询（或EXPLAIN在该连接上运行）。尝试这样的事情：

SET enable_seqscan = on;

Make sure that all enable_settings are on.

确保所有enable_设置都是on.

Answer 2

回答by Erwin Brandstetter

You claim in your question:

您在问题中声称：

I see that enable_seqscan=off, I have not touched any settings, this is a completely default install.

我看到enable_seqscan=off，我没有碰任何设置，这是一个完全默认的安装。

In contrast, the output from pg_settingstells us:

相反，从的输出pg_settings告诉我们：

enable_seqscan | off | session

enable_seqscan | 关闭 | 会议

Meaning, that you set enable_seqscan = offin your session. Something is not adding up here.

意思是，您enable_seqscan = off在 session 中设置。有什么东西没有在这里加起来。

Run

跑

RESET enable_seqscan;

or

或者

SHOW enable_seqscan;

Assert:

断言：

select avg(qs.queryms) 
from querystats qs 
join drilldown d 
  on (qs.transactionid=d.transactionid) 
where querynameid=1;

Also, your setting for shared_buffersis way too lowfor a db with millions of records. 24MBseems to be the conservative setting of Ubuntu out-of-the-box. You need to edit your configuration files for serious use! I quote the manual:

此外，你的设置shared_buffers是这样太低了数以百万计的记录的数据库。24MB似乎是 Ubuntu 开箱即用的保守设置。您需要编辑配置文件才能认真使用！我引用手册：

If you have a dedicated database server with 1GB or more of RAM, a reasonable starting value for shared_buffers is 25% of the memory in your system.

如果您有一个具有 1GB 或更多 RAM 的专用数据库服务器，shared_buffers 的合理起始值是系统内存的 25%。

So edit your postgresql.conffile to increase the value and reload.
Then try your query again and find out how enable_seqscanwas turned off.

因此，编辑您的postgresql.conf文件以增加值并重新加载。
然后再次尝试您的查询并找出是如何enable_seqscan关闭的。

Answer 3

回答by Mike Sherrill 'Cat Recall'

In this query

在这个查询中

select avg(qs.queryms) 
from querystats qs 
where querynameid=1;

you're not using any of the columns from the table "drilldown". Since the foreign key constraint guarantees there's a row in "drilldown" for every "transactionid" in "querystats", I don't think the join will do anything useful. Unless I've missed something, your query is equivalent to

您没有使用“向下钻取”表中的任何列。由于外键约束保证“querystats”中的每个“transactionid”在“drilldown”中都有一行，我认为连接不会做任何有用的事情。除非我遗漏了什么，否则您的查询相当于

SELECT 
    (SELECT avg(queryms) FROM querystats WHERE querynameid=1) 
FROM querystats qs 
JOIN drilldown d ON (qs.transactionid=d.transactionid) 
WHERE qs.querynameid=1;

No join at all. As long as there's an index on "querynameid" you should get decent performance.

根本没有加入。只要“querynameid”上有索引，您就应该获得不错的性能。

Answer 4

回答by Brane

When you don't join, avg(qs.queryms)executes once.

不加入时，avg(qs.queryms)执行一次。

When you do the join, you are executing avg(qs.queryms)as many times as there are rows generated by the join.

执行联接时，执行的avg(qs.queryms)次数与联接生成的行数一样多。

If you're always interested in a single querynameid, try putting avg(qs.queryms)in a subselect:

如果您总是对单个 querynameid 感兴趣，请尝试放入子avg(qs.queryms)选择：

-- CREATE SEQUENCE querystats_id_seq;

CREATE TABLE querystats (
    -- id bigint NOT NULL DEFAULT nextval('querystats_id_seq'),
    transactionid bigint NOT NULL default 0 REFERENCES drilldown(transactionid),
    querynameid bigint NOT NULL default 0 REFERENCES queryname(id),
    queryms bigint NOT NULL default 0,
    PRIMARY KEY(transactionid,querynameid )
);

-- ALTER SEQUENCE querystats_id_seq OWNED BY querystats.id;

--CREATE INDEX querystats_idx1 ON querystats (transactionid);
-- CREATE INDEX querystats_idx2 ON querystats (querynameid);
CREATE UNIQUE INDEX querystats_alt ON querystats (querynameid, transactionid);

Answer 5

回答by wildplasser

The querystats table looks like a fatjunction table to me. In that case: omit the surrogate key, and live on the natural (composite) key (both components already are not NULLable) and add a reversedcomposite index. (the separate indices are useless, the FK constraint generates them automatically for you anyway)

querystats 表对我来说看起来像一个胖联结表。在这种情况下：省略代理键，并使用自然（复合）键（两个组件都已经不是可空的）并添加反向复合索引。（单独的索引没有用，无论如何 FK 约束会自动为您生成它们）

##代码##

PostgreSQL Simple JOIN 很慢

提问by MichaelM

EDIT

编辑

回答by kgrittn

回答by Erwin Brandstetter

回答by Mike Sherrill 'Cat Recall'

回答by Brane

回答by wildplasser

相关推荐

最近更新

标签

PostgreSQL Simple JOIN 很慢

提问by MichaelM

EDIT

编辑

回答by kgrittn

回答by Erwin Brandstetter

回答by Mike Sherrill 'Cat Recall'

回答by Brane

回答by wildplasser

相关推荐

PostgreSQL 索引不用于范围查询

postgresql 如何在 /tmp 文件夹中创建永久符号链接以启动 postgres？

postgresql 在 psql 中自定义寻呼机

postgresql Hibernate 启动很慢

相关推荐

最近更新

标签