Note: this Q&A is republished from StackOverflow under the CC BY-SA 4.0 license; attribute it to the original authors. Original question: http://stackoverflow.com/questions/16652793/

Postgres Hash Join speed

Tags: performance, postgresql

Asked by Jon

I have 3 tables which I wish to join using inner joins in Postgres 9.1: reads, devices, and patient_devices. Below is an abbreviated schema for each table.

reads -- ~250,000 rows

CREATE TABLE reads
  (
    id serial NOT NULL,
    device_id integer NOT NULL,
    value bigint NOT NULL,
    read_datetime timestamp without time zone NOT NULL,
    created_at timestamp without time zone NOT NULL,
    updated_at timestamp without time zone NOT NULL,
    CONSTRAINT reads_pkey PRIMARY KEY (id )
  )
WITH (
  OIDS=FALSE
);
ALTER TABLE reads
  OWNER TO postgres;

CREATE INDEX index_reads_on_device_id
  ON reads
  USING btree
  (device_id );

CREATE INDEX index_reads_on_read_datetime
  ON reads
  USING btree
  (read_datetime );

devices -- ~500 rows

CREATE TABLE devices
(
  id serial NOT NULL,
  serial_number character varying(20) NOT NULL,
  created_at timestamp without time zone NOT NULL,
  updated_at timestamp without time zone NOT NULL,
  CONSTRAINT devices_pkey PRIMARY KEY (id )
)
WITH (
  OIDS=FALSE
);
ALTER TABLE devices
  OWNER TO postgres;

CREATE UNIQUE INDEX index_devices_on_serial_number
  ON devices
  USING btree
  (serial_number COLLATE pg_catalog."default" );

patient_devices -- ~25,000 rows

CREATE TABLE patient_devices
(
  id serial NOT NULL,
  patient_id integer NOT NULL,
  device_id integer NOT NULL,
  issuance_datetime timestamp without time zone NOT NULL,
  unassignment_datetime timestamp without time zone,
  created_at timestamp without time zone NOT NULL,
  updated_at timestamp without time zone NOT NULL,
  CONSTRAINT patient_devices_pkey PRIMARY KEY (id )
)
WITH (
  OIDS=FALSE
);
ALTER TABLE patient_devices
  OWNER TO postgres;

CREATE INDEX index_patient_devices_on_device_id
  ON patient_devices
  USING btree
  (device_id );

CREATE INDEX index_patient_devices_on_issuance_datetime
  ON patient_devices
  USING btree
  (issuance_datetime );

CREATE INDEX index_patient_devices_on_patient_id
  ON patient_devices
  USING btree
  (patient_id );

CREATE INDEX index_patient_devices_on_unassignment_datetime
  ON patient_devices
  USING btree
  (unassignment_datetime );

patients -- ~1,000 rows

CREATE TABLE patients
(
  id serial NOT NULL,
  first_name character varying(50) NOT NULL,
  middle_name character varying(50),
  last_name character varying(50) NOT NULL,
  created_at timestamp without time zone NOT NULL,
  updated_at timestamp without time zone NOT NULL,
  CONSTRAINT participants_pkey PRIMARY KEY (id )
)
WITH (
  OIDS=FALSE
);
ALTER TABLE patients
  OWNER TO postgres;

Here is my abbreviated query.

SELECT patient_devices.patient_id, serial_number FROM reads
INNER JOIN devices ON devices.id = reads.device_id
INNER JOIN patient_devices ON patient_devices.device_id = devices.id
WHERE (reads.read_datetime BETWEEN '2012-01-01 10:30:01.000000' AND '2013-05-18 03:03:42')
AND (read_datetime > issuance_datetime) AND ((unassignment_datetime IS NOT NULL AND read_datetime < unassignment_datetime) OR
(unassignment_datetime IS NULL))
GROUP BY serial_number, patient_devices.patient_id LIMIT 10

Ultimately this will be a small part of a larger query; I only added the LIMIT to prove to myself that the long runtime was not due to returning a bunch of rows. I've done a bunch of experimenting and determined that this is the slow part of the larger query. When I run EXPLAIN ANALYZE on this query I get the following output (also viewable here):

Limit  (cost=156442.31..156442.41 rows=10 width=13) (actual time=2815.435..2815.441 rows=10 loops=1)
  ->  HashAggregate  (cost=156442.31..159114.89 rows=267258 width=13) (actual time=2815.432..2815.437 rows=10 loops=1)
        ->  Hash Join  (cost=1157.78..151455.79 rows=997304 width=13) (actual time=30.930..2739.164 rows=250150 loops=1)
              Hash Cond: (reads.device_id = devices.id)
              Join Filter: ((reads.read_datetime > patient_devices.issuance_datetime) AND (((patient_devices.unassignment_datetime IS NOT NULL) AND (reads.read_datetime < patient_devices.unassignment_datetime)) OR (patient_devices.unassignment_datetime IS NULL)))
              ->  Seq Scan on reads  (cost=0.00..7236.94 rows=255396 width=12) (actual time=0.035..64.433 rows=255450 loops=1)
                    Filter: ((read_datetime >= '2012-01-01 10:30:01'::timestamp without time zone) AND (read_datetime <= '2013-05-18 03:03:42'::timestamp without time zone))
              ->  Hash  (cost=900.78..900.78 rows=20560 width=37) (actual time=30.830..30.830 rows=25015 loops=1)
                    Buckets: 4096  Batches: 1  Memory Usage: 1755kB
                    ->  Hash Join  (cost=19.90..900.78 rows=20560 width=37) (actual time=0.776..20.551 rows=25015 loops=1)
                          Hash Cond: (patient_devices.device_id = devices.id)
                          ->  Seq Scan on patient_devices  (cost=0.00..581.93 rows=24893 width=24) (actual time=0.014..7.867 rows=25545 loops=1)
                                Filter: ((unassignment_datetime IS NOT NULL) OR (unassignment_datetime IS NULL))
                          ->  Hash  (cost=13.61..13.61 rows=503 width=13) (actual time=0.737..0.737 rows=503 loops=1)
                                Buckets: 1024  Batches: 1  Memory Usage: 24kB
                                ->  Seq Scan on devices  (cost=0.00..13.61 rows=503 width=13) (actual time=0.016..0.466 rows=503 loops=1)
                                      Filter: (entity_id = 2)
Total runtime: 2820.392 ms

My question is: how do I speed this up? Right now I'm running this on my Windows machine for testing, but ultimately it will be deployed on Ubuntu; will that make a difference? Any insight into why this takes 2 seconds would be greatly appreciated.

Thanks

It has been suggested that the LIMIT might be altering the query plan. Here is the same query without the LIMIT. The slow part still appears to be the Hash Join.

Also, here are the relevant tuning parameters. Again, I'm only testing this on Windows for now, and I don't know what effect these would have on a Linux machine.

shared_buffers = 2GB
effective_cache_size = 4GB
work_mem = 256MB
random_page_cost = 2.0
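
For reference, the settings a running server actually uses can be double-checked with SHOW; work_mem in particular can also be overridden per session:

SHOW shared_buffers;        -- memory dedicated to the shared buffer cache
SHOW effective_cache_size;  -- planner's assumption about total cache (OS + shared buffers)
SHOW work_mem;              -- per-operation memory for sorts and hash tables
SHOW random_page_cost;      -- planner's relative cost of a non-sequential page read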

Here are the statistics for the reads table

Statistic   Value
Sequential Scans    130
Sequential Tuples Read  28865850
Index Scans 283630
Index Tuples Fetched    141421907
Tuples Inserted 255450
Tuples Updated  0
Tuples Deleted  0
Tuples HOT Updated  0
Live Tuples 255450
Dead Tuples 0
Heap Blocks Read    20441
Heap Blocks Hit 3493033
Index Blocks Read   8824
Index Blocks Hit    4840210
Toast Blocks Read   
Toast Blocks Hit    
Toast Index Blocks Read 
Toast Index Blocks Hit  
Last Vacuum 2013-05-20 09:23:03.782-07
Last Autovacuum 
Last Analyze    2013-05-20 09:23:03.91-07
Last Autoanalyze    2013-05-17 19:01:44.075-07
Vacuum counter  1
Autovacuum counter  0
Analyze counter 1
Autoanalyze counter 6
Table Size  27 MB
Toast Table Size    none
Indexes Size    34 MB

Here are the statistics for the devices table

Statistic   Value
Sequential Scans    119
Sequential Tuples Read  63336
Index Scans 1053935
Index Tuples Fetched    1053693
Tuples Inserted 609
Tuples Updated  0
Tuples Deleted  0
Tuples HOT Updated  0
Live Tuples 609
Dead Tuples 0
Heap Blocks Read    32
Heap Blocks Hit 1054553
Index Blocks Read   32
Index Blocks Hit    2114305
Toast Blocks Read   
Toast Blocks Hit    
Toast Index Blocks Read 
Toast Index Blocks Hit  
Last Vacuum 
Last Autovacuum 
Last Analyze    
Last Autoanalyze    2013-05-17 19:02:49.692-07
Vacuum counter  0
Autovacuum counter  0
Analyze counter 0
Autoanalyze counter 2
Table Size  48 kB
Toast Table Size    none
Indexes Size    128 kB

Here are the statistics for the patient_devices table

Statistic   Value
Sequential Scans    137
Sequential Tuples Read  3065400
Index Scans 853990
Index Tuples Fetched    46143763
Tuples Inserted 25545
Tuples Updated  24936
Tuples Deleted  0
Tuples HOT Updated  0
Live Tuples 25547
Dead Tuples 929
Heap Blocks Read    1959
Heap Blocks Hit 6099617
Index Blocks Read   1077
Index Blocks Hit    2462681
Toast Blocks Read   
Toast Blocks Hit    
Toast Index Blocks Read 
Toast Index Blocks Hit  
Last Vacuum 
Last Autovacuum 2013-05-17 19:01:44.576-07
Last Analyze    
Last Autoanalyze    2013-05-17 19:01:44.697-07
Vacuum counter  0
Autovacuum counter  6
Analyze counter 0
Autoanalyze counter 6
Table Size  2624 kB
Toast Table Size    none
Indexes Size    5312 kB

Below is the full query that I'm trying to speed up; the smaller query is indeed faster, but I was unable to make the full query, reproduced below, any faster. As suggested, I added 4 new indices: UNIQUE(device_id, issuance_datetime), UNIQUE(device_id, unassignment_datetime), UNIQUE(patient_id, issuance_datetime), and UNIQUE(patient_id, unassignment_datetime).

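Spelled out as DDL, those four indices would presumably be something like this (the index names are invented here for illustration; the question only gives the column pairs):

-- Presumed DDL for the four indices mentioned above.
CREATE UNIQUE INDEX index_patient_devices_device_issued
  ON patient_devices (device_id, issuance_datetime);
CREATE UNIQUE INDEX index_patient_devices_device_unassigned
  ON patient_devices (device_id, unassignment_datetime);
CREATE UNIQUE INDEX index_patient_devices_patient_issued
  ON patient_devices (patient_id, issuance_datetime);
CREATE UNIQUE INDEX index_patient_devices_patient_unassigned
  ON patient_devices (patient_id, unassignment_datetime);
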
SELECT 
first_name
, last_name
, MAX(max_read) AS read_datetime
, SUM(value) AS value
, serial_number
FROM (
    SELECT
    pa.first_name
    , pa.last_name
    , value
    , first_value(de.serial_number) OVER(PARTITION BY pa.id ORDER BY re.read_datetime DESC) AS serial_number -- I'm not sure if this is a good way to do this, but I don't know of another way
    , re.read_datetime
    , MAX(re.read_datetime) OVER (PARTITION BY pd.id) AS max_read
    FROM reads re
    INNER JOIN devices de ON de.id = re.device_id
    INNER JOIN patient_devices pd ON pd.device_id = de.id
        AND re.read_datetime >= pd.issuance_datetime
        AND re.read_datetime < COALESCE(pd.unassignment_datetime , 'infinity'::timestamp)
    INNER JOIN patients pa ON pa.id = pd.patient_id
    WHERE re.read_datetime BETWEEN '2012-01-01 10:30:01' AND '2013-05-18 03:03:42'
) AS foo WHERE read_datetime = max_read
GROUP BY first_name, last_name, serial_number ORDER BY value desc
LIMIT 10

Sorry for not posting this earlier, but I thought this query would be too complicated and was trying to simplify the problem; apparently I still can't figure it out. It seems like it would be a LOT quicker if I could limit the results returned by the nested select using the max_read value, but according to numerous sources that isn't allowed in Postgres.

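For what it's worth, here is a sketch of one workaround against the schema above (it is not from the original post): DISTINCT ON can pick the latest read per patient_devices assignment directly, so the window function and the read_datetime = max_read filter become unnecessary; the final SUM / GROUP BY / ORDER BY would still be layered on top of this.

SELECT DISTINCT ON (pd.id)   -- one row per assignment: its newest read in the window
       pa.first_name
     , pa.last_name
     , re.value
     , de.serial_number
     , re.read_datetime
FROM reads re
INNER JOIN devices de ON de.id = re.device_id
INNER JOIN patient_devices pd ON pd.device_id = de.id
        AND re.read_datetime >= pd.issuance_datetime
        AND re.read_datetime < COALESCE(pd.unassignment_datetime, 'infinity'::timestamp)
INNER JOIN patients pa ON pa.id = pd.patient_id
WHERE re.read_datetime BETWEEN '2012-01-01 10:30:01' AND '2013-05-18 03:03:42'
ORDER BY pd.id, re.read_datetime DESC;
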
Accepted answer by wildplasser

FYI: sanitised query:

SELECT pd.patient_id
        , de.serial_number
FROM reads re
INNER JOIN devices de ON de.id = re.device_id
INNER JOIN patient_devices pd ON pd.device_id = de.id
        AND re.read_datetime >= pd.issuance_datetime -- changed this from '>' to '>='
        AND (re.read_datetime < pd.unissuance_datetime OR pd.unissuance_datetime IS NULL)
WHERE re.read_datetime BETWEEN '2012-01-01 10:30:01.000000' AND '2013-05-18 03:03:42'
GROUP BY de.serial_number, pd.patient_id 
LIMIT 10
   ;

UPDATE: without the original typos:

EXPLAIN ANALYZE
SELECT pd.patient_id
        , de.serial_number
FROM reads re
INNER JOIN devices de ON de.id = re.device_id
INNER JOIN patient_devices pd ON pd.device_id = de.id
        AND re.read_datetime >= pd.issuance_datetime
        AND (re.read_datetime < pd.unassignment_datetime OR pd.unassignment_datetime IS NULL)
WHERE re.read_datetime BETWEEN '2012-01-01 10:30:01.000000' AND '2013-05-18 03:03:42'
GROUP BY de.serial_number, pd.patient_id
LIMIT 10
  ;

UPDATE: this is about 6 times as fast here (on synthetic data, and with a slightly altered data model)

-- Modified data model + synthetic data:
CREATE TABLE devices
( id serial NOT NULL
, serial_number character varying(20) NOT NULL
-- , created_at timestamp without time zone NOT NULL
-- , updated_at timestamp without time zone NOT NULL
, CONSTRAINT devices_pkey PRIMARY KEY (id )
, UNIQUE (serial_number)
) ;

CREATE TABLE reads
  -- ( id serial NOT NULL PRIMARY KEY -- You don't need this surrogate key
  (  device_id integer NOT NULL REFERENCES devices (id)
  ,  value bigint NOT NULL
  ,  read_datetime timestamp without time zone NOT NULL
  -- ,  created_at timestamp without time zone NOT NULL
  -- ,  updated_at timestamp without time zone NOT NULL
        , PRIMARY KEY ( device_id, read_datetime)
  ) ;



CREATE TABLE patient_devices
-- ( id serial NOT NULL PRIMARY KEY -- You don't need this surrogate key
( patient_id integer NOT NULL -- REFERENCES patients (id)
, device_id integer NOT NULL REFERENCES devices(id)
, issuance_datetime timestamp without time zone NOT NULL
, unassignment_datetime timestamp without time zone
-- , created_at timestamp without time zone NOT NULL
-- , updated_at timestamp without time zone NOT NULL
, PRIMARY KEY (device_id, issuance_datetime)
, UNIQUE (device_id, unassignment_datetime)
) ;


-- CREATE INDEX index_patient_devices_on_issuance_datetime ON patient_devices (device_id, unassignment_datetime );
-- may need some additional indices later

-- devices -- ~500 rows
INSERT INTO devices(serial_number) SELECT 'No_' || gs::text FROM generate_series(1,500) gs;

-- reads -- ~100K rows
INSERT INTO reads(device_id, read_datetime, value)
SELECT  de.id, gs
        , (random()*1000000)::bigint
FROM devices  de
JOIN generate_series('2012-01-01', '2013-05-01' , '1 hour' ::interval) gs
        ON random() < 0.02;

-- patient_devices -- ~25,000 rows
INSERT INTO patient_devices(device_id, issuance_datetime, patient_id)
SELECT DISTINCT ON (re.device_id, read_datetime)
        re.device_id, read_datetime, pa
FROM generate_series(1,100) pa
JOIN reads re
        ON random() < 0.01;

        -- close the open intervals
UPDATE patient_devices dst
SET unassignment_datetime = src.issuance_datetime
FROM patient_devices src
WHERE src.device_id = dst.device_id
AND src.issuance_datetime > dst.issuance_datetime
AND NOT EXISTS ( SELECT *
        FROM patient_devices nx
        WHERE nx.device_id = src.device_id
        AND nx.issuance_datetime > dst.issuance_datetime
        AND nx.issuance_datetime < src.issuance_datetime
        )
        ;

VACUUM ANALYZE patient_devices;
VACUUM ANALYZE devices;
VACUUM ANALYZE reads;


-- EXPLAIN ANALYZE
SELECT pd.patient_id
        , de.serial_number
        --, COUNT (*) AS zcount
FROM reads re
INNER JOIN devices de ON de.id = re.device_id
INNER JOIN patient_devices pd ON pd.device_id = de.id
        AND re.read_datetime >= pd.issuance_datetime
        AND re.read_datetime < COALESCE(pd.unassignment_datetime , 'infinity'::timestamp)
WHERE re.read_datetime BETWEEN '2012-01-01 10:30:01' AND '2013-05-18 03:03:42'
GROUP BY de.serial_number, pd.patient_id
LIMIT 10
        ;

Answer by ulkas

Look at the parts of the EXPLAIN ANALYZE report where you see a Seq Scan.

For example, these parts could use some indexes:

 Seq Scan on patient_devices -> unassignment_datetime
 Seq Scan on devices -> entity_id
 Seq Scan on reads -> read_datetime
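
To check whether an index would actually beat those sequential scans, one standard experiment (sketched here; it is not part of the original answer) is to disable sequential scans for the session and compare the resulting plan with the one above:

SET enable_seqscan = off;  -- force the planner away from seq scans (testing only)

EXPLAIN ANALYZE
SELECT pd.patient_id, de.serial_number
FROM reads re
INNER JOIN devices de ON de.id = re.device_id
INNER JOIN patient_devices pd ON pd.device_id = de.id
        AND re.read_datetime >= pd.issuance_datetime
        AND (re.read_datetime < pd.unassignment_datetime OR pd.unassignment_datetime IS NULL)
WHERE re.read_datetime BETWEEN '2012-01-01 10:30:01' AND '2013-05-18 03:03:42'
GROUP BY de.serial_number, pd.patient_id
LIMIT 10;

RESET enable_seqscan;  -- restore the default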

About read_datetime: it is possible to create an index that supports range comparisons like > and <=, which will come in handy. I don't know the syntax for it, though (see the sketch below).

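For the record, no special index type is needed for this: a plain btree index already supports range predicates such as > and <=. A sketch against the schema above (the index names are invented here for illustration):

-- A composite btree lets the planner resolve both the join on device_id and
-- the range predicate on read_datetime with a single index scan:
CREATE INDEX index_reads_on_device_id_read_datetime
  ON reads (device_id, read_datetime);

-- A partial index cheaply covers the rows that are still assigned
-- (unassignment_datetime IS NULL):
CREATE INDEX index_patient_devices_open
  ON patient_devices (device_id)
  WHERE unassignment_datetime IS NULL;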