PostgreSQL: NOT IN versus EXCEPT performance difference (edited #2)

Note: this page reproduces a popular StackOverflow question under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and authors, and attribute the content to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/7125291/

Time: 2020-10-20 23:13:35  Source: igfitidea

sql postgresql

Asked by Daniel Lyons

I have two queries that are functionally identical. One of them performs very well, the other one performs very poorly. I do not see from where the performance difference arises.

Query #1:

SELECT id 
FROM subsource_position
WHERE
  id NOT IN (SELECT position_id FROM subsource)

This comes back with the following plan:

                                  QUERY PLAN                                   
-------------------------------------------------------------------------------
 Seq Scan on subsource_position  (cost=0.00..362486535.10 rows=128524 width=4)
   Filter: (NOT (SubPlan 1))
   SubPlan 1
     ->  Materialize  (cost=0.00..2566.50 rows=101500 width=4)
           ->  Seq Scan on subsource  (cost=0.00..1662.00 rows=101500 width=4)

Query #2:

SELECT id FROM subsource_position
EXCEPT
SELECT position_id FROM subsource;

Plan:

                                           QUERY PLAN                                            
-------------------------------------------------------------------------------------------------
 SetOp Except  (cost=24760.35..25668.66 rows=95997 width=4)
   ->  Sort  (cost=24760.35..25214.50 rows=181663 width=4)
         Sort Key: "*SELECT* 1".id
         ->  Append  (cost=0.00..6406.26 rows=181663 width=4)
               ->  Subquery Scan on "*SELECT* 1"  (cost=0.00..4146.94 rows=95997 width=4)
                     ->  Seq Scan on subsource_position  (cost=0.00..3186.97 rows=95997 width=4)
               ->  Subquery Scan on "*SELECT* 2"  (cost=0.00..2259.32 rows=85666 width=4)
                     ->  Seq Scan on subsource  (cost=0.00..1402.66 rows=85666 width=4)
(8 rows)

I have a feeling that I'm either missing something obviously bad about one of my queries, or I have misconfigured the PostgreSQL server. I would have expected this NOT IN to optimize well; is NOT IN always a performance problem, or is there a reason it does not optimize here?

Additional data:

=> select count(*) from subsource;
 count 
-------
 85158
(1 row)

=> select count(*) from subsource_position;
 count 
-------
 93261
(1 row)

Edit: I have now fixed the A-B != B-A problem mentioned below. But my problem as stated still exists: query #1 is still massively worse than query #2. This, I believe, follows from the fact that both tables have similar numbers of rows.

Edit 2: I'm using PostgreSQL 9.0.4. I cannot use EXPLAIN ANALYZE because query #1 takes too long. All of these columns are NOT NULL, so there should be no difference as a result of that.

Edit 3: I have an index on both these columns. I haven't yet gotten query #1 to complete (gave up after ~10 minutes). Query #2 returns immediately.

Accepted answer by Magnus Hagander

Since you are running with the default configuration, try bumping up work_mem. Most likely, the subquery ends up getting spooled to disk because you only allow for 1MB of work memory. Try 10MB or 20MB.
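
A minimal sketch of trying this for a single session, using the tables from the question. The 20MB figure is simply the value suggested above, not a tuned recommendation. If the subquery result fits in work_mem, the planner may switch to a hashed subplan, which EXPLAIN reports as "hashed SubPlan 1" instead of a plain "SubPlan 1" over a Materialize node.

```sql
-- Check the current setting first.
SHOW work_mem;

-- Raise it for the current session only (assumed value, taken from the answer).
SET work_mem = '20MB';

-- Re-check the plan; plain EXPLAIN does not execute the slow query.
EXPLAIN
SELECT id
FROM subsource_position
WHERE id NOT IN (SELECT position_id FROM subsource);

-- Go back to the server default for this session.
RESET work_mem;
```

SET only affects the current session, so this is a low-risk way to test the theory before changing postgresql.conf.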

Answer by Antony Gibbs

Query #1 is not an elegant way of doing this. (NOT) IN (SELECT ...) is fine for a few entries, but it can't use indexes (hence the Seq Scan).

Before EXCEPT was available, this is how it was done using a JOIN (HASH JOIN):

    SELECT sp.id
    FROM subsource_position AS sp
        LEFT JOIN subsource AS s ON (s.position_id = sp.id)
    WHERE
        s.position_id IS NULL

EXCEPT appeared in Postgres a long, long time ago. In MySQL, for example, I believe this is still the only way to achieve this with joins that can use indexes.
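
The LEFT JOIN ... IS NULL query above is an anti-join. As a sketch using the same tables as the question, it can also be written with NOT EXISTS, which the PostgreSQL planner can usually turn into an anti-join as well; which plan you actually get depends on your version, statistics, and settings.

```sql
-- Rows of subsource_position that have no matching row in subsource.
EXPLAIN
SELECT sp.id
FROM subsource_position AS sp
WHERE NOT EXISTS (
    SELECT 1
    FROM subsource AS s
    WHERE s.position_id = sp.id
);
```

Unlike EXCEPT, neither this form nor the LEFT JOIN form removes duplicate ids from the result, which only matters if subsource_position.id is not unique.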

Answer by mu is too short

Your queries are not functionally equivalent so any comparison of their query plans is meaningless.

Your first query is, in set theory terms, this:

{subsource.position_id} - {subsource_position.id}
          ^        ^                ^        ^

but your second is this:

{subsource_position.id} - {subsource.position_id}
          ^        ^                ^        ^

And A - B is not the same as B - A for arbitrary sets A and B.

Fix your queries to be semantically equivalent and try again.
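
A small self-contained illustration of why the direction matters. The inline values are made up for the example and have nothing to do with the real tables.

```sql
-- A - B: values in a that are not in b. Returns 1.
SELECT x FROM (VALUES (1), (2), (3)) AS a(x)
EXCEPT
SELECT x FROM (VALUES (2), (3), (4)) AS b(x);

-- B - A: values in b that are not in a. Returns 4.
SELECT x FROM (VALUES (2), (3), (4)) AS b(x)
EXCEPT
SELECT x FROM (VALUES (1), (2), (3)) AS a(x);
```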

Answer by Barry Kelly

If id and position_id are both indexed (either on their own or as the first column in a multi-column index), then two index scans are all that is necessary: it's a trivial sorted-merge based set algorithm.

Personally I think PostgreSQL simply doesn't have the optimization intelligence to understand this.

(I came to this question after diagnosing a query running for over 24 hours that I could perform with sort x y y | uniq -u on the command line in seconds. The database was less than 50MB when exported with pg_dump.)
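
The sort x y y | uniq -u trick lists x once and y twice, so only lines that occur exactly once overall survive; if x has no duplicates, those are the lines in x that are not in y. A rough SQL rendering of the same counting idea, with hypothetical inline data standing in for the two files:

```sql
-- x and y are made-up stand-ins for the two exported files.
WITH x(v) AS (VALUES (1), (2), (3)),
     y(v) AS (VALUES (2), (3), (4))
SELECT v
FROM (
    SELECT v FROM x
    UNION ALL
    SELECT v FROM y
    UNION ALL
    SELECT v FROM y      -- y is listed twice, as in `sort x y y`
) AS combined
GROUP BY v
HAVING count(*) = 1;     -- keep values seen exactly once, like `uniq -u`
```

With this sample data only the value 1 is returned. This is meant to explain the shell one-liner, not as a recommended way to write the query.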

PS: more interesting comment here:

more work has been put into optimizing EXCEPT and NOT EXISTS than NOT IN, because the latter is substantially less useful due to its unintuitive but spec-mandated handling of NULLs. We're not going to apologize for that, and we're not going to regard it as a bug.

What it comes down to is that except is different from not in with respect to null handling. I haven't looked up the details, but it means PostgreSQL (aggressively) doesn't optimize it.
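
A small self-contained sketch of that NULL difference. The temporary tables a and b are hypothetical and exist only for this example.

```sql
CREATE TEMP TABLE a (id integer);
CREATE TEMP TABLE b (id integer);
INSERT INTO a VALUES (1), (2), (3);
INSERT INTO b VALUES (2), (NULL);

-- NOT IN: returns no rows at all. Comparing any value with the NULL in b
-- yields "unknown", so the NOT IN test is never true.
SELECT id FROM a WHERE id NOT IN (SELECT id FROM b);

-- EXCEPT: returns 1 and 3. The NULL in b only removes a NULL from a.
SELECT id FROM a
EXCEPT
SELECT id FROM b;

-- NOT EXISTS: also returns 1 and 3.
SELECT id FROM a
WHERE NOT EXISTS (SELECT 1 FROM b WHERE b.id = a.id);
```

When the columns are declared NOT NULL, as in the question, all three forms return the same set of ids, although EXCEPT additionally removes duplicates.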

Answer by Thomas Berger

The second query makes use of the HASH JOIN feature of PostgreSQL. This is much faster than the Seq Scan of the first one.
