postgresql COUNT(DISTINCT ...) 很慢

Question

提问by ferson2020

I have a very simple SQL query:

我有一个非常简单的 SQL 查询：

SELECT COUNT(DISTINCT x) FROM table;

My table has about 1.5 million rows. This query is running pretty slowly; it takes about 7.5s, compared to

我的表有大约 150 万行。这个查询运行得很慢；大约需要 7.5 秒，相比之下

 SELECT COUNT(x) FROM table;

which takes about 435ms. Is there any way to change my query to improve performance? I've tried grouping and doing a regular count, as well as putting an index on x; both have the same 7.5s execution time.

大约需要 435 毫秒。有什么方法可以更改我的查询以提高性能吗？我尝试过分组并进行常规计数，并在 x 上放置索引；两者都有相同的 7.5 秒执行时间。

Answer 1

回答by Ankur

You can use this:

你可以使用这个：

SELECT COUNT(*) FROM (SELECT DISTINCT column_name FROM table_name) AS temp;

This is much faster than:

这比以下快得多：

COUNT(DISTINCT column_name)

Answer 2

回答by wildplasser

-- My default settings (this is basically a single-session machine, so work_mem is pretty high)
SET effective_cache_size='2048MB';
SET work_mem='16MB';

\echo original
EXPLAIN ANALYZE
SELECT
        COUNT (distinct val) as aantal
FROM one
        ;

\echo group by+count(*)
EXPLAIN ANALYZE
SELECT
        distinct val
       -- , COUNT(*)
FROM one
GROUP BY val;

\echo with CTE
EXPLAIN ANALYZE
WITH agg AS (
    SELECT distinct val
    FROM one
    GROUP BY val
    )
SELECT COUNT (*) as aantal
FROM agg
        ;

Results:

结果：

original                                                      QUERY PLAN                                                      
----------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=36448.06..36448.07 rows=1 width=4) (actual time=1766.472..1766.472 rows=1 loops=1)
   ->  Seq Scan on one  (cost=0.00..32698.45 rows=1499845 width=4) (actual time=31.371..185.914 rows=1499845 loops=1)
 Total runtime: 1766.642 ms
(3 rows)

group by+count(*)
                                                         QUERY PLAN                                                         
----------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=36464.31..36477.31 rows=1300 width=4) (actual time=412.470..412.598 rows=1300 loops=1)
   ->  HashAggregate  (cost=36448.06..36461.06 rows=1300 width=4) (actual time=412.066..412.203 rows=1300 loops=1)
         ->  Seq Scan on one  (cost=0.00..32698.45 rows=1499845 width=4) (actual time=26.134..166.846 rows=1499845 loops=1)
 Total runtime: 412.686 ms
(4 rows)

with CTE
                                                             QUERY PLAN                                                             
------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=36506.56..36506.57 rows=1 width=0) (actual time=408.239..408.239 rows=1 loops=1)
   CTE agg
     ->  HashAggregate  (cost=36464.31..36477.31 rows=1300 width=4) (actual time=407.704..407.847 rows=1300 loops=1)
           ->  HashAggregate  (cost=36448.06..36461.06 rows=1300 width=4) (actual time=407.320..407.467 rows=1300 loops=1)
                 ->  Seq Scan on one  (cost=0.00..32698.45 rows=1499845 width=4) (actual time=24.321..165.256 rows=1499845 loops=1)
       ->  CTE Scan on agg  (cost=0.00..26.00 rows=1300 width=0) (actual time=407.707..408.154 rows=1300 loops=1)
     Total runtime: 408.300 ms
    (7 rows)

The same plan as for the CTE could probably also be produced by other methods (window functions)

与 CTE 相同的计划可能也可以通过其他方法（窗口函数）生成

Answer 3

回答by Tometzky

If your count(distinct(x))is significantly slower than count(x)then you can speed up this query by maintaining x value counts in different table, for example table_name_x_counts (x integer not null, x_count int not null), using triggers. But your write performance will suffer and if you update multiple xvalues in single transaction then you'd need to do this in some explicit order to avoid possible deadlock.

如果您count(distinct(x))的速度明显慢于count(x)那么您可以通过维护不同表中的 x 值计数来加速此查询，例如table_name_x_counts (x integer not null, x_count int not null)，使用触发器。但是您的写入性能会受到影响，如果您x在单个事务中更新多个值，那么您需要以某种明确的顺序执行此操作以避免可能的死锁。

Answer 4

回答by Rana Pratap Singh

I was also searching same answer, because at some point of time I needed total_count with distinct values along with limit/offset.

我也在寻找相同的答案，因为在某些时候我需要total_count 与不同的值以及 limit/offset。

Because it's little tricky to do- To get total count with distinct values along with limit/offset. Usually it's hard to get total count with limit/offset. Finally I got the way to do -

因为这有点棘手 - 获得具有不同值以及限制/偏移量的总数。通常很难获得限制/偏移的总数。终于我找到了办法——

SELECT DISTINCT COUNT(*) OVER() as total_count, * FROM table_name limit 2 offset 0;

Query performance is also high.

查询性能也很高。

postgresql COUNT(DISTINCT ...) 很慢

提问by ferson2020

回答by Ankur

回答by wildplasser

回答by Tometzky

回答by Rana Pratap Singh

相关推荐

最近更新

标签

postgresql COUNT(DISTINCT ...) 很慢

提问by ferson2020

回答by Ankur

回答by wildplasser

回答by Tometzky

回答by Rana Pratap Singh

相关推荐

postgresql 如何让 pg_dump 正确进行身份验证

postgresql 将字符串类型转换为整数 - Postgres

postgresql 错误：运算符不存在：字符变化 = bytea

postgresql plpgsql：调用带有 2 个 OUT 参数的函数

相关推荐

最近更新

标签