Note: this page is a copy of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverflow.
Original: http://stackoverflow.com/questions/7943233/
Fast way to discover the row count of a table in PostgreSQL
Asked by Renato Dinhani
I need to know the number of rows in a table to calculate a percentage. If the total count is greater than some predefined constant, I will use the constant value. Otherwise, I will use the actual number of rows.
I can use SELECT count(*) FROM table. But if my constant value is 500,000 and I have 5,000,000,000 rows in my table, counting all rows will waste a lot of time.
Is it possible to stop counting as soon as my constant value is surpassed?
I need the exact number of rows only as long as it's below the given limit. Otherwise, if the count is above the limit, I use the limit value instead and want the answer as fast as possible.
Something like this:
SELECT text,count(*), percentual_calculus()
FROM token
GROUP BY text
ORDER BY count DESC;
Answered by Erwin Brandstetter
Counting rows in big tables is known to be slow in PostgreSQL. To get a precise number, it has to do a full count of rows due to the nature of MVCC. There is a way to speed this up dramatically if the count does not have to be exact, as seems to be the case here.
Instead of getting the exact count (slow with big tables):
SELECT count(*) AS exact_count FROM myschema.mytable;
You get a close estimate like this (extremely fast):
SELECT reltuples::bigint AS estimate FROM pg_class WHERE relname = 'mytable';
How close the estimate is depends on whether you run ANALYZE often enough. It is usually very close.
See the PostgreSQL Wiki FAQ, or the dedicated wiki page for count(*) performance.
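If the statistics may be stale, a manual ANALYZE refreshes reltuples before you read the estimate. A minimal sketch, assuming a table myschema.mytable:

```sql
-- Refresh the planner statistics for the table, then read the updated estimate.
ANALYZE myschema.mytable;

SELECT reltuples::bigint AS estimate
FROM   pg_class
WHERE  oid = 'myschema.mytable'::regclass;
```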
Better yet
The article in the PostgreSQL Wiki was a bit sloppy. It ignored the possibility that there can be multiple tables of the same name in one database, in different schemas. To account for that:
SELECT c.reltuples::bigint AS estimate
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relname = 'mytable'
AND n.nspname = 'myschema';
Or better still
SELECT reltuples::bigint AS estimate
FROM pg_class
WHERE oid = 'myschema.mytable'::regclass;
Faster, simpler, safer, more elegant. See the manual on Object Identifier Types.
Use to_regclass('myschema.mytable') in Postgres 9.4+ to avoid exceptions for invalid table names:
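A sketch of the idea; to_regclass() returns NULL for an unknown table name instead of raising an exception:

```sql
-- to_regclass() yields NULL (rather than an error) if the table does not exist,
-- so this query simply returns no row for an invalid name.
SELECT reltuples::bigint AS estimate
FROM   pg_class
WHERE  oid = to_regclass('myschema.mytable');
```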
TABLESAMPLE SYSTEM (n) in Postgres 9.5+
SELECT 100 * count(*) AS estimate FROM mytable TABLESAMPLE SYSTEM (1);
Like @a_horse commented, the newly added clause for the SELECT command might be useful if statistics in pg_class are not current enough for some reason. For example:
- No autovacuum running.
- Immediately after a big INSERT or DELETE.
- TEMPORARY tables (which are not covered by autovacuum).
This only looks at a random n% selection of blocks (1% in the example) and counts the rows in it. A bigger sample increases the cost and reduces the error; your pick. Accuracy depends on more factors:
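For instance, a larger sample (a hypothetical 10% of blocks here) costs more but tightens the estimate; the multiplier must match the sampled percentage:

```sql
-- Scan roughly 10% of the table's blocks and scale the count up accordingly.
SELECT 10 * count(*) AS estimate FROM mytable TABLESAMPLE SYSTEM (10);
```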
- Distribution of row size. If a given block happens to hold wider-than-usual rows, the count is lower than usual, etc.
- Dead tuples or a FILLFACTOR occupy space per block. If unevenly distributed across the table, the estimate may be off.
- General rounding errors.
In most cases the estimate from pg_class will be faster and more accurate.
Answer to actual question
First, I need to know the number of rows in that table, if the total count is greater than some predefined constant,
And whether it ...
... is possible at the moment the count pass my constant value, it will stop the counting (and not wait to finish the counting to inform the row count is greater).
Yes. You can use a subquery with LIMIT:
SELECT count(*) FROM (SELECT 1 FROM token LIMIT 500000) t;
Postgres actually stops counting beyond the given limit. You get an exact and current count for up to n rows (500,000 in the example), and n otherwise. It is not nearly as fast as the estimate in pg_class, though.
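Building on that, a sketch of the original use case: count up to the limit plus one row, so the query can tell whether the threshold was exceeded without scanning further (token and the 500,000 limit come from the question):

```sql
-- Counting stops after at most 500,001 rows; over_limit tells you whether
-- to substitute the predefined constant for the real count.
SELECT count(*) AS cnt,
       count(*) > 500000 AS over_limit
FROM   (SELECT 1 FROM token LIMIT 500001) t;
```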
Answered by Flimzy
I did this once in a postgres app by running:
EXPLAIN SELECT * FROM foo;
Then examining the output with a regex, or similar logic. For a simple SELECT *, the first line of output should look something like this:
Seq Scan on uids (cost=0.00..1.21 rows=8 width=75)
You can use the rows=(\d+) value as a rough estimate of the number of rows that would be returned, then only do the actual SELECT COUNT(*) if the estimate is, say, less than 1.5x your threshold (or whatever number you deem makes sense for your application).
Depending on the complexity of your query, this number may become less and less accurate. In fact, in my application, as we added joins and complex conditions, it became so inaccurate it was completely worthless, even for knowing within a power of 100 how many rows we'd have returned, so we had to abandon that strategy.
But if your query is simple enough that Pg can predict within some reasonable margin of error how many rows it will return, it may work for you.
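A minimal sketch of that strategy in Python; the plan line and the 1.5x rule are illustrative assumptions, not part of any API:

```python
import re

# Example first line of EXPLAIN output for a simple "SELECT * FROM uids"
# (the cost numbers vary per installation; this string is illustrative).
plan_line = "Seq Scan on uids  (cost=0.00..1.21 rows=8 width=75)"

def estimated_rows(plan: str) -> int:
    """Extract the planner's row estimate from an EXPLAIN plan line."""
    match = re.search(r"rows=(\d+)", plan)
    if match is None:
        raise ValueError("no row estimate found in plan")
    return int(match.group(1))

THRESHOLD = 500_000

estimate = estimated_rows(plan_line)
# Only fall back to a real SELECT count(*) when the estimate is close to
# the threshold (within 1.5x here, as suggested above).
needs_exact_count = estimate < 1.5 * THRESHOLD
print(estimate, needs_exact_count)
```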
Answered by SuperNova
You can get the count by the below query (without * or any column names).
select from table_name;
Answered by Anvesh
Reference taken from this Blog.
You can use the queries below to find the row count.
Using pg_class:
SELECT reltuples::bigint AS EstimatedCount
FROM pg_class
WHERE oid = 'public.TableName'::regclass;
Using pg_stat_user_tables:
SELECT
schemaname
,relname
,n_live_tup AS EstimatedCount
FROM pg_stat_user_tables
ORDER BY n_live_tup DESC;
Answered by Ritesh
In Oracle, you could use rownum to limit the number of rows returned. I am guessing a similar construct exists in other SQL dialects as well. So, for the example you gave, you could limit the number of rows returned to 500001 and then apply a count(*):
SELECT (case when cnt > 500000 then 500000 else cnt end) myCnt
FROM (SELECT count(*) cnt FROM table WHERE rownum<=500001)
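A rough translation of the same idea to PostgreSQL, which has no rownum; a LIMIT subquery plays the same role (token is the question's table, 500,000 its limit):

```sql
-- Count at most 500,001 rows, then cap the reported value at 500,000.
SELECT least(cnt, 500000) AS my_cnt
FROM   (SELECT count(*) AS cnt
        FROM   (SELECT 1 FROM token LIMIT 500001) s) t;
```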
Answered by Chris Bednarski
How wide is the text column?
With a GROUP BY there's not much you can do to avoid a data scan (at least an index scan).
I'd recommend:
If possible, changing the schema to remove duplication of text data. This way the count will happen on a narrow foreign key field in the 'many' table.
Alternatively, creating a generated column with a HASH of the text, then GROUP BY the hash column. Again, this is to decrease the workload (scan through a narrow column index).
Edit:
Your original question did not quite match your edit. I'm not sure if you're aware that the COUNT, when used with a GROUP BY, will return the count of items per group and not the count of items in the entire table.
Answered by DrKoch
For SQL Server (2005 or above), a quick and reliable method is:
SELECT SUM (row_count)
FROM sys.dm_db_partition_stats
WHERE object_id=OBJECT_ID('MyTableName')
AND (index_id=0 or index_id=1);
Details about sys.dm_db_partition_stats are explained in MSDN
The query adds rows from all parts of a (possibly) partitioned table.
index_id=0 is an unordered table (heap) and index_id=1 is an ordered table (clustered index).
Even faster (but unreliable) methods are detailed here.