SQL sum() vs. count()

Notice: this page reproduces a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/14998225/

Date: 2020-09-01 13:48:23. Source: igfitidea

sum() vs. count()

sql postgresql aggregate-functions

Asked by ryanrhee

Consider a voting system implemented in PostgreSQL, where each user can vote up or down on a "foo". There is a foo table that stores all the "foo information", and a votes table that stores the user_id, foo_id, and vote, where vote is +1 or -1.

To get the vote tally for each foo, the following query would work:

SELECT sum(vote) FROM votes WHERE foo.foo_id = votes.foo_id;

But, the following would work just as well:

(SELECT count(vote) FROM votes 
 WHERE foo.foo_id = votes.foo_id 
 AND votes.vote = 1)
- (SELECT count(vote) FROM votes 
   WHERE foo.foo_id = votes.foo_id 
   AND votes.vote = (-1))

I currently have an index on votes.foo_id.

Which is a more efficient approach? (In other words, which would run faster?) I'm interested in both the PostgreSQL-specific answer and the general SQL answer.

EDIT

A lot of answers have been taking into account the case where vote is null. I forgot to mention that there is a NOT NULL constraint on the vote column.

Also, many have been pointing out that the first is much easier to read. Yes, that is definitely true, and if a colleague wrote the second one, I would be exploding with rage unless there was a performance necessity. Nevertheless, the question is still about the performance of the two. (Technically, if the first query were way slower, it wouldn't be such a crime to write the second query.)

Answered by Erwin Brandstetter

Of course, the first example is faster, simpler and easier to read. That should be obvious even before one gets slapped with aquatic creatures. While sum() is slightly more expensive than count(), what matters much, much more is that the second example needs two scans.
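
If a count-based tally is wanted anyway, it does not have to cost two scans. A minimal sketch, assuming PostgreSQL 9.4 or later (where the aggregate FILTER clause is available) and the votes table from the question. It is an illustration only, not part of the original answer, and the id 5 is an arbitrary example value. Unlike sum(), it yields 0 rather than NULL when no rows match.

```sql
-- Tally for one foo in a single pass over the matching rows.
-- Each FILTER clause restricts which rows feed that particular count.
SELECT count(*) FILTER (WHERE vote = 1)
     - count(*) FILTER (WHERE vote = -1) AS score
FROM   votes
WHERE  foo_id = 5;  -- arbitrary example id
```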

But there is an actual difference, too: sum() can return NULL where count() doesn't. I quote the manual on aggregate functions:

It should be noted that except for count, these functions return a null value when no rows are selected. In particular, sum of no rows returns null, not zero as one might expect,
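
The behavior described in the manual can be checked without any tables. A small sketch, assuming only a PostgreSQL session; the inline VALUES list and the column aliases are made up for illustration:

```sql
-- No rows survive the WHERE clause, so sum() yields NULL while count() yields 0.
SELECT sum(v)              AS sum_result,    -- NULL
       count(v)            AS count_result,  -- 0
       coalesce(sum(v), 0) AS sum_or_zero    -- 0
FROM  (VALUES (1), (-1)) AS t(v)
WHERE  false;
```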

Since you seem to have a weak spot for performance optimization, here's a detail you might like: count(*) is slightly faster than count(vote). The two are only equivalent if vote is NOT NULL. Test performance with EXPLAIN ANALYZE.
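
To make the difference concrete, here is a short sketch. The first statement uses a made-up inline VALUES list to show why the two forms only agree on NOT NULL columns; the last two assume the votes table from the question and an arbitrary foo_id of 5.

```sql
-- count(*) counts rows; count(v) counts only rows where v is not NULL.
SELECT count(*) AS all_rows,       -- 3
       count(v) AS non_null_rows   -- 2
FROM  (VALUES (1), (NULL), (-1)) AS t(v);

-- Timing comparison against the real table:
EXPLAIN ANALYZE SELECT count(*)    FROM votes WHERE foo_id = 5;
EXPLAIN ANALYZE SELECT count(vote) FROM votes WHERE foo_id = 5;
```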

On closer inspection

Both queries are syntactical nonsense, standing alone. They only make sense if you copied them from the SELECT list of a bigger query like:

SELECT *, (SELECT sum(vote) FROM votes WHERE votes.foo_id = foo.foo_id)
FROM   foo;

The important point here is the correlated subquery - which may be fine if you are only reading a small fraction of votes in your query. We would see additional WHERE conditions, and you should have matching indexes.

In Postgres 9.3 or later, the alternative, cleaner, 100 % equivalent solution would be with LEFT JOIN LATERAL ... ON true:

SELECT *
FROM   foo f
LEFT   JOIN LATERAL (
   SELECT sum(vote) FROM votes WHERE foo_id = f.foo_id
   ) v ON true;

Typically similar performance. Details:

However, while reading large parts or all of table votes, this will be (much) faster:

SELECT f.*, v.score
FROM   foo f
JOIN   (
   SELECT foo_id, sum(vote) AS score
   FROM   votes
   GROUP  BY 1
   ) v USING (foo_id);

Aggregate values in a subquery first, then join to the result.
About USING:
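
One caveat with the aggregate-then-join form above: the inner JOIN drops every foo that has no votes at all, whereas the correlated subquery returns those rows with a NULL score. A hedged variant that keeps them, assuming a score of 0 is acceptable in place of NULL (omit the coalesce() to keep NULL):

```sql
SELECT f.*, coalesce(v.score, 0) AS score
FROM   foo f
LEFT   JOIN (
   SELECT foo_id, sum(vote) AS score
   FROM   votes
   GROUP  BY foo_id
   ) v USING (foo_id);
```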

Answered by MatheusOl

The first one will be faster. You can try it in a simple way.

Generate some data:

CREATE TABLE votes(foo_id integer, vote integer);
-- Insert 1000000 rows into 100 foos (1 to 100)
INSERT INTO votes SELECT round(random()*99)+1, CASE round(random()) WHEN 0 THEN -1 ELSE 1 END FROM generate_series(1, 1000000);
CREATE INDEX idx_votes_id ON votes (foo_id);

Check both

EXPLAIN ANALYZE SELECT SUM(vote) FROM votes WHERE foo_id = 5;
EXPLAIN ANALYZE SELECT (SELECT COUNT(*) FROM votes WHERE foo_id=5 AND vote=1) - (SELECT COUNT(*) FROM votes WHERE foo_id=5 AND vote=-1);

But the truth is that they are not equivalent. To make sure the first one behaves like the second, you need to handle the null case:

SELECT COALESCE(SUM(vote), 0) FROM votes WHERE foo_id = 5;

One more thing. If you are using PostgreSQL 9.2 or later, you can create your index with both columns in it, and that way you have a chance of getting an index-only scan:

-- Named differently from the single-column index so that both can coexist.
CREATE INDEX idx_votes_id_vote ON votes (foo_id, vote);

BUT! In some situations this index may be worse, so you should try both and run EXPLAIN ANALYZE to see which one is best, or even create both, check which one PostgreSQL uses most, and drop the other.
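
One way to check which index gets used most is the statistics view pg_stat_user_indexes. A minimal sketch, assuming both indexes exist on votes and that statistics collection is enabled (the default):

```sql
-- idx_scan is the number of index scans initiated on each index
-- since the statistics were last reset.
SELECT indexrelname, idx_scan
FROM   pg_stat_user_indexes
WHERE  relname = 'votes'
ORDER  BY idx_scan DESC;
```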

Answered by Mike

I would expect the first query to work faster as this is a single query and it's more readable (handy in case you'd have to get back to this after some time).

The second query consists of two queries. You only get a result as if it were a single query.

That said, to be absolutely sure which of these works better for you, I would populate both tables with lots of dummy data and check the query execution time.
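
Such a test could look roughly like this. It is a sketch under assumptions: the question never lists the columns of foo, so the name column and the 100 rows here are hypothetical, and a votes table filled with test data (with foo_id values 1 to 100) is assumed to exist already.

```sql
-- Hypothetical foo table: the question does not give its columns.
CREATE TABLE foo (foo_id integer PRIMARY KEY, name text);
INSERT INTO foo
SELECT g, 'foo ' || g
FROM   generate_series(1, 100) AS g;

-- Refresh planner statistics after bulk loading, then time both forms.
ANALYZE foo;
ANALYZE votes;

EXPLAIN ANALYZE
SELECT f.foo_id,
       (SELECT sum(vote) FROM votes v WHERE v.foo_id = f.foo_id) AS score
FROM   foo f;

EXPLAIN ANALYZE
SELECT f.foo_id,
       (SELECT count(*) FROM votes v WHERE v.foo_id = f.foo_id AND v.vote = 1)
     - (SELECT count(*) FROM votes v WHERE v.foo_id = f.foo_id AND v.vote = -1) AS score
FROM   foo f;
```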
