SQL PostgreSQL - fetch the row which has the Max value for a column
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license, note the original URL, and attribute it to the original authors (not me):
StackOverflow
Original URL: http://stackoverflow.com/questions/586781/
PostgreSQL - fetch the row which has the Max value for a column
Asked by Joshua Berry
I'm dealing with a Postgres table (called "lives") that contains records with columns for time_stamp, usr_id, transaction_id, and lives_remaining. I need a query that will give me the most recent lives_remaining total for each usr_id.
- There are multiple users (distinct usr_id's)
- time_stamp is not a unique identifier: sometimes user events (one per row in the table) will occur with the same time_stamp.
- trans_id is unique only for very small time ranges: over time it repeats
- lives_remaining (for a given user) can both increase and decrease over time
example:
time_stamp|lives_remaining|usr_id|trans_id
-----------------------------------------
07:00  |       1        |   1  |    1
09:00  |       4        |   2  |    2
10:00  |       2        |   3  |    3
10:00  |       1        |   2  |    4
11:00  |       4        |   1  |    5
11:00  |       3        |   1  |    6
13:00  |       3        |   3  |    1
As I will need to access other columns of the row with the latest data for each given usr_id, I need a query that gives a result like this:
time_stamp|lives_remaining|usr_id|trans_id
-----------------------------------------
11:00  |       3        |   1  |    6
10:00  |       1        |   2  |    4
13:00  |       3        |   3  |    1
As mentioned, each usr_id can gain or lose lives, and sometimes these timestamped events occur so close together that they have the same timestamp! Therefore this query won't work:
SELECT b.time_stamp,b.lives_remaining,b.usr_id,b.trans_id FROM
(SELECT usr_id, max(time_stamp) AS max_timestamp
FROM lives GROUP BY usr_id ORDER BY usr_id) a
JOIN lives b ON a.max_timestamp = b.time_stamp
Instead, I need to use both time_stamp (first) and trans_id (second) to identify the correct row. I also then need to pass that information from the subquery to the main query that will provide the data for the other columns of the appropriate rows. This is the hacked up query that I've gotten to work:
SELECT b.time_stamp,b.lives_remaining,b.usr_id,b.trans_id FROM
(SELECT usr_id, max(time_stamp || '*' || trans_id)
AS max_timestamp_transid
FROM lives GROUP BY usr_id ORDER BY usr_id) a
JOIN lives b ON a.max_timestamp_transid = b.time_stamp || '*' || b.trans_id
ORDER BY b.usr_id
Okay, so this works, but I don't like it. It requires a query within a query and a self join, and it seems to me that it could be much simpler by grabbing the row that MAX found to have the largest timestamp and trans_id. The table "lives" has tens of millions of rows to parse, so I'd like this query to be as fast and efficient as possible. I'm new to RDBMSs, and to Postgres in particular, so I know that I need to make effective use of the proper indexes. I'm a bit lost on how to optimize.
I found a similar discussion here. Can I perform some type of Postgres equivalent to an Oracle analytic function?
Any advice on accessing related column information used by an aggregate function (like MAX), creating indexes, and creating better queries would be much appreciated!
P.S. You can use the following to create my example case:
create TABLE lives (time_stamp timestamp, lives_remaining integer,
usr_id integer, trans_id integer);
insert into lives values ('2000-01-01 07:00', 1, 1, 1);
insert into lives values ('2000-01-01 09:00', 4, 2, 2);
insert into lives values ('2000-01-01 10:00', 2, 3, 3);
insert into lives values ('2000-01-01 10:00', 1, 2, 4);
insert into lives values ('2000-01-01 11:00', 4, 1, 5);
insert into lives values ('2000-01-01 11:00', 3, 1, 6);
insert into lives values ('2000-01-01 13:00', 3, 3, 1);
Accepted answer by vladr
The timings below were measured on a table with 158k pseudo-random rows (usr_id uniformly distributed between 0 and 10k, trans_id uniformly distributed between 0 and 30).
By query cost, below, I am referring to Postgres' cost-based optimizer's cost estimate (with Postgres' default xxx_cost values), which is a weighted estimate of the required I/O and CPU resources; you can obtain this by firing up pgAdminIII and running "Query/Explain (F7)" on the query with "Query/Explain options" set to "Analyze".
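If you prefer psql to pgAdminIII, EXPLAIN ANALYZE prints both the planner's cost estimate and the actual execution time. A minimal sketch, using a simple aggregate (substitute any of the queries below):

EXPLAIN ANALYZE
SELECT usr_id, MAX(time_stamp)
FROM lives
GROUP BY usr_id;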
- Quassnoi's query has a cost estimate of 745k (!), and completes in 1.3 seconds (given a compound index on (usr_id, trans_id, time_stamp); creation statements for the indexes are sketched below)
- Bill's query has a cost estimate of 93k, and completes in 2.9 seconds (given a compound index on (usr_id, trans_id))
- Query #1 below has a cost estimate of 16k, and completes in 800ms (given a compound index on (usr_id, trans_id, time_stamp))
- Query #2 below has a cost estimate of 14k, and completes in 800ms (given a compound function index on (usr_id, EXTRACT(EPOCH FROM time_stamp), trans_id)) - this is Postgres-specific
- Query #3 below (Postgres 8.4+) has a cost estimate and completion time comparable to (or better than) Query #2 (given a compound index on (usr_id, time_stamp, trans_id)); it has the advantage of scanning the lives table only once and, should you temporarily increase work_mem (if needed) to accommodate the sort in memory, it will be by far the fastest of all the queries.
All times above include retrieval of the full 10k rows result-set.
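As a sketch, the compound indexes referenced in the list above could be created as follows (the index names are illustrative; note the extra parentheses required around the expression in the functional index):

CREATE INDEX lives_usr_trans_ts_idx ON lives (usr_id, trans_id, time_stamp);
CREATE INDEX lives_usr_epoch_trans_idx
    ON lives (usr_id, (EXTRACT(EPOCH FROM time_stamp)), trans_id);  -- functional index, for Query #2
CREATE INDEX lives_usr_ts_trans_idx ON lives (usr_id, time_stamp, trans_id);  -- for Query #3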
Your goal is minimal cost estimate and minimal query execution time, with an emphasis on estimated cost. Query execution time can depend significantly on runtime conditions (e.g. whether relevant rows are already fully cached in memory or not), whereas the cost estimate does not. On the other hand, keep in mind that the cost estimate is exactly that, an estimate.
The best query execution time is obtained when running on a dedicated database without load (e.g. playing with pgAdminIII on a development PC). Query time will vary in production based on actual machine load and data-access spread. When one query appears slightly faster (<20%) than another but has a much higher cost, it will generally be wiser to choose the one with the higher execution time but lower cost.
When you expect that there will be no competition for memory on your production machine at the time the query is run (e.g. the RDBMS cache and filesystem cache won't be thrashed by concurrent queries and/or filesystem activity), then the query time you obtained in standalone mode (e.g. pgAdminIII on a development PC) will be representative. If there is contention on the production system, query time will degrade proportionally to the estimated cost ratio, as the query with the lower cost does not rely as much on cache, whereas the query with the higher cost will revisit the same data over and over (triggering additional I/O in the absence of a stable cache), e.g.:
cost              | time (dedicated machine) | time (under load)     |
------------------+--------------------------+-----------------------+
some query A:  5k | (all data cached) 900ms  | (less i/o) 1000ms     |
some query B: 50k | (all data cached) 900ms  | (lots of i/o) 10000ms |
Do not forget to run ANALYZE lives once after creating the necessary indices.
Query #1
-- incrementally narrow down the result set via inner joins
-- the CBO may elect to perform one full index scan combined
-- with cascading index lookups, or as hash aggregates terminated
-- by one nested index lookup into lives - on my machine
-- the latter query plan was selected given my memory settings and
-- histogram
SELECT
l1.*
FROM
lives AS l1
INNER JOIN (
SELECT
usr_id,
MAX(time_stamp) AS time_stamp_max
FROM
lives
GROUP BY
usr_id
) AS l2
ON
l1.usr_id = l2.usr_id AND
l1.time_stamp = l2.time_stamp_max
INNER JOIN (
SELECT
usr_id,
time_stamp,
MAX(trans_id) AS trans_max
FROM
lives
GROUP BY
usr_id, time_stamp
) AS l3
ON
l1.usr_id = l3.usr_id AND
l1.time_stamp = l3.time_stamp AND
l1.trans_id = l3.trans_max
Query #2
-- cheat to obtain a max of the (time_stamp, trans_id) tuple in one pass
-- this results in a single table scan and one nested index lookup into lives,
-- by far the least I/O intensive operation even in case of great scarcity
-- of memory (least reliant on cache for the best performance)
SELECT
l1.*
FROM
lives AS l1
INNER JOIN (
SELECT
usr_id,
MAX(ARRAY[EXTRACT(EPOCH FROM time_stamp),trans_id])
AS compound_time_stamp
FROM
lives
GROUP BY
usr_id
) AS l2
ON
l1.usr_id = l2.usr_id AND
EXTRACT(EPOCH FROM l1.time_stamp) = l2.compound_time_stamp[1] AND
l1.trans_id = l2.compound_time_stamp[2]
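The trick works because PostgreSQL compares arrays element by element, so taking MAX over ARRAY[epoch, trans_id] orders rows exactly as ordering by the two columns would. A quick sanity check, as a sketch:

-- true: arrays are compared element-wise (1 = 1, then 30 > 2)
SELECT ARRAY[1, 30] > ARRAY[1, 2];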
2013/01/29 update
Finally, as of version 8.4, Postgres supports window functions, meaning you can write something as simple and efficient as:
Query #3
-- use Window Functions
-- performs a SINGLE scan of the table
SELECT DISTINCT ON (usr_id)
last_value(time_stamp) OVER wnd,
last_value(lives_remaining) OVER wnd,
usr_id,
last_value(trans_id) OVER wnd
FROM lives
WINDOW wnd AS (
PARTITION BY usr_id ORDER BY time_stamp, trans_id
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
);
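Should that sort spill to disk, work_mem can be raised for the session before running the query; a sketch, where the 256MB figure is an assumption to be sized against your own data:

SET work_mem = '256MB';  -- session-local; reverts when the connection closes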
Answered by Bill Karwin
Here's another method, which happens to use no correlated subqueries or GROUP BY. I'm not an expert in PostgreSQL performance tuning, so I suggest you try both this and the solutions given by other folks to see which works better for you.
SELECT l1.*
FROM lives l1 LEFT OUTER JOIN lives l2
ON (l1.usr_id = l2.usr_id AND (l1.time_stamp < l2.time_stamp
OR (l1.time_stamp = l2.time_stamp AND l1.trans_id < l2.trans_id)))
WHERE l2.usr_id IS NULL
ORDER BY l1.usr_id;
I am assuming that trans_id is unique, at least over any given value of time_stamp.
Answered by j_random_hacker
I like the style of Mike Woodhouse's answer on the other page you mentioned. It's especially concise when the thing being maximised over is just a single column, in which case the subquery can just use MAX(some_col) and GROUP BY the other columns. In your case you have a 2-part quantity to be maximised, but you can still do it by using ORDER BY plus LIMIT 1 instead (as done by Quassnoi):
SELECT *
FROM lives o
WHERE (usr_id, time_stamp, trans_id) IN (
    SELECT usr_id, time_stamp, trans_id
    FROM lives sq
    WHERE sq.usr_id = o.usr_id
    -- descending order so LIMIT 1 picks the newest row, with trans_id breaking ties
    ORDER BY time_stamp DESC, trans_id DESC
    LIMIT 1
)
I find using the row-constructor syntax WHERE (a, b, c) IN (subquery) nice because it cuts down on the amount of verbiage needed.
Answered by burak emre
Actually there's a hacky solution for this problem. Let's say you want to select the biggest tree of each forest in a region.
-- assumes each tree row carries a numeric size column; DESC puts the biggest tree first
SELECT (array_agg(tree.id ORDER BY tree.size DESC))[1]
FROM tree JOIN forest ON (tree.forest = forest.id)
GROUP BY forest.id
When you group trees by forest, there will be an unsorted list of trees and you need to find the biggest one. The first thing you should do is sort the rows by their sizes and select the first one in the list. It may seem inefficient, but if you have millions of rows it will be quite a bit faster than the solutions that include JOINs and WHERE conditions.
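Mapped back to the lives table from the question, the same trick might look like the sketch below; note that it returns only the winning trans_id per user, so the remaining columns would have to be aggregated the same way:

SELECT usr_id,
       (array_agg(trans_id ORDER BY time_stamp DESC, trans_id DESC))[1] AS trans_id
FROM lives
GROUP BY usr_id;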
BTW, note that ORDER BY for array_agg was introduced in PostgreSQL 9.0.
Answered by Eden
There is an option in PostgreSQL called DISTINCT ON:
SELECT DISTINCT ON (location) location, time, report
FROM weather_reports
ORDER BY location, time DESC;
It eliminates duplicate rows, leaving only the first row in each group as defined by the ORDER BY clause.

See the official documentation.
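Applied to the lives table from the question, that becomes (a sketch):

SELECT DISTINCT ON (usr_id) time_stamp, lives_remaining, usr_id, trans_id
FROM lives
ORDER BY usr_id, time_stamp DESC, trans_id DESC;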
Answered by Quassnoi
SELECT l.*
FROM (
SELECT DISTINCT usr_id
FROM lives
) lo, lives l
WHERE l.ctid = (
SELECT ctid
FROM lives li
WHERE li.usr_id = lo.usr_id
ORDER BY
time_stamp DESC, trans_id DESC
LIMIT 1
)
Creating an index on (usr_id, time_stamp, trans_id) will greatly improve this query.
You should always, always have some kind of PRIMARY KEY in your tables.
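For instance, a surrogate key could be added along these lines (a sketch; the column name is illustrative and assumes a generated key is acceptable):

ALTER TABLE lives ADD COLUMN id bigserial PRIMARY KEY;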
Answered by Barry Brown
I think you've got one major problem here: there's no monotonically increasing "counter" to guarantee that a given row has happened later in time than another. Take this example:
timestamp lives_remaining user_id trans_id
10:00 4 3 5
10:00 5 3 6
10:00 3 3 1
10:00 2 3 2
You cannot determine from this data which is the most recent entry. Is it the second one or the last one? There is no sort or max() function you can apply to any of this data to give you the correct answer.
Increasing the resolution of the timestamp would be a huge help. Since the database engine serializes requests, with sufficient resolution you can guarantee that no two timestamps will be the same.
Alternatively, use a trans_id that won't roll over for a very, very long time. Having a trans_id that rolls over means you can't tell (for the same timestamp) whether trans_id 6 is more recent than trans_id 1 unless you do some complicated math.
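One way to get such a monotonically increasing counter is a sequence-backed column; a sketch, with illustrative names:

CREATE SEQUENCE lives_event_seq;
ALTER TABLE lives ADD COLUMN event_seq bigint NOT NULL DEFAULT nextval('lives_event_seq');

New inserts then receive ever-increasing event_seq values, which break timestamp ties unambiguously.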
Answered by Turbcool
Another solution you might find useful:
SELECT t.*
FROM
    (SELECT
        *,
        -- trans_id DESC breaks ties between events sharing a time_stamp
        ROW_NUMBER() OVER (PARTITION BY usr_id ORDER BY time_stamp DESC, trans_id DESC) AS r
    FROM lives) AS t
WHERE t.r = 1