Original question: http://stackoverflow.com/questions/3305709/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Postgres, table1 left join table2 with only 1 row per ID in table1
Asked by Peck
Ok, so the title is a bit convoluted. This is basically a greatest-n-per-group type problem, but I can't for the life of me figure it out.
I have a table, user_stats:
      Column      |  Type   |                        Modifiers
------------------+---------+---------------------------------------------------------
 id               | bigint  | not null default nextval('user_stats_id_seq'::regclass)
 user_id          | bigint  | not null
 datestamp        | integer | not null
 post_count       | integer |
 friends_count    | integer |
 favourites_count | integer |
Indexes:
    "user_stats_pk" PRIMARY KEY, btree (id)
    "user_stats_datestamp_index" btree (datestamp)
    "user_stats_user_id_index" btree (user_id)
Foreign-key constraints:
    "user_user_stats_fk" FOREIGN KEY (user_id) REFERENCES user_info(id)
I want to get the stats for each id by latest datestamp. This is a biggish table, somewhere in the neighborhood of 41m rows, so I've created a temp table of user_id, last_date using:
CREATE TEMP TABLE id_max_date AS
(SELECT user_id, MAX(datestamp) AS date FROM user_stats GROUP BY user_id);
The problem is that datestamp isn't unique, since there can be more than 1 stat update in a day (it should have been a real timestamp, but the guy who designed this was kind of an idiot and there's too much data to go back at the moment). So some IDs have multiple rows when I do the JOIN:
SELECT user_stats.user_id, user_stats.datestamp, user_stats.post_count,
user_stats.friends_count, user_stats.favorites_count
FROM id_max_date JOIN user_stats
ON id_max_date.user_id=user_stats.user_id AND date=datestamp;
If I was doing this as subselects I guess I could LIMIT 1, but I've always heard those are horribly inefficient. Thoughts?
Answered by rfusca
DISTINCT ON is your friend.
select distinct on (user_id) * from user_stats order by user_id, datestamp desc;
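DISTINCT ON keeps the first row of each group in ORDER BY order, and PostgreSQL requires the ORDER BY list to start with the DISTINCT ON expressions. A minimal sketch, assuming that a higher id means a later insert, so that id can break ties within a day:

```sql
-- One row per user_id: the latest datestamp, and among rows sharing
-- that datestamp the one with the highest id (assumed to be the newest).
SELECT DISTINCT ON (user_id) *
FROM user_stats
ORDER BY user_id, datestamp DESC, id DESC;
```

Without the id tiebreaker, which of the tied rows comes back is not defined.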
Answered by Bill Karwin
Basically you need to decide how to resolve ties, and you need some other column besides datestamp which is guaranteed to be unique (at least over a given user) so it can be used as the tiebreaker. If nothing else, you can use the id primary key column.
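As a rough sketch of the id tiebreaker, reusing the id_max_date temp table from the question (this assumes the row wanted is the one with the highest id among a user's rows for the latest datestamp):

```sql
-- Inner query: for each user, pick the highest id among the rows
-- that carry that user's latest datestamp.
-- Outer query: fetch the full row for each chosen id.
SELECT u.*
FROM user_stats AS u
JOIN (
    SELECT s.user_id, MAX(s.id) AS id
    FROM id_max_date AS m
    JOIN user_stats AS s
      ON s.user_id = m.user_id AND s.datestamp = m.date
    GROUP BY s.user_id
) AS latest ON latest.id = u.id;
```

Because id is the primary key, the outer join matches exactly one row per user.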
Another solution, if you're using PostgreSQL 8.4 or later, is window functions:
WITH numbered_user_stats AS (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY datestamp DESC) AS RowNum
  FROM user_stats
) SELECT u.user_id, u.datestamp, u.post_count, u.friends_count, u.favorites_count
FROM numbered_user_stats AS u
WHERE u.RowNum = 1;
Answered by Jonathan Leffler
Using the existing infrastructure, you can use:
SELECT u.user_id, u.datestamp,
MAX(u.post_count) AS post_count,
MAX(u.friends_count) AS friends_count,
MAX(u.favorites_count) AS favorites_count
FROM id_max_date AS m JOIN user_stats AS u
ON m.user_id = u.user_id AND m.date = u.datestamp
GROUP BY u.user_id, u.datestamp;
This gives you a single value for each of the 'not necessarily unique' columns. However, it does not absolutely guarantee that the three maxima all appeared in the same row (though there is at least a moderate chance that they will, and that they will all come from the last of the entries created on the given day).
For this query, the index on date stamp alone is no help; an index on user ID and date stamp could speed this query up considerably - or, perhaps more accurately, it could speed up the query that generates the id_max_date table.
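A composite index of that kind might look like the following. The index name is made up for illustration, and whether the planner actually uses it would need checking with EXPLAIN:

```sql
-- Hypothetical index name; user_id is the leading key column so rows
-- for one user sit together, with datestamp as the second key column.
CREATE INDEX user_stats_user_id_datestamp_index
    ON user_stats (user_id, datestamp);
```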
Clearly, you can also write the id_max_date expression as a sub-query in the FROM clause:
SELECT u.user_id, u.datestamp,
MAX(u.post_count) AS post_count,
MAX(u.friends_count) AS friends_count,
MAX(u.favorites_count) AS favorites_count
FROM (SELECT u2.user_id, MAX(u2.datestamp) AS date
FROM user_stats AS u2
GROUP BY u2.user_id) AS m
JOIN user_stats AS u ON m.user_id = u.user_id AND m.date = u.datestamp
GROUP BY u.user_id, u.datestamp;