PostgreSQL: table1 LEFT JOIN table2 with only one row per ID in table1

Question

Asked by Peck

Ok, so the title is a bit convoluted. This is basically a greatest-n-per-group type problem, but I can't for the life of me figure it out.

I have a table, user_stats:

      Column      |  Type   |                        Modifiers
------------------+---------+---------------------------------------------------------
 id               | bigint  | not null default nextval('user_stats_id_seq'::regclass)
 user_id          | bigint  | not null
 datestamp        | integer | not null
 post_count       | integer | 
 friends_count    | integer | 
 favourites_count | integer |  
Indexes:
    "user_stats_pk" PRIMARY KEY, btree (id)
    "user_stats_datestamp_index" btree (datestamp)
    "user_stats_user_id_index" btree (user_id)
Foreign-key constraints:
    "user_user_stats_fk" FOREIGN KEY (user_id) REFERENCES user_info(id)

I want to get the stats for each id by latest datestamp. This is a biggish table, somewhere in the neighborhood of 41m rows, so I've created a temp table of user_id, last_date using:

CREATE TEMP TABLE id_max_date AS
    (SELECT user_id, MAX(datestamp) AS date FROM user_stats GROUP BY user_id);

The problem is that datestamp isn't unique, since there can be more than one stat update in a day (it should have been a real timestamp, but the guy who designed this was kind of an idiot and there's too much data to go back at the moment). So some IDs have multiple rows when I do the JOIN:

SELECT user_stats.user_id, user_stats.datestamp, user_stats.post_count,
       user_stats.friends_count, user_stats.favorites_count
  FROM id_max_date JOIN user_stats
    ON id_max_date.user_id=user_stats.user_id AND date=datestamp;

If I was doing this as subselects I guess I could LIMIT 1, but I've always heard those are horribly inefficient. Thoughts?

Answer 1

Answered by rfusca

DISTINCT ON is your friend.

-- The DISTINCT ON expressions must lead the ORDER BY list, so user_id comes first
select distinct on (user_id) * from user_stats order by user_id, datestamp desc;

Answer 2

Answered by Bill Karwin

Basically you need to decide how to resolve ties, and you need some other column besides datestamp which is guaranteed to be unique (at least over a given user) so it can be used as the tiebreaker. If nothing else, you can use the id primary key column.
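
As a rough sketch of the tiebreaker idea, and assuming that a larger id means a later insert (the question does not say so), DISTINCT ON can sort by id within each user's latest datestamp:

```sql
-- Assumption: id grows with insertion order, so among rows that share
-- a user's latest datestamp the largest id is the most recent update.
SELECT DISTINCT ON (user_id) *
  FROM user_stats
 ORDER BY user_id, datestamp DESC, id DESC;
```

Because id is the primary key, the sort order within each user is fully determined, so the same row is returned on every run.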

Another solution, if you're using PostgreSQL 8.4 or later, is window functions:

WITH numbered_user_stats AS (
    -- Number each user's rows from newest to oldest; id breaks ties on datestamp
    SELECT *, ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY datestamp DESC, id DESC) AS RowNum
    FROM user_stats
) SELECT u.user_id, u.datestamp, u.post_count, u.friends_count, u.favorites_count
FROM numbered_user_stats AS u
WHERE u.RowNum = 1;

Answer 3

Answered by Jonathan Leffler

Using the existing infrastructure, you can use:

SELECT u.user_id, u.datestamp,
       MAX(u.post_count)      AS post_count,
       MAX(u.friends_count)   AS friends_count,
       MAX(u.favorites_count) AS favorites_count
  FROM id_max_date AS m JOIN user_stats AS u
    ON m.user_id = u.user_id AND m.date = u.datestamp
 GROUP BY u.user_id, u.datestamp;

This gives you a single value for each of the 'not necessarily unique' columns. However, it does not absolutely guarantee that the three maxima all appeared in the same row (though there is at least a moderate chance that they will, and that they will all come from the last of the entries created on the given day).
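
If all three values must come from the same row, one possible variant (a sketch, not part of the original answer) keeps the id_max_date table but picks a single row per user by its primary key. It assumes that, among rows sharing a user's latest datestamp, the largest id is the most recent:

```sql
-- Assumption: the largest id among a user's latest-datestamp rows is the
-- most recent update. id is the primary key, so one row per user survives.
SELECT u.*
  FROM user_stats AS u
 WHERE u.id IN (SELECT MAX(u2.id)
                  FROM id_max_date AS m
                  JOIN user_stats AS u2
                    ON m.user_id = u2.user_id AND m.date = u2.datestamp
                 GROUP BY u2.user_id);
```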

For this query, the index on date stamp alone is no help; an index on user ID and date stamp could speed this query up considerably - or, perhaps more accurately, it could speed up the query that generates the id_max_date table.
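
A minimal sketch of such a composite index follows. The index name is hypothetical, chosen only to match the naming of the existing indexes:

```sql
-- Hypothetical index name; the (user_id, datestamp) column pair is what matters.
CREATE INDEX user_stats_user_id_datestamp_index
    ON user_stats (user_id, datestamp);
```

On a table of this size the build may take a while. PostgreSQL also offers CREATE INDEX CONCURRENTLY, which avoids blocking writes during the build at the cost of a slower build.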

Clearly, you can also write the id_max_date expression as a sub-query in the FROM clause:

SELECT u.user_id, u.datestamp,
       MAX(u.post_count)      AS post_count,
       MAX(u.friends_count)   AS friends_count,
       MAX(u.favorites_count) AS favorites_count
  FROM (SELECT u2.user_id, MAX(u2.datestamp) AS date
          FROM user_stats AS u2
         GROUP BY u2.user_id) AS m
  JOIN user_stats AS u ON m.user_id = u.user_id AND m.date = u.datestamp
 GROUP BY u.user_id, u.datestamp;
