MySQL order by before group by

Notice: this page reproduces a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me), citing the original address: http://stackoverflow.com/questions/14770671/

mysql, group-by, sql-order-by

Asked by Rob Forrest

There are plenty of similar questions to be found on here but I don't think that any answer the question adequately.

I'll continue from the current most popular question and use their example, if that's alright.

The task in this instance is to get the latest post for each author in the database.

The example query produces unusable results, as it's not always the latest post that is returned.

SELECT wp_posts.* FROM wp_posts
    WHERE wp_posts.post_status='publish'
    AND wp_posts.post_type='post'
    GROUP BY wp_posts.post_author           
    ORDER BY wp_posts.post_date DESC

The current accepted answer is

SELECT
    wp_posts.*
FROM wp_posts
WHERE
    wp_posts.post_status='publish'
    AND wp_posts.post_type='post'
GROUP BY wp_posts.post_author
HAVING wp_posts.post_date = MAX(wp_posts.post_date) -- only the last post for each author
ORDER BY wp_posts.post_date DESC

Unfortunately this answer is plain and simple wrong, and in many cases produces less stable results than the original query.

My best solution is to use a subquery of the form

SELECT wp_posts.* FROM 
(
    SELECT * 
    FROM wp_posts
    ORDER BY wp_posts.post_date DESC
) AS wp_posts
WHERE wp_posts.post_status='publish'
AND wp_posts.post_type='post'
GROUP BY wp_posts.post_author 

My question is a simple one then: is there any way to order rows before grouping without resorting to a subquery?

Edit: This question was a continuation from another question and the specifics of my situation are slightly different. You can (and should) assume that there is also a wp_posts.id that is a unique identifier for that particular post.

Answer by Taryn

Using an ORDER BY in a subquery is not the best solution to this problem.

The best solution to get the max(post_date) by author is to use a subquery to return the max date and then join that to your table on both the post_author and the max date.

The solution should be:

SELECT p1.* 
FROM wp_posts p1
INNER JOIN
(
    SELECT max(post_date) MaxPostDate, post_author
    FROM wp_posts
    WHERE post_status='publish'
       AND post_type='post'
    GROUP BY post_author
) p2
  ON p1.post_author = p2.post_author
  AND p1.post_date = p2.MaxPostDate
WHERE p1.post_status='publish'
  AND p1.post_type='post'
order by p1.post_date desc

If you have the following sample data:

CREATE TABLE wp_posts
    (`id` int, `title` varchar(6), `post_date` datetime, `post_author` varchar(3))
;

INSERT INTO wp_posts
    (`id`, `title`, `post_date`, `post_author`)
VALUES
    (1, 'Title1', '2013-01-01 00:00:00', 'Jim'),
    (2, 'Title2', '2013-02-01 00:00:00', 'Jim')
;

The subquery is going to return the max date and author of:

MaxPostDate | Author
2/1/2013    | Jim

Then, since you are joining that back to the table on both values, you will return the full details of that post.

See SQL Fiddle with Demo.

To expand on my comments about using a subquery to accurately return this data:

MySQL does not force you to GROUP BY every column that you include in the SELECT list. As a result, if you only GROUP BY one column but return 10 columns in total, there is no guarantee that the other column values come from the row you expect for the post_author that is returned. If a column is not in the GROUP BY, MySQL chooses which value is returned.

Using the subquery with the aggregate function will guarantee that the correct author and post are returned every time.

As a side note, while MySQL allows you to use an ORDER BY in a subquery and allows you to apply a GROUP BY to only some of the columns in the SELECT list, this behavior is not allowed in other databases, including SQL Server.
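
The relaxed GROUP BY handling described in this answer can be switched off per session, which makes the problem visible instead of silent. The following is a minimal sketch, assuming MySQL 5.7.5 or later (where the ONLY_FULL_GROUP_BY mode and the ANY_VALUE() function are available); the table name demo_posts and its contents are hypothetical and only mirror the sample data above.

```sql
-- Sketch only. Assumes MySQL 5.7.5 or later; demo_posts is a hypothetical table.
DROP TABLE IF EXISTS demo_posts;
CREATE TABLE demo_posts (
    id          INT PRIMARY KEY,
    title       VARCHAR(20),
    post_date   DATETIME,
    post_author VARCHAR(20)
);
INSERT INTO demo_posts (id, title, post_date, post_author) VALUES
    (1, 'Title1', '2013-01-01 00:00:00', 'Jim'),
    (2, 'Title2', '2013-02-01 00:00:00', 'Jim');

-- Make this session strict about GROUP BY
-- (note: this replaces any other SQL modes set for the session).
SET SESSION sql_mode = 'ONLY_FULL_GROUP_BY';

-- In strict mode the relaxed form below is expected to be rejected (error 1055),
-- because title is neither grouped nor aggregated:
--   SELECT title, post_author FROM demo_posts GROUP BY post_author;

-- ANY_VALUE() is accepted in strict mode and states openly
-- that the returned title is an arbitrary one from the group.
SELECT post_author,
       ANY_VALUE(title) AS some_title,
       MAX(post_date)   AS latest_post_date
FROM demo_posts
GROUP BY post_author;
```

The last query still does not promise that some_title belongs to the latest post; it only documents that the choice is arbitrary, which is why the join against the aggregated subquery remains the approach recommended here.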

Answer by fthiella

Your solution makes use of an extension to the GROUP BY clause that permits grouping by some fields (in this case, just post_author):

GROUP BY wp_posts.post_author

and select nonaggregated columns:

SELECT wp_posts.*

that are not listed in the group by clause, or that are not used in an aggregate function (MIN, MAX, COUNT, etc.).

Correct use of extension to GROUP BY clause

This is useful when all values of non-aggregated columns are equal for every row.

For example, suppose you have a table GardensFlowers (name of the garden, flower that grows in the garden):

INSERT INTO GardensFlowers VALUES
('Central Park',       'Magnolia'),
('Hyde Park',          'Tulip'),
('Gardens By The Bay', 'Peony'),
('Gardens By The Bay', 'Cherry Blossom');

and you want to extract all the flowers that grow in a garden where multiple flowers grow. Then you have to use a subquery; for example, you could use this:

SELECT GardensFlowers.*
FROM   GardensFlowers
WHERE  name IN (SELECT   name
                FROM     GardensFlowers
                GROUP BY name
                HAVING   COUNT(DISTINCT flower)>1);

If you need to extract all the flowers that are the only flowers in their garden instead, you could just change the HAVING condition to HAVING COUNT(DISTINCT flower)=1, but MySQL also allows you to use this:

SELECT   GardensFlowers.*
FROM     GardensFlowers
GROUP BY name
HAVING   COUNT(DISTINCT flower)=1;

no subquery, not standard SQL, but simpler.

Incorrect use of extension to GROUP BY clause

But what happens if you SELECT non-aggregated columns that are not equal for every row? Which value does MySQL choose for that column?

It looks like MySQL always chooses the FIRST value it encounters.

To make sure that the first value it encounters is exactly the value you want, you need to apply a GROUP BY to an ordered query, hence the need to use a subquery. You can't do it otherwise.

Given the assumption that MySQL always chooses the first row it encounters, you are correctly sorting the rows before the GROUP BY. But unfortunately, if you read the documentation carefully, you'll notice that this assumption is not true.

When selecting non-aggregated columns that are not always the same, MySql is free to choose any value, so the resulting value that it actually shows is indeterminate.

I see that this trick to get the first value of a non-aggregated column is used a lot, and it usually (almost always) works; I use it as well sometimes (at my own risk). But since it's not documented, you can't rely on this behaviour.

This link (thanks ypercube!), GROUP BY trick has been optimized away, shows a situation in which the same query returns different results between MySQL and MariaDB, probably because of a different optimization engine.

So, if this trick works, it's just a matter of luck.

The accepted answer on the other question looks wrong to me:

HAVING wp_posts.post_date = MAX(wp_posts.post_date)

wp_posts.post_date is a non-aggregated column, and its value is officially undetermined, but it will likely be the first post_date encountered. But since the GROUP BY trick is applied to an unordered table, it is not certain which post_date is encountered first.

It will probably return posts that are the only posts of a single author, but even this is not always certain.

A possible solution

I think that this could be a possible solution:

SELECT wp_posts.*
FROM   wp_posts
WHERE  id IN (
  SELECT max(id)
  FROM wp_posts
  WHERE (post_author, post_date) IN (
    SELECT   post_author, max(post_date)
    FROM     wp_posts
    WHERE    wp_posts.post_status='publish'
             AND wp_posts.post_type='post'
    GROUP BY post_author
  ) AND wp_posts.post_status='publish'
    AND wp_posts.post_type='post'
  GROUP BY post_author
)

In the inner query I'm returning the maximum post date for every author. I'm then taking into consideration the fact that the same author could theoretically have two posts at the same time, so I'm getting only the maximum ID. And then I'm returning all rows that have those maximum IDs. It could be made faster by using joins instead of the IN clause.
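
The remark about using joins instead of IN could look roughly like the sketch below. It keeps the same three steps (latest date per author, highest ID among the rows with that date, then the full rows). The table demo_posts and its rows are hypothetical stand-ins for wp_posts, and whether this is actually faster depends on the MySQL version and the available indexes.

```sql
-- Sketch only; demo_posts is a hypothetical stand-in for wp_posts.
DROP TABLE IF EXISTS demo_posts;
CREATE TABLE demo_posts (
    id          INT PRIMARY KEY,
    post_author VARCHAR(20),
    post_date   DATETIME,
    post_status VARCHAR(20),
    post_type   VARCHAR(20)
);
INSERT INTO demo_posts VALUES
    (1, 'Jim', '2013-01-01 00:00:00', 'publish', 'post'),
    (2, 'Jim', '2013-02-01 00:00:00', 'publish', 'post'),
    (3, 'Jim', '2013-02-01 00:00:00', 'publish', 'post'),  -- same date as id 2
    (4, 'Ann', '2013-01-15 00:00:00', 'publish', 'post');

SELECT p.*
FROM demo_posts p
INNER JOIN (
    -- Step 2: highest id among the rows that carry each author's latest date
    SELECT MAX(p1.id) AS max_id
    FROM demo_posts p1
    INNER JOIN (
        -- Step 1: latest published post date per author
        SELECT post_author, MAX(post_date) AS max_date
        FROM demo_posts
        WHERE post_status = 'publish'
          AND post_type = 'post'
        GROUP BY post_author
    ) latest
        ON  latest.post_author = p1.post_author
        AND latest.max_date    = p1.post_date
    WHERE p1.post_status = 'publish'
      AND p1.post_type = 'post'
    GROUP BY p1.post_author
) pick
    ON pick.max_id = p.id;  -- Step 3: full rows; here the posts with id 3 and id 4
```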

(If you're sure that ID is only increasing, and if ID1 > ID2 also means that post_date1 > post_date2, then the query could be made much simpler, but I'm not sure if this is the case.)
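
Under that assumption (a larger ID always means a later post_date), the simpler query might be sketched as follows. The table demo_posts and its rows are hypothetical; the query gives wrong answers if posts can be back-dated or inserted out of order.

```sql
-- Sketch only; valid only if a higher id always means a later post_date.
DROP TABLE IF EXISTS demo_posts;
CREATE TABLE demo_posts (
    id          INT PRIMARY KEY,
    post_author VARCHAR(20),
    post_date   DATETIME,
    post_status VARCHAR(20),
    post_type   VARCHAR(20)
);
INSERT INTO demo_posts VALUES
    (1, 'Jim', '2013-01-01 00:00:00', 'publish', 'post'),
    (2, 'Jim', '2013-02-01 00:00:00', 'publish', 'post'),
    (3, 'Ann', '2013-01-15 00:00:00', 'publish', 'post');

-- The highest id per author is, by assumption, also the latest post.
SELECT p.*
FROM demo_posts p
WHERE p.id IN (
    SELECT MAX(id)
    FROM demo_posts
    WHERE post_status = 'publish'
      AND post_type = 'post'
    GROUP BY post_author
);  -- here: the posts with id 2 (Jim) and id 3 (Ann)
```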

Answer by newtover

What you are going to read is rather hacky, so don't try this at home!

In SQL in general the answer to your question is NO, but because of the relaxed mode of the GROUP BY (mentioned by @bluefeet), the answer is YES in MySQL.

Suppose you have a BTREE index on (post_status, post_type, post_author, post_date). What does the index look like under the hood?

(post_status='publish', post_type='post', post_author='user A', post_date='2012-12-01')
(post_status='publish', post_type='post', post_author='user A', post_date='2012-12-31')
(post_status='publish', post_type='post', post_author='user B', post_date='2012-10-01')
(post_status='publish', post_type='post', post_author='user B', post_date='2012-12-01')

That is, the data is sorted by all those fields in ascending order.
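
For readers who want to try this, such an index could be created as in the sketch below. The table demo_posts and the index name idx_status_type_author_date are hypothetical; the EXPLAIN at the end only shows how to check whether the optimizer picks the index, and its output depends on the MySQL version and the data, so none is shown.

```sql
-- Sketch only; demo_posts and idx_status_type_author_date are hypothetical names.
DROP TABLE IF EXISTS demo_posts;
CREATE TABLE demo_posts (
    id          INT PRIMARY KEY,
    post_author VARCHAR(20),
    post_date   DATETIME,
    post_status VARCHAR(20),
    post_type   VARCHAR(20)
);

-- Composite index: entries are ordered by post_status, then post_type,
-- then post_author, then post_date (InnoDB builds it as a B-tree by default).
CREATE INDEX idx_status_type_author_date
    ON demo_posts (post_status, post_type, post_author, post_date);

-- Inspect the plan for a grouped query that matches the index prefix.
EXPLAIN
SELECT post_author, MAX(post_date) AS latest_post_date
FROM demo_posts
WHERE post_status = 'publish'
  AND post_type = 'post'
GROUP BY post_author;
```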

When you are doing a GROUP BY, by default it sorts data by the grouping field (post_author, in our case; post_status and post_type are required by the WHERE clause), and if there is a matching index, it takes the data of the first record of each group in ascending order. That is, the query will fetch the following (the first post for each user):

(post_status='publish', post_type='post', post_author='user A', post_date='2012-12-01')
(post_status='publish', post_type='post', post_author='user B', post_date='2012-10-01')

But GROUP BY in MySQL allows you to specify the order explicitly. And when you request post_author in descending order, it will walk through our index in the opposite order, still taking the first record it meets for each group, which is actually the last one.

That is

...
WHERE wp_posts.post_status='publish' AND wp_posts.post_type='post'
GROUP BY wp_posts.post_author DESC

will give us

(post_status='publish', post_type='post', post_author='user B', post_date='2012-12-01')
(post_status='publish', post_type='post', post_author='user A', post_date='2012-12-31')

Now, when you order the results of the grouping by post_date, you get the data you wanted.

SELECT wp_posts.*
FROM wp_posts
WHERE wp_posts.post_status='publish' AND wp_posts.post_type='post'
GROUP BY wp_posts.post_author DESC
ORDER BY wp_posts.post_date DESC;

NB:

This is not what I would recommend for this particular query. In this case, I would use a slightly modified version of what @bluefeet suggests. But this technique might be very useful. Take a look at my answer here: Retrieving the last record in each group

Pitfalls: the disadvantages of the approach are that

  • the result of the query depends on the index, which is against the spirit of SQL (indexes should only speed up queries);
  • the index does not know anything about its influence on the query (you or someone else in the future might find the index too resource-consuming and change it somehow, breaking the query results, not only its performance);
  • if you do not understand how the query works, most probably you'll forget the explanation in a month and the query will confuse you and your colleagues.

The advantage is performance in hard cases. In this case, the performance of the query should be the same as in @bluefeet's query, because of the amount of data involved in sorting (all data is loaded into a temporary table and then sorted; btw, his query requires the (post_status, post_type, post_author, post_date) index as well).

What I would suggest:

As I said, those queries make MySQL waste time sorting potentially huge amounts of data in a temporary table. In case you need paging (that is, LIMIT is involved), most of the data is even thrown away. What I would do is minimize the amount of sorted data: that is, sort and limit a minimum of data in the subquery and then join back to the whole table.

SELECT * 
FROM wp_posts
INNER JOIN
(
  SELECT max(post_date) post_date, post_author
  FROM wp_posts
  WHERE post_status='publish' AND post_type='post'
  GROUP BY post_author
  ORDER BY post_date DESC
  -- LIMIT GOES HERE
) p2 USING (post_author, post_date)
WHERE post_status='publish' AND post_type='post';

The same query using the approach described above:

SELECT *
FROM (
  SELECT post_id
  FROM wp_posts
  WHERE post_status='publish' AND post_type='post'
  GROUP BY post_author DESC
  ORDER BY post_date DESC
  -- LIMIT GOES HERE
) as ids
JOIN wp_posts USING (post_id);

All those queries, with their execution plans, are on SQLFiddle.

Answer by sanchitkhanna26

Try this one. Just get the list of latest post dates from each author. That's it.

SELECT wp_posts.* FROM wp_posts WHERE wp_posts.post_status='publish'
AND wp_posts.post_type='post' AND wp_posts.post_date IN(SELECT MAX(wp_posts.post_date) FROM wp_posts GROUP BY wp_posts.post_author) 

Answer by Dennisch

No. It makes no sense to order the records before grouping, since grouping is going to mutate the result set. The subquery way is the preferred way. If this is too slow, you would have to change your table design, for example by storing the id of the last post for each author in a separate table, or by introducing a boolean column indicating, for each author, which of their posts is the last one.
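
The separate-table idea could be sketched as below. Everything here is hypothetical: the table name demo_author_latest_post, its columns, and the choice to maintain it with INSERT ... ON DUPLICATE KEY UPDATE whenever a post is published (from application code or a trigger). It is one possible shape of the design, not a tested recommendation.

```sql
-- Sketch only; demo_author_latest_post is a hypothetical summary table.
DROP TABLE IF EXISTS demo_author_latest_post;
CREATE TABLE demo_author_latest_post (
    post_author VARCHAR(20) PRIMARY KEY,  -- one row per author
    post_id     INT NOT NULL,
    post_date   DATETIME NOT NULL
);

-- Run once per published post. The row is replaced only by a newer post.
-- post_id is assigned before post_date on purpose: assignments are evaluated
-- left to right, so the comparison still sees the old post_date.
-- (VALUES() in this clause is deprecated in newer MySQL 8.0 releases in favour of row aliases.)
INSERT INTO demo_author_latest_post (post_author, post_id, post_date)
VALUES ('Jim', 1, '2013-01-01 00:00:00')
ON DUPLICATE KEY UPDATE
    post_id   = IF(VALUES(post_date) >= post_date, VALUES(post_id), post_id),
    post_date = GREATEST(post_date, VALUES(post_date));

INSERT INTO demo_author_latest_post (post_author, post_id, post_date)
VALUES ('Jim', 2, '2013-02-01 00:00:00')
ON DUPLICATE KEY UPDATE
    post_id   = IF(VALUES(post_date) >= post_date, VALUES(post_id), post_id),
    post_date = GREATEST(post_date, VALUES(post_date));

-- One row per author, pointing at the latest post: ('Jim', 2, '2013-02-01 00:00:00')
SELECT post_author, post_id, post_date
FROM demo_author_latest_post;
```

Fetching the full posts then becomes a plain join from this table to the posts table on the stored post id, with no grouping at query time. The cost moves to every write, and edits or deletions of posts would need their own handling.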

Answer by Konstantin XFlash Stratigenas

Just use the max function and group function

    select max(taskhistory.id) as id from taskhistory
            group by taskhistory.taskid
            order by taskhistory.datum desc

Answer by Strawberry

Just to recap, the standard solution uses an uncorrelated subquery and looks like this:

SELECT x.*
  FROM my_table x
  JOIN (SELECT grouping_criteria,MAX(ranking_criterion) max_n FROM my_table GROUP BY grouping_criteria) y
    ON y.grouping_criteria = x.grouping_criteria
   AND y.max_n = x.ranking_criterion;

If you're using an ancient version of MySQL, or a fairly small data set, then you can use the following method:

SELECT x.*
  FROM my_table x
  LEFT
  JOIN my_table y
    ON y.joining_criteria = x.joining_criteria
   AND y.ranking_criteria > x.ranking_criteria  -- keep x only when no row in its group ranks higher
 WHERE y.some_non_null_column IS NULL;  
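
On the other end of the version range, MySQL 8.0 and later support window functions, which give a third way to write the same thing. This is not part of the original answer; it is a sketch that swaps in ROW_NUMBER() for the join, using a hypothetical demo_posts table, with id as a tie-breaker when two posts share a date.

```sql
-- Sketch only. Assumes MySQL 8.0 or later; demo_posts is a hypothetical table.
DROP TABLE IF EXISTS demo_posts;
CREATE TABLE demo_posts (
    id          INT PRIMARY KEY,
    post_author VARCHAR(20),
    post_date   DATETIME,
    post_status VARCHAR(20),
    post_type   VARCHAR(20)
);
INSERT INTO demo_posts VALUES
    (1, 'Jim', '2013-01-01 00:00:00', 'publish', 'post'),
    (2, 'Jim', '2013-02-01 00:00:00', 'publish', 'post'),
    (3, 'Jim', '2013-02-01 00:00:00', 'publish', 'post'),  -- same date as id 2
    (4, 'Ann', '2013-01-15 00:00:00', 'publish', 'post');

-- Number each author's posts from newest to oldest, then keep number 1.
SELECT id, post_author, post_date
FROM (
    SELECT p.*,
           ROW_NUMBER() OVER (
               PARTITION BY post_author
               ORDER BY post_date DESC, id DESC  -- id breaks ties on equal dates
           ) AS rn
    FROM demo_posts p
    WHERE post_status = 'publish'
      AND post_type = 'post'
) ranked
WHERE rn = 1;  -- here: the posts with id 3 (Jim) and id 4 (Ann)
```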

Answer by guykaplan

** Sub queries may have a bad impact on performance when used with large datasets **

Original query

SELECT wp_posts.*
FROM   wp_posts
WHERE  wp_posts.post_status = 'publish'
       AND wp_posts.post_type = 'post'
GROUP  BY wp_posts.post_author
ORDER  BY wp_posts.post_date DESC; 

Modified query

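SELECT p.post_status,
       p.post_type,
       MAX(p.post_date) AS latest_post_date,  -- latest post date per author
       p.post_author
FROM   wp_posts p                             -- alias in one case only: table aliases can be case-sensitive
WHERE  p.post_status = 'publish'
       AND p.post_type = 'post'
GROUP  BY p.post_author
ORDER  BY latest_post_date;                   -- order by the aggregated column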

Because I'm using max in the select clause (max(p.post_date)), it is possible to avoid sub-select queries and to order by the max column after the group by.

Answer by Bruno Nardini

First, don't use * in the select; it affects performance and hinders the use of the group by and order by. Try this query:

SELECT wp_posts.post_author, wp_posts.post_date as pdate FROM wp_posts
WHERE wp_posts.post_status='publish'
AND wp_posts.post_type='post'
GROUP BY wp_posts.post_author           
ORDER BY pdate DESC

When you don't specify the table in the ORDER BY, just the alias, it will order by the result of the select.