SQL Group By Year, Month, Week, Day, Hour: SQL vs Procedural Performance
Notice: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/482912/
Asked by RSlaughter
I need to write a query that groups a large number of records by time period, from year down to hour.
My initial approach has been to decide the periods procedurally in C#, iterate through each one, and run the SQL to get the data for that period, building up the dataset as I go.
SELECT Sum(someValues)
FROM table1
WHERE deliveryDate BETWEEN @fromDate AND @toDate
I've subsequently discovered that I can group the records using Year(), Month(), Day(), datepart(week, date), and datepart(hh, date).
SELECT Sum(someValues)
FROM table1
GROUP BY Year(deliveryDate), Month(deliveryDate), Day(deliveryDate)
My concern is that using datepart in a GROUP BY will perform worse than running the query multiple times, once per period, because the index on the datetime field cannot be used as efficiently. Any thoughts as to whether this is true?
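The two approaches described above can be compared side by side. This is only a minimal sketch: it assumes an in-memory SQLite database standing in for SQL Server, ISO-8601 text dates, SQLite's strftime() in place of Year()/Month()/Day(), and a few made-up rows. It also uses a half-open range (>= start, < end) instead of BETWEEN, so that a row falling exactly on a boundary is not counted in two periods.

```python
import sqlite3
from datetime import date, timedelta

# Assumption: SQLite in memory stands in for the real table1.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (deliveryDate TEXT, someValues INTEGER)")
conn.execute("CREATE INDEX idx_delivery ON table1 (deliveryDate)")
conn.executemany(
    "INSERT INTO table1 VALUES (?, ?)",
    [
        ("2024-01-01 08:00:00", 10),
        ("2024-01-01 17:30:00", 5),
        ("2024-01-02 09:15:00", 7),
        ("2024-01-03 23:59:59", 1),
    ],
)


def sums_by_loop(first_day, days):
    """Approach 1: one range query per day, driven from client code."""
    totals = {}
    for offset in range(days):
        start = first_day + timedelta(days=offset)
        end = start + timedelta(days=1)
        (total,) = conn.execute(
            "SELECT SUM(someValues) FROM table1 "
            "WHERE deliveryDate >= ? AND deliveryDate < ?",
            (start.isoformat(), end.isoformat()),
        ).fetchone()
        if total is not None:  # skip days without any rows
            totals[start.isoformat()] = total
    return totals


def sums_by_group():
    """Approach 2: a single query that groups by calendar day."""
    rows = conn.execute(
        "SELECT strftime('%Y-%m-%d', deliveryDate) AS day, SUM(someValues) "
        "FROM table1 GROUP BY day ORDER BY day"
    ).fetchall()
    return dict(rows)


loop_totals = sums_by_loop(date(2024, 1, 1), 3)
group_totals = sums_by_group()
print(loop_totals)
print(group_totals)
```

Both functions return the same per-day totals; which one is faster on a real server is exactly the open question.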
Thanks.
Accepted answer by ShuggyCoUk
As with anything performance related: measure.
Checking the query plan for the second approach will tell you about any obvious problems in advance (such as a full table scan when you know one is not needed), but there is no substitute for measuring. In SQL performance testing, that measurement should be done with appropriately sized test data.
This is a complex case: you are not simply comparing two different ways to write a single query, but comparing a single-query approach against an iterative one. Aspects of your environment may therefore play a major role in the actual performance.
Specifically:
- The 'distance' between your application and the database, as the latency of each call is wasted time compared to the one-big-query approach
- Whether you are using prepared statements or not (not using them causes additional parsing effort for the database engine on each query)
- Whether the construction of the range queries itself is costly (heavily influenced by the previous point)
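The kind of measurement suggested above could be sketched as follows. Everything here is an assumption made for illustration: an in-memory SQLite database (so there is no network latency at all, which flatters the iterative approach), synthetic data of one row per minute for 30 days, and hour-sized periods. The loop uses one constant, parameterized SQL text, which lets the driver reuse the compiled statement much like a prepared statement.

```python
import sqlite3
import time
from datetime import datetime, timedelta

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (deliveryDate TEXT, someValues INTEGER)")
conn.execute("CREATE INDEX idx_delivery ON table1 (deliveryDate)")

# Synthetic test data (an assumption): one row per minute for 30 days.
start = datetime(2024, 1, 1)
hours = 30 * 24
rows = [
    ((start + timedelta(minutes=i)).isoformat(sep=" "), 1)
    for i in range(hours * 60)
]
conn.executemany("INSERT INTO table1 VALUES (?, ?)", rows)
conn.commit()


def per_hour_loop():
    """Iterative approach: one parameterized range query per hour."""
    totals, calls = {}, 0
    for h in range(hours):
        lo = start + timedelta(hours=h)
        hi = lo + timedelta(hours=1)
        (total,) = conn.execute(
            "SELECT SUM(someValues) FROM table1 "
            "WHERE deliveryDate >= ? AND deliveryDate < ?",
            (lo.isoformat(sep=" "), hi.isoformat(sep=" ")),
        ).fetchone()
        calls += 1  # each call would be a round trip on a real server
        totals[lo.strftime("%Y-%m-%d %H")] = total
    return totals, calls


def per_hour_group():
    """Single-query approach: the database groups by hour."""
    result = conn.execute(
        "SELECT strftime('%Y-%m-%d %H', deliveryDate) AS hour, "
        "SUM(someValues) FROM table1 GROUP BY hour"
    ).fetchall()
    return dict(result), 1


def timed(fn):
    """Run fn once and return (result, elapsed seconds)."""
    t0 = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - t0


(loop_totals, loop_calls), loop_secs = timed(per_hour_loop)
(group_totals, group_calls), group_secs = timed(per_hour_group)
print(f"loop:  {loop_calls} calls in {loop_secs:.4f}s")
print(f"group: {group_calls} call in {group_secs:.4f}s")
```

The timings printed here say nothing about SQL Server; the point is the shape of the harness. Against a real database, the per-call latency multiplies by the number of periods, so the same harness should be run from the actual application host with realistic data volumes.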
Answer by Galwegian
If you put a formula into the field part of a comparison, you get a table scan.
The index is on field, not on datepart(field), so the value must be calculated for ALL rows. So I think your hunch is right.
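The effect can be seen in a query plan. The sketch below uses SQLite's EXPLAIN QUERY PLAN as a stand-in for a SQL Server execution plan, and strftime() in place of datepart(); both substitutions are assumptions, and the optimizer of another engine may behave differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (deliveryDate TEXT, someValues INTEGER)")
conn.execute("CREATE INDEX idx_delivery ON table1 (deliveryDate)")


def plan(sql, params=()):
    """Return the human-readable detail column of EXPLAIN QUERY PLAN."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[-1] for row in rows]


# Predicate on the bare column: the index can serve a range search.
range_plan = plan(
    "SELECT SUM(someValues) FROM table1 "
    "WHERE deliveryDate >= ? AND deliveryDate < ?",
    ("2024-01-01", "2024-02-01"),
)

# Predicate on a function of the column: the index on the column
# itself no longer matches, so every row has to be evaluated.
wrapped_plan = plan(
    "SELECT SUM(someValues) FROM table1 "
    "WHERE strftime('%Y-%m', deliveryDate) = ?",
    ("2024-01",),
)

print(range_plan)
print(wrapped_plan)
```

The first plan mentions idx_delivery, while the second falls back to scanning the table.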
Answer by Mladen Prajdic
You could do something similar to this:
SELECT Sum(someValues)
FROM
(
    SELECT *, Year(deliveryDate) as Y, Month(deliveryDate) as M, Day(deliveryDate) as D
    FROM table1
    WHERE deliveryDate BETWEEN @fromDate AND @toDate
) t
GROUP BY Y, M, D
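A runnable variant of the same idea, for experimentation: filter on the bare column first so the index can narrow the rows, then compute the date parts only for what is left. It is a sketch under assumptions: SQLite stands in for SQL Server, strftime() replaces Year()/Month()/Day(), the range is half-open rather than BETWEEN, and Y, M, D are added to the outer select so each sum can be matched to its day.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (deliveryDate TEXT, someValues INTEGER)")
conn.execute("CREATE INDEX idx_delivery ON table1 (deliveryDate)")
conn.executemany(
    "INSERT INTO table1 VALUES (?, ?)",
    [
        ("2023-12-31 23:00:00", 100),  # before the range
        ("2024-01-01 08:00:00", 10),
        ("2024-01-01 17:30:00", 5),
        ("2024-01-15 09:15:00", 7),
        ("2024-02-01 00:00:00", 50),   # on the exclusive upper bound
    ],
)

query = """
SELECT Y, M, D, SUM(someValues)
FROM (
    -- The range filter touches only the indexed column.
    SELECT someValues,
           CAST(strftime('%Y', deliveryDate) AS INTEGER) AS Y,
           CAST(strftime('%m', deliveryDate) AS INTEGER) AS M,
           CAST(strftime('%d', deliveryDate) AS INTEGER) AS D
    FROM table1
    WHERE deliveryDate >= :from_date AND deliveryDate < :to_date
) AS t
GROUP BY Y, M, D
ORDER BY Y, M, D
"""

result = conn.execute(
    query, {"from_date": "2024-01-01", "to_date": "2024-02-01"}
).fetchall()
print(result)
```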
Answer by Walter Mitty
If you can tolerate the performance hit of joining one more table, I have a suggestion that seems odd but works really well.
Create a table, which I'll call ALMANAC, with columns like weekday, month, and year. You can even add columns for company-specific features of a date, such as whether the date is a company holiday. You might also want to add a starting and ending timestamp, as referenced below.
Although you might get by with one row per day, when I did this I found it convenient to go with one row per shift, with three shifts in a day. Even at that rate, a period of ten years came to only a little over 10,000 rows.
When you write the SQL to populate this table, you can make use of all the date-oriented built-in functions to make the job easier. When you write queries, you can use the date column as a join condition, or you may need the two timestamps to provide a range for catching timestamps that fall within it. The rest is as easy as working with any other kind of data.
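A small version of such a table might look like the sketch below. The column names (startTs, endTs, isHoliday), the single hypothetical holiday, and the choice of one row per day instead of one per shift are all assumptions for illustration, and SQLite again stands in for the real database.

```python
import sqlite3
from datetime import date, timedelta

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table1 (deliveryDate TEXT, someValues INTEGER);
CREATE TABLE almanac (
    startTs   TEXT PRIMARY KEY,  -- inclusive start of the period
    endTs     TEXT NOT NULL,     -- exclusive end of the period
    year      INTEGER,
    month     INTEGER,
    weekday   TEXT,
    isHoliday INTEGER            -- company-specific flag
);
""")

# Populate one row per day for January 2024.
holidays = {date(2024, 1, 1)}  # hypothetical company holiday
day = date(2024, 1, 1)
while day < date(2024, 2, 1):
    next_day = day + timedelta(days=1)
    conn.execute(
        "INSERT INTO almanac VALUES (?, ?, ?, ?, ?, ?)",
        (day.isoformat(), next_day.isoformat(), day.year, day.month,
         day.strftime("%A"), int(day in holidays)),
    )
    day = next_day

conn.executemany(
    "INSERT INTO table1 VALUES (?, ?)",
    [
        ("2024-01-01 08:00:00", 10),  # Monday, holiday
        ("2024-01-02 09:00:00", 7),   # Tuesday
        ("2024-01-08 10:00:00", 3),   # Monday
        ("2024-01-09 11:00:00", 4),   # Tuesday
    ],
)

# Join on the timestamp range, then group by any almanac column.
by_weekday = conn.execute("""
SELECT a.weekday, SUM(t.someValues)
FROM table1 AS t
JOIN almanac AS a
  ON t.deliveryDate >= a.startTs AND t.deliveryDate < a.endTs
WHERE a.isHoliday = 0
GROUP BY a.weekday
ORDER BY a.weekday
""").fetchall()
print(by_weekday)
```

Grouping by weekday while skipping holidays needs no date functions in the query at all; the almanac row already carries the answer.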
Answer by alextansc
I was looking for a similar solution for reporting purposes and came across an article called "Group by Month (and other time periods)". It shows various ways, good and bad, to group by a datetime field. Definitely worth looking at.
Answer by Frederik Gheysels
I think you should benchmark it to get reliable results, but IMHO, and as a first thought, letting the DB take care of it (your second approach) would be much faster than doing it in your client code. With your first approach, you have multiple round trips to the DB, which I think will be far more expensive. :)
Answer by Cade Roux
You may want to look at a dimensional approach (this is similar to what Walter Mitty has suggested), where each row has a foreign key to a date and/or time dimension. This allows very flexible summations through the join to this table, where these parts are precalculated. In these cases, the key is usually a natural integer key of the form YYYYMMDD and HHMMSS, which is relatively performant and also human readable.
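As a rough illustration of the dimensional idea, the sketch below builds a date dimension keyed by a YYYYMMDD integer and sums through the join. The table names dimDate and factDelivery, the chosen dimension columns, and the helper date_key are hypothetical, and SQLite stands in for the real database.

```python
import sqlite3
from datetime import date, datetime, timedelta


def date_key(d):
    """Natural integer key of the form YYYYMMDD."""
    return d.year * 10000 + d.month * 100 + d.day


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dimDate (
    dateKey INTEGER PRIMARY KEY,  -- YYYYMMDD
    year    INTEGER,
    quarter INTEGER,
    month   INTEGER,
    isoWeek INTEGER
);
CREATE TABLE factDelivery (
    dateKey    INTEGER REFERENCES dimDate (dateKey),
    someValues INTEGER
);
""")

# Precalculate the date parts once, one row per day of 2024.
d = date(2024, 1, 1)
while d <= date(2024, 12, 31):
    conn.execute(
        "INSERT INTO dimDate VALUES (?, ?, ?, ?, ?)",
        (date_key(d), d.year, (d.month - 1) // 3 + 1, d.month,
         d.isocalendar()[1]),
    )
    d += timedelta(days=1)

# Each fact row stores only the key derived from its timestamp.
deliveries = [
    (datetime(2024, 1, 31, 8, 0), 10),
    (datetime(2024, 2, 1, 9, 0), 5),
    (datetime(2024, 4, 15, 12, 0), 7),
]
conn.executemany(
    "INSERT INTO factDelivery VALUES (?, ?)",
    [(date_key(ts), value) for ts, value in deliveries],
)

by_quarter = conn.execute("""
SELECT dd.year, dd.quarter, SUM(f.someValues)
FROM factDelivery AS f
JOIN dimDate AS dd ON dd.dateKey = f.dateKey
GROUP BY dd.year, dd.quarter
ORDER BY dd.year, dd.quarter
""").fetchall()
print(by_quarter)
```

Switching the grouping to month or ISO week only changes the column names in the GROUP BY; the join stays a plain integer equality.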
Another alternative might be indexed views, where there are separate expressions for each of the date parts.
Or calculated columns.
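To get a feel for the calculated-column idea without a SQL Server instance, the sketch below uses the closest SQLite analogue, an index on an expression. That substitution is an assumption, as is the use of substr() on ISO-8601 text dates to derive a 'YYYY-MM' value; SQL Server's persisted computed columns and indexed views have their own syntax and restrictions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (deliveryDate TEXT, someValues INTEGER)")

# Index the derived month value instead of the raw column.
conn.execute(
    "CREATE INDEX idx_month ON table1 (substr(deliveryDate, 1, 7))"
)
conn.executemany(
    "INSERT INTO table1 VALUES (?, ?)",
    [
        ("2024-01-01 08:00:00", 10),
        ("2024-01-20 17:30:00", 5),
        ("2024-02-02 09:15:00", 7),
    ],
)

# Group by exactly the indexed expression.
by_month = conn.execute(
    "SELECT substr(deliveryDate, 1, 7) AS month, SUM(someValues) "
    "FROM table1 GROUP BY substr(deliveryDate, 1, 7) ORDER BY month"
).fetchall()

# A filter written with the same expression can use the index.
month_plan = [
    row[-1]
    for row in conn.execute(
        "EXPLAIN QUERY PLAN SELECT SUM(someValues) FROM table1 "
        "WHERE substr(deliveryDate, 1, 7) = ?",
        ("2024-01",),
    )
]
print(by_month)
print(month_plan)
```

The query has to repeat the expression exactly as it appears in the index definition for the planner to match it.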
But performance has to be tested and execution plans examined...