SQL Server slow select from large table
Disclaimer: this page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you need to use it, you must likewise follow the CC BY-SA license, cite the original address and author information, and attribute it to the original authors (not me): StackOverflow
Original address: http://stackoverflow.com/questions/1835230/
SQL Server slow select from large table
Asked by pablox
I have a table with about 20+ million records.
Structure is like:
EventId UNIQUEIDENTIFIER
SourceUserId UNIQUEIDENTIFIER
DestinationUserId UNIQUEIDENTIFIER
CreatedAt DATETIME
TypeId INT
MetaId INT
Table is receiving about 100k+ records each day.
I have indexes on each column except MetaId, as it is not used in WHERE clauses.
The problem is when I want to pick up, for example, the latest 100 records for a desired SourceUserId.
Query sometimes takes up to 4 minutes to execute, which is not acceptable.
E.g.
SELECT TOP 100 * FROM Events WITH (NOLOCK)
WHERE SourceUserId = '15b534b17-5a5a-415a-9fc0-7565199c3461'
AND
(
TypeId IN (2, 3, 4)
OR
(TypeId = 60 AND SrcMemberId != DstMemberId)
)
ORDER BY CreatedAt DESC
I can't do partitioning etc., as I am using the Standard edition of SQL Server and Enterprise is too expensive.
I also think the table is quite small for it to be that slow.
I think the problem is with the ORDER BY clause, as the db must go through a much bigger set of data.
Any ideas how to make it quicker?
Perhaps a relational database is not a good idea for that kind of data.
Data is always picked up ordered by CreatedAt DESC.
Thank you for reading.
PabloX
Answered by boydc7
You'll likely want to create a composite index for this type of query. When the query runs slowly, it is most likely choosing to scan down an index on the CreatedAt column and perform a residual filter on the SourceUserId value, when in reality what you want is to jump directly to all records for a given SourceUserId, ordered properly. To achieve this, create a composite index primarily on SourceUserId (performing an equality check) and secondarily on CreatedAt (to preserve the order within a given SourceUserId value). You may want to try adding TypeId as well, depending on the selectivity of this column.
So, the 2 that will most likely give the best repeatable performance (try them out and compare) would be:
- Index on (SourceUserId, CreatedAt)
- Index on (SourceUserId, TypeId, CreatedAt)
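As T-SQL, those two candidates might be created like this (the index names and the dbo.Events schema name are illustrative, not from the question):

```sql
-- Candidate 1: equality on SourceUserId first, then CreatedAt to preserve order
CREATE NONCLUSTERED INDEX IX_Events_SourceUserId_CreatedAt
    ON dbo.Events (SourceUserId, CreatedAt DESC);

-- Candidate 2: also keys TypeId, worth trying if that column is selective
CREATE NONCLUSTERED INDEX IX_Events_SourceUserId_TypeId_CreatedAt
    ON dbo.Events (SourceUserId, TypeId, CreatedAt DESC);
```

Declaring CreatedAt as DESC in the key matches the query's ORDER BY CreatedAt DESC, though SQL Server can scan an ascending key backwards as well.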
As always, there are many other considerations in determining how/what/where to index. As Remus discusses in a separate answer, one big consideration is covering the query vs. keeping lookups. Additionally, you'll need to consider write volumes, possible fragmentation impact (if any), singleton lookups vs. large sequential scans, etc.
Answered by Remus Rusanu
I have indexes on each column except MetaId
Non-covering indexes will likely hit the 'tipping point' and the query will revert to a table scan. Just adding an index on every column that appears in a WHERE clause does not equate to good index design. Taking your query as an example, a good 100% covering index would be:
INDEX ON (SourceUserId, CreatedAt) INCLUDE (TypeId, SrcMemberId, DstMemberId)
The following index is also useful, although it is still going to cause lookups:
INDEX ON (SourceUserId, CreatedAt) INCLUDE (TypeId)
And finally, an index without any included columns may help, but is just as likely to be ignored (it depends on the column statistics and cardinality estimates):
INDEX ON (SourceUserId, CreatedAt)
But a separate index on SourceUserId and one on CreatedAt is basically useless for your query.
See Index Design Basics.
Answered by Enrico Campidoglio
The fact that the table has indexes built on GUID values indicates a possible series of problems that would affect performance:
- High index fragmentation: since new GUIDs are generated randomly, the index cannot organize them in sequential order and the nodes are spread unevenly.
- High number of page splits: the size of a GUID (16 bytes) causes many page splits in the index, since there's a greater chance that a new value won't fit in the remaining space available in a page.
- Slow value comparison: comparing two GUIDs is a relatively slow operation because all 16 bytes must be compared.
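A commonly suggested mitigation for the random-insert problem, shown here only as a hedged sketch and not as part of this answer, is to have the server generate sequential GUIDs, assuming new EventId values come from the database rather than the application (the constraint and schema names below are illustrative):

```sql
-- NEWSEQUENTIALID() is only valid as a column default, not as a free function
-- in a SELECT. Sequential values avoid random insert points in any index
-- built on EventId, reducing fragmentation and page splits.
ALTER TABLE dbo.Events
    ADD CONSTRAINT DF_Events_EventId DEFAULT NEWSEQUENTIALID() FOR EventId;
```

Note that sequential GUIDs trade away the unguessability of random ones, which matters if EventId values are ever exposed externally.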
Here are a couple of resources on how to investigate and resolve these problems:
Answered by Keith Williams
We've realised a minor gain by moving to a BIGINT IDENTITY key for our event table; by using that as a clustered primary key, we can cheat and use that for date ordering.
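A sketch of that layout, with the columns taken from the question's schema (the Events2, EventSeq, and constraint names are illustrative):

```sql
-- BIGINT IDENTITY as the clustered primary key keeps rows in insert order,
-- so ORDER BY EventSeq DESC approximates ORDER BY CreatedAt DESC cheaply.
CREATE TABLE dbo.Events2 (
    EventSeq          BIGINT IDENTITY(1,1) NOT NULL,
    EventId           UNIQUEIDENTIFIER NOT NULL,
    SourceUserId      UNIQUEIDENTIFIER NOT NULL,
    DestinationUserId UNIQUEIDENTIFIER NOT NULL,
    CreatedAt         DATETIME NOT NULL,
    TypeId            INT NOT NULL,
    MetaId            INT NULL,
    CONSTRAINT PK_Events2 PRIMARY KEY CLUSTERED (EventSeq)
);
```

The "latest 100" query can then lean on the clustered key, e.g. SELECT TOP 100 * FROM dbo.Events2 WHERE SourceUserId = @UserId ORDER BY EventSeq DESC, assuming rows are inserted in roughly CreatedAt order.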
Answered by Adriaan Stander
I would recommend getting the data into 2 separate table variables:
-- Table variables mirroring the Events columns from the question
DECLARE @Table1 TABLE (EventId UNIQUEIDENTIFIER, SourceUserId UNIQUEIDENTIFIER, DestinationUserId UNIQUEIDENTIFIER, CreatedAt DATETIME, TypeId INT, MetaId INT)
DECLARE @Table2 TABLE (EventId UNIQUEIDENTIFIER, SourceUserId UNIQUEIDENTIFIER, DestinationUserId UNIQUEIDENTIFIER, CreatedAt DATETIME, TypeId INT, MetaId INT)

INSERT INTO @Table1
SELECT * FROM Events WITH (NOLOCK)
WHERE SourceUserId = '15b534b17-5a5a-415a-9fc0-7565199c3461'
AND TypeId IN (2, 3, 4)

INSERT INTO @Table2
SELECT * FROM Events WITH (NOLOCK)
WHERE SourceUserId = '15b534b17-5a5a-415a-9fc0-7565199c3461'
AND (TypeId = 60 AND SrcMemberId != DstMemberId)
Then apply a UNION over the two selects, with ORDER BY and TOP. Limit the data from the get-go.
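That final step might look like this (assuming @Table1 and @Table2 were declared to mirror the Events columns):

```sql
-- Combine the two intermediate result sets, then take the newest 100 overall
SELECT TOP 100 *
FROM (
    SELECT * FROM @Table1
    UNION ALL
    SELECT * FROM @Table2
) x
ORDER BY x.CreatedAt DESC;
```

UNION ALL is safe here because the two TypeId filters are disjoint, so no duplicate rows can arise between the branches.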
Answered by OMG Ponies
I suggest using a UNION:
SELECT TOP 100 x.*
FROM (SELECT a.*
FROM EVENTS a
WHERE a.typeid IN (2, 3, 4)
UNION ALL
SELECT b.*
FROM EVENTS b
WHERE b.typeid = 60
AND b.srcmemberid != b.dstmemberid) x
WHERE x.sourceuserid = '15b534b17-5a5a-415a-9fc0-7565199c3461'
ORDER BY x.createdat DESC
Answered by Leon
If there are 100K records added each day, you should check your index fragmentation, and rebuild or reorganize accordingly. More info: SQLauthority
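A sketch of that check using the standard DMV (the dbo.Events name is illustrative; the thresholds are the usual rule of thumb, not a hard rule):

```sql
-- Report fragmentation per index on the Events table
SELECT i.name, s.avg_fragmentation_in_percent
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.Events'),
                                    NULL, NULL, 'LIMITED') AS s
JOIN sys.indexes AS i
  ON i.object_id = s.object_id AND i.index_id = s.index_id;

-- Roughly: REORGANIZE between ~5% and ~30% fragmentation, REBUILD above that
ALTER INDEX ALL ON dbo.Events REORGANIZE;
-- ALTER INDEX ALL ON dbo.Events REBUILD;
```

REORGANIZE is always online; on Standard edition, REBUILD takes the index offline while it runs.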
Answered by Jim B
I would make sure CreatedAt is indexed properly.
Answered by vladhorby
You could split the query in two with a UNION to avoid the OR (which can cause your index not to be used), something like:
SELECT TOP 100 * FROM (
SELECT * FROM (SELECT TOP 100 * FROM Events WITH (NOLOCK)
WHERE SourceUserId = '15b534b17-5a5a-415a-9fc0-7565199c3461'
AND TypeId IN (2, 3, 4)
ORDER BY CreatedAt DESC) a
UNION
SELECT * FROM (SELECT TOP 100 * FROM Events WITH (NOLOCK)
WHERE SourceUserId = '15b534b17-5a5a-415a-9fc0-7565199c3461'
AND TypeId = 60 AND SrcMemberId != DstMemberId
ORDER BY CreatedAt DESC) b
) x
ORDER BY CreatedAt DESC
Also, check that the uniqueidentifier indexes are not CLUSTERED.
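One way to perform that check, as a sketch against the catalog views (assuming the table lives in the dbo schema):

```sql
-- List every index on Events with its type and key columns; a row with
-- type_desc = 'CLUSTERED' on a uniqueidentifier key is the red flag here
SELECT i.name, i.type_desc, c.name AS key_column
FROM sys.indexes AS i
JOIN sys.index_columns AS ic
  ON ic.object_id = i.object_id AND ic.index_id = i.index_id
JOIN sys.columns AS c
  ON c.object_id = ic.object_id AND c.column_id = ic.column_id
WHERE i.object_id = OBJECT_ID('dbo.Events');
```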