How to solve intermittent SQL timeout errors

Question

Asked by Shawn Steward

We've been having a few instances per day where we get a slew of SQL Timeout errors from multiple applications (System.Data.SqlClient.SqlException: Timeout expired. The timeout period elapsed prior to completion of the operation or the server is not responding.) We have over 100 different applications on our network, both web and desktop apps. Everything from VB6 and Classic ASP to .NET 4. I can find all kinds of data that show the side effects but can't pinpoint what is causing this. Our DBA says nothing is wrong with the SQL server, and IT says there's nothing wrong with the web servers or network, so of course I'm left in the middle trying to troubleshoot this.

I'm really just looking for suggestions on what other troubleshooting I can do to try and track this down.

We're running SQL Server 2008 R2 in a cluster. There's a handful of different servers that connect to it, ranging from Windows server 2003 to 2008 of different varieties.

Here's what I've done so far:

Run SQL trace of long running queries and deadlocks. This shows no deadlocks at the times of the problems, and long running queries all coincide with our timeout errors, but look to be a side effect, and not the cause. Queries that are very basic that typically return instantly end up taking 30, 60 or 120 seconds to run at times. This happens for a few minutes then everything picks up and works fine after that.
Use performance monitor to track connection pool connections. This sometimes shows some spikes in the number of connections near the times of the timeouts, but still not even halfway to the default 100 connection limit. Again, nothing here that seems to point to a cause.
Separate web applications into different App Pools. We tried to narrow down the apps we thought may be the main problem (most chatty, etc) and put them in separate Application Pools but that doesn't seem to affect anything or help us narrow down anything.
Monitor disk usage on SQL Server. We've done some monitoring on the SQL server and see no spikes or any signs of problems when these timeouts are occurring.
Verified TempDB was not the cause of the problem.

I'll come back and add more if I think of what else we've tried. Please let me know some ideas on what to troubleshoot next.

Answer 1

Accepted answer by Peter

Run SQL trace of long running queries and deadlocks. This shows no deadlocks at the times of the problems, and long running queries all coincide with our timeout errors, but look to be a side effect, and not the cause. Queries that are very basic that typically return instantly end up taking 30, 60 or 120 seconds to run at times. This happens for a few minutes then everything picks up and works fine after that.

It looks like some queries/transactions lock your database until they are done. You have to find out which queries are blocking and rewrite them, or run them at another time, to avoid blocking other processes. At the moment the waiting queries simply time out.

An extra point to dig into is the autogrowth size of your transaction log and database files. Set it to a fixed size instead of a percentage of the current file size. With percentage growth, the larger the files get, the longer it takes to allocate each new chunk of space, and eventually that takes longer than your transaction timeout. Then your db comes to a halt.
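
To make the growth arithmetic concrete, here is a small sketch comparing successive growth steps under a percentage setting and under a fixed setting. The file size, the 10% figure and the 512 MB figure are made-up illustration values, not recommendations, and `growth_steps_mb` is a hypothetical helper name.

```python
def growth_steps_mb(start_mb, steps, percent=None, fixed_mb=None):
    """Return the size in MB of each successive autogrowth step.

    Exactly one of percent or fixed_mb must be given.
    """
    if (percent is None) == (fixed_mb is None):
        raise ValueError("give exactly one of percent or fixed_mb")
    size = float(start_mb)
    increments = []
    for _ in range(steps):
        # A percentage step is computed from the current size, so it keeps growing.
        step = size * percent / 100.0 if percent is not None else float(fixed_mb)
        increments.append(step)
        size += step
    return increments


# Hypothetical example: a 100 GB file that has to grow five times.
percent_steps = growth_steps_mb(102400, 5, percent=10)
fixed_steps = growth_steps_mb(102400, 5, fixed_mb=512)
print("10% growth steps (MB):", [round(s) for s in percent_steps])
print("512 MB growth steps (MB):", [round(s) for s in fixed_steps])
```

Each percentage step is larger than the one before it, while the fixed step stays the same size, which is the reason for preferring a fixed increment.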

Answer 2

Answered by Matt Faus

Performance problems boil down to CPU, IO, or Lock contention. It sounds like you have ruled out IO. I would guess CPU is not a problem since this is a database, not a number cruncher. So, that leaves lock contention.

If you can execute sp_who2 while the queries are timing out, you can use the BlkBy column to trace back to the session holding the lock that everyone else is waiting on. Since this is only happening a few times a day, you may have trouble catching enough data if you are running this manually, so I suggest you rig up an automated system to dump this output on a regular basis, or maybe have it triggered by the application timeout exceptions. You can also use the Activity Monitor to watch the degradation of query responsiveness in real-time, as suggested by peer.
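
As a rough sketch of the "trace back" step, the function below takes (SPID, BlkBy) pairs, however you captured them from an sp_who2 dump, and follows each blocking chain to the session at its head. The function name and the input shape are assumptions for illustration; any BlkBy value that is not a number is treated as "not blocked".

```python
def find_head_blockers(rows):
    """Map each head-blocker SPID to the SPIDs waiting behind it.

    rows: iterable of (spid, blk_by) pairs, where blk_by is the BlkBy
    value as text. Waiters are included whether they are blocked
    directly or through a chain of other blocked sessions.
    """
    blocked_by = {}
    for spid, blk_by in rows:
        text = str(blk_by).strip()
        if text.isdigit() and int(text) != int(spid):
            blocked_by[int(spid)] = int(text)

    victims = {}
    for spid in blocked_by:
        seen = {spid}
        head = blocked_by[spid]
        # Walk up the chain until we reach a session that is not itself
        # blocked; the 'seen' set stops us looping forever on a cycle.
        while head in blocked_by and head not in seen:
            seen.add(head)
            head = blocked_by[head]
        victims.setdefault(head, []).append(spid)
    return {head: sorted(spids) for head, spids in victims.items()}


# Hypothetical snapshot: 53 waits on 52, which waits on 51.
snapshot = [(51, "  ."), (52, "51"), (53, "52"), (54, "  ."), (55, "54")]
print(find_head_blockers(snapshot))
```

Run on every scheduled dump, this points at the session worth investigating rather than at the sessions that are merely queued behind it.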

Once you find the long-running query and the application that executes it, you can immediately resolve the domino of timeouts by reducing the timeout for that single application below all the others (right now, it must be longer). Then, you should inspect the code to determine a better solution. You could reduce the time the lock is held by committing the transaction sooner within a sproc, or reduce the lock required by the reading query with hints such as NOLOCK or UPDLOCK.

Here's some more reading on sp_who2: http://sqlserverplanet.com/dba/using-sp_who2/

And query hints: http://msdn.microsoft.com/en-us/library/ms181714.aspx http://msdn.microsoft.com/en-us/library/ms187373.aspx

Answer 3

Answered by Andrew

Bit of a long shot, but in a lab a while back we had a situation where a SQL Server appeared unresponsive, and not because we had spiked the CPU or anything else we could track within SQL Server. It appeared operational to all tests, but connections failed under some load.

The issue turned out to be that the volume of traffic against the server was triggering the SYN attack flood protection built into Windows. Annoyingly, when you hit this there is no logged message in Windows Server or in SQL Server; you only see the symptoms, which are connections failing to be made. This is because Windows slows down accepting the messages and lets a queue build. From the connection standpoint, the server appears not to respond when it should (it doesn't even acknowledge that the message arrived).

http://msdn.microsoft.com/en-us/library/ee377084(v=bts.10).aspx

Scroll down to SynAttackProtect and you will see that from Windows Server 2003 SP1 onwards this feature is enabled by default. It is in effect a DDoS protection mechanism, and the lack of any logging when it triggers makes it incredibly difficult to detect when your server does this.

It took 3 days within the MS lab before it was figured out.

You mentioned 100 connections. We had an app that constantly connected, ran queries and then disconnected; it did not hold the connections open. This meant we had multiple threads on each connecting machine doing this (10 machines, multiple threads per machine), and that was considered enough different connections consistently being made and dropped to trigger the defense.

Whether you are at that level (since it is not a clearly defined threshold by MS) is hard to say.

Answer 4

Answered by Michael Fredrickson

Like the other posters have suggested, it sounds like you have a lock contention issue. We faced a similar issue a few weeks back; however, ours was much more intermittent, and often cleared up before we could get a DBA onto the server to run sp_who2 to trace down the issue.

What we ended up doing was implementing an e-mail notification that fires if a lock exceeds a certain threshold. Once we put this in place, we were able to identify the processes that were locking, and change the isolation level to read uncommitted where appropriate to fix the issue.
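
The notification logic itself is simple. Below is a minimal sketch of the threshold check only, assuming you already have a list of current waits from some monitoring query. The dictionary keys, the function name and the 30 second default are hypothetical, and actually sending the mail is left to whatever mechanism you use.

```python
def blocking_alert(waits, threshold_seconds=30):
    """Return an alert message for waits at or above the threshold, else None.

    waits: iterable of dicts with the (hypothetical) keys 'session_id',
    'blocking_session_id' and 'wait_seconds'.
    """
    long_waits = [w for w in waits if w["wait_seconds"] >= threshold_seconds]
    if not long_waits:
        return None
    # Longest wait first, so the worst offender is at the top of the mail.
    long_waits.sort(key=lambda w: w["wait_seconds"], reverse=True)
    lines = ["Blocking above %d s detected:" % threshold_seconds]
    for w in long_waits:
        lines.append(
            "session %d blocked by %d for %d s"
            % (w["session_id"], w["blocking_session_id"], w["wait_seconds"])
        )
    return "\n".join(lines)


# Hypothetical sample: only the 45 second wait crosses the threshold.
sample = [
    {"session_id": 60, "blocking_session_id": 52, "wait_seconds": 5},
    {"session_id": 61, "blocking_session_id": 52, "wait_seconds": 45},
]
print(blocking_alert(sample))
```

Returning None when nothing crosses the threshold lets a scheduled job call this every minute and stay silent unless there is something to report.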

Here's an article that provides an overview of how to configure this type of notification.

If locking turns out to be the issue, and if you're not already doing so, I would suggest looking into configuring row versioning-based isolation levels.

Answer 5

Answered by n8wrl

You are on the right track with your tracing and profiling. What you need to do is look for what the queries that time out have in common. It is likely they will all be hitting a small subset of tables or indexes. I suspect some application has a long-running update/insert that affects queries on tables that use indexes affected by the updates/inserts.

You have to work backwards a little. Given the subset of tables you see timing out, see what indexes are on those tables. Look for other queries running at the same time that touch those tables/indexes. I'm betting you will find a small set of updates/inserts doing this.
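
One quick way to look for that common subset is to count which table names show up across the statements that timed out. The sketch below uses a deliberately crude regular expression (it only looks at the name following FROM, JOIN, UPDATE or INTO) and hypothetical function names. Treat its output as a hint for where to look, not as a SQL parser.

```python
import re
from collections import Counter

# Crude pattern: the identifier right after FROM / JOIN / UPDATE / INTO.
TABLE_PATTERN = re.compile(
    r"\b(?:from|join|update|into)\s+([\[\]\w.]+)", re.IGNORECASE
)


def tables_touched(sql):
    """Return the set of table names referenced in one SQL statement."""
    return {
        name.replace("[", "").replace("]", "").lower()
        for name in TABLE_PATTERN.findall(sql)
    }


def common_tables(timed_out_queries):
    """Count how many timed-out statements touch each table, most common first."""
    counts = Counter()
    for sql in timed_out_queries:
        counts.update(tables_touched(sql))
    return counts.most_common()


# Hypothetical statements copied out of a trace of timed-out queries.
queries = [
    "SELECT * FROM dbo.Orders o JOIN dbo.Customers c ON o.CustomerId = c.Id",
    "UPDATE dbo.Orders SET Status = 1 WHERE Id = 5",
    "SELECT Name FROM [dbo].[Products]",
]
print(common_tables(queries))
```

A table that appears in most of the timed-out statements is the one whose indexes and concurrent writers deserve a closer look.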

Then you have some decisions to make. One option is to change the locking hints on the queries that are timing out. But that is generally bad practice because it will mask the real problem for a while. While you might see the timeouts go away for a while, depending on the hint you choose you might end up with dirty reads and then bogus data coming back from those queries. That might turn out to be worse than the timeouts; hard to say.

Best bet is to figure out which of your applications are submitting the update/inserts you found and dig in to figure out why they take so long.

Answer 6

Answered by Simon Mourier

I suggest you have a deep look at SQL Server's super cool Dynamic Management Views feature:

Dynamic management views and functions return server state information that can be used to monitor the health of a server instance, diagnose problems, and tune performance.

This article is a good start with DMVs, although it was written for SQL Server 2005 (where DMVs first appeared): Troubleshooting Performance Problems in SQL Server 2005, especially the 'blocking' chapters.
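
As one example of what the DMVs can show during an incident, the sketch below wraps a query against `sys.dm_exec_requests` and `sys.dm_exec_sql_text` that lists requests currently blocked by another session. It assumes a Python DB-API cursor already connected to the server with permission to read server state. The function name and the choice of columns are mine, not something taken from the article.

```python
# T-SQL: requests that are currently waiting on another session.
BLOCKED_REQUESTS_SQL = """
SELECT r.session_id,
       r.blocking_session_id,
       r.wait_type,
       r.wait_time,          -- milliseconds spent waiting so far
       t.text AS sql_text
FROM sys.dm_exec_requests AS r
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
WHERE r.blocking_session_id <> 0
ORDER BY r.wait_time DESC;
"""


def fetch_blocked_requests(cursor):
    """Run the DMV query on a DB-API cursor and return a list of dicts."""
    cursor.execute(BLOCKED_REQUESTS_SQL)
    # cursor.description holds one tuple per column; item 0 is the name.
    columns = [col[0] for col in cursor.description]
    return [dict(zip(columns, row)) for row in cursor.fetchall()]
```

Logging this result every few seconds while the timeouts are happening records who was waiting on whom, which the long-running query trace alone does not show.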

Answer 7

Answered by Carth

Sounds like you may already have your answer but in case you need one more place to look you may want to check out the size and activity of your temp DB. We had an issue like this once at a client site where a few times a day their performance would horribly degrade and occasionally timeout. The problem turned out to be a separate application that was thrashing the temp DB so much it was affecting overall server performance.

Good luck with the continued troubleshooting!

Answer 8

Answered by UnhandledExcepSean

I've seen similar problems happen if anti-virus was installed on the SQL server. The AV's auto-update features were clocking the server and not allowing enough CPU for SQL Server.

Also, have you put a small application on the SQL server itself that verifies that connections can be made or runs very basic SQL like "SELECT GETDATE();"? This would eliminate network possibilities.
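
A probe like that can be very small. The sketch below only does the timing and error capture. The `run_query` callable, which would open a connection and run something like `SELECT GETDATE();`, is left to you because it depends on your driver. The function name and the one second threshold are assumptions.

```python
import time


def timed_probe(run_query, slow_after_seconds=1.0, clock=time.perf_counter):
    """Run a trivial query once and report how long it took.

    run_query: zero-argument callable supplied by the caller that
    connects and executes a basic statement.
    """
    started = clock()
    try:
        run_query()
        error = None
    except Exception as exc:  # record the failure instead of crashing the probe
        error = repr(exc)
    elapsed = clock() - started
    return {
        "elapsed_seconds": elapsed,
        "slow": elapsed >= slow_after_seconds,
        "error": error,
    }


# Stand-in for a real query, just to show the shape of the result.
print(timed_probe(lambda: time.sleep(0.01)))
```

Running the same probe on the SQL server itself and on one of the web servers at the same moments lets you compare the two: if only the remote probe is slow during an incident, the network path is the suspect.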

Answer 9

Answered by Balmukund Lakhani

Since I do troubleshooting every day as a part of my job, here is what I would do:

Since it's SQL Server 2008 R2, you can run SQLDiag, which comes as part of the product. Refer to Books Online for more details. In brief, capture a server-side trace and the blocker script.
Once the trace is captured, look for the "Attention" event. That will be the SPID that received the error. If you filter by SPID, you will see an RPC:Completed event before the "Attention". Check the duration there. Is it 30 seconds? If yes, then the client waited 30 seconds for a response from SQL Server and timed out [this is a client-side setting; SQL Server would never stop a connection on its own].
Now, check whether the query that was running really should take 30 seconds.
If yes, tune the query or increase the timeout setting on the client.
If no, the query must have been waiting for some resource (blocked).
At this point, go back to the blocker script and check the time frame when the "Attention" occurred.
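
The trace check in these steps can be sketched as a small function over trace rows you have already exported. The dictionary keys (`spid`, `event`, `duration_seconds`, `text`) and the function name are hypothetical. The logic just pairs each "Attention" with the last RPC:Completed seen for the same SPID and compares its duration with the client timeout.

```python
def find_client_timeouts(events, client_timeout_seconds=30.0, tolerance_seconds=1.0):
    """Pair each Attention event with the preceding RPC:Completed for that SPID.

    events: trace rows in time order, each a dict with 'spid', 'event',
    and for RPC:Completed rows 'duration_seconds' (and optionally 'text').
    """
    last_completed = {}
    findings = []
    for ev in events:
        spid = ev["spid"]
        if ev["event"] == "RPC:Completed":
            last_completed[spid] = ev
        elif ev["event"] == "Attention":
            prior = last_completed.get(spid)
            if prior is None:
                continue  # nothing to compare against for this SPID
            duration = prior["duration_seconds"]
            findings.append({
                "spid": spid,
                "text": prior.get("text", ""),
                "duration_seconds": duration,
                # True when the call ran for about as long as the client timeout.
                "hit_client_timeout":
                    abs(duration - client_timeout_seconds) <= tolerance_seconds,
            })
    return findings


# Hypothetical trace excerpt: SPID 57 was cut off at roughly 30 seconds.
trace = [
    {"spid": 57, "event": "RPC:Completed", "duration_seconds": 30.01, "text": "exec GetOrder"},
    {"spid": 58, "event": "RPC:Completed", "duration_seconds": 0.2, "text": "exec Ping"},
    {"spid": 57, "event": "Attention"},
]
print(find_client_timeouts(trace))
```

Each finding gives the statement and the time window to look up in the blocker script output.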

The above assumes the issue is with SQL Server and is not network related!

Answer 10

Answered by MarianP

My experience with these issues (not on SQL Server though) is that overdone multi-tasking is often the cause of the problem. If similar or connected data/tables are queried at (almost) the same time by many connections, the DBMS may have trouble keeping all the isolation in check. This is less an issue of disk usage than of making some connections wait for work to be done by others. Synchronization is very expensive in terms of CPU usage.

100 connections is way too many in my opinion. (Again, in my experience) even 20 connections served by one machine may be overly optimistic.
