Original question: http://stackoverflow.com/questions/26148/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Windows Service Increasing CPU Consumption
Asked by TheSmurf
At my job, I have a clutch of six Windows services that I am responsible for, written in C# 2003. Each of these services contains a timer that fires every minute or so, which is where the majority of their work happens.
My problem is that, as these services run, they start to consume more and more CPU time through each iteration of the loop, even if there is no meaningful work for them to do (i.e., they're just idling, looking through the database for something to do). When they start up, each service uses an average of (about) 2-3% of 4 CPUs, which is fine. After 24 hours, each service will be consuming an entire processor for the duration of its loop's run.
Can anyone help? I'm at a loss as to what could be causing this. Our current solution is to restart the services once a day (they shut themselves down, then a script sees that they're offline and restarts them at about 3AM). But this is not a long-term solution; my concern is that as the services get busier, restarting them once a day may not be sufficient... but as there's a significant startup penalty (they all use NHibernate for data access), as they get busier, exactly what we don't want to be doing is restarting them more frequently.
@akmad: True, it is very difficult.
- Yes, a service run in isolation will show the same symptom over time.
- No, it doesn't. We've looked at that. This can happen at 10AM or 6PM or in the middle of the night. There's no consistency.
- We do; and they are. The services are doing exactly what they should be, and nothing else.
- Unfortunately, that requires foreknowledge of exactly when the services are going to be maxing out CPUs, which happens on an unpredictable schedule, and never very quickly... which makes things doubly difficult, because my boss will run and restart them when they start having problems without thinking of debug issues.
- No, they're using a fairly consistent amount of RAM (approx. 60-80MB each, out of 4GB on the machine).
Good suggestions, but rest assured, we have tried all of the usual troubleshooting. What I'm hoping is that this is a .NET issue that someone might know about, that we can work on solving. My boss' solution (which I emphatically don't want to implement) is to put a field in the database which holds multiple times for the services to restart during the day, so that he can make the problem go away and not think about it. I'm desperately seeking the cause of the real problem so that I can fix it, because that solution will become a disaster in about six months.
@Yaakov Ellis: They each have a different function. One reads records out of an Oracle database somewhere offsite; another one processes those records and transfers files belonging to those records over to our system; a third checks those files to make sure they're what we expect them to be; another is a maintenance service that constantly checks things like disk space (that we have enough) and polls other servers to make sure they're alive; one is running only to make sure all of these other ones are running and doing their jobs, monitors and reports errors, and restarts anything that's failed to keep the whole system going 24 hours a day.
So, if you're asking what I think you're asking, no, there isn't one common thing that all these services do (other than database access via NHibernate) that I can point to as a potential problem. Unfortunately, if that turns out to be the actual issue (which wouldn't surprise me greatly), the whole thing might be screwed -- and I'll end up rewriting all of them in simple SQL. I'm hoping it's a garbage collector problem or something easier to deal with than NHibernate.
@Joshdan: No secret. As I said, we've tried all the usual troubleshooting. Profiling was unhelpful: the profiler we use was unable to point to any code that was actually executing when the CPU usage was high. These services were torn apart about a month ago looking for this problem. Every section of code was analyzed to attempt to figure out if our code was the issue; I'm not here asking because I haven't done my homework. Were this a simple case of the services doing more work than anticipated, that's something that would have been caught.
The problem here is that, most of the time, the services are not doing anything at all, yet still manage to consume 25% or more of four CPU cores: they're finding no work to do, and exiting their loop and waiting for the next iteration. This should, quite literally, take almost no CPU time at all.
Here's an example of the behaviour we're seeing, on a service with no work to do for two days (in an unchanging environment). This was captured last week:
Day 1, 8AM: Avg. CPU usage approx 3%
Day 1, 6PM: Avg. CPU usage approx 8%
Day 2, 7AM: Avg. CPU usage approx 20%
Day 2, 11AM: Avg. CPU usage approx 30%
Having looked at all of the possible mundane reasons for this, I've asked this question here because I figured (rightly, as it turns out) that I'd get more innovative answers (like Ubiguchi's), or pointers to things I hadn't thought of (like Ian's suggestion).
So does the CPU spike happen immediately preceding the timer callback, within the timer callback, or immediately following the timer callback?
You misunderstand. This is not a spike. If it were, there would be no problem; I can deal with spikes. But it's not... the CPU usage is going up generally. Even when the service is doing nothing, waiting for the next timer hit. When the service starts up, things are nice and calm, and the graph looks like what you'd expect... generally, 0% usage, with spikes to 10% as NHibernate hits the database or the service does some trivial amount of work. But this increases to an across-the-board 25% (more if I let it go too far) usage at all times while the process is running.
That made Ian's suggestion the logical silver bullet (NHibernate does a lot of stuff when you're not looking). Alas, I've implemented his solution, but it hasn't had an effect (I have no proof of this, but I actually think it's made things worse... average usage is seeming to go up much faster now). Note that stripping out the NHibernate "sections" (as you recommend) is not feasible, since that would strip out about 90% of the code in the service, which would let me rule out the timer as a problem (which I absolutely intend to try), but can't help me rule out NHibernate as the issue, because if NHibernate is causing this, then the dodgy fix that's implemented (see below) is just going to have to become The Way The System Works; we are so dependent on NHibernate for this project that the PM simply won't accept that it's causing an unresolvable structural problem.
I just noted a sense of desperation in the question -- that your problems would continue barring a small miracle
Don't mean for it to come off that way. At the moment, the services are being restarted daily (with an option to input any number of hours of the day for them to shutdown and restart), which patches the problem but cannot be a long-term solution once they go onto the production machine and start to become busy. The problems will not continue, whether I fix them or the PM maintains this constraint on them. Obviously, I would prefer to implement a real fix, but since the initial testing revealed no reason for this, and the services have already been extensively reviewed, the PM would rather just have them restart multiple times than spend any more time trying to fix them. That's entirely out of my control and makes the miracle you were talking about more important than it would otherwise be.
That is extremely intriguing (insofar as you trust your profiler).
I don't. But then, these are Windows services written in .NET 1.1 running on a Windows 2000 machine, deployed by a dodgy Nant script, using an old version of NHibernate for database access. There's little on that machine I would actually say I trust.
Answered by akmad
It's obviously pretty difficult to remotely debug your unknown application... but here are some things I'd look at:
- What happens when you only run one of the services at a time? Do you still see the slow-down? This may indicate that there is some contention between the services.
- Does the problem always occur around the same time, regardless of how long the service has been running? This may indicate that something else (a backup, virus scan, etc) is causing the machine (or db) as a whole to slow down.
- Do you have logging or some other mechanism to be sure that the service is only doing work as often as you think it should?
- If you can see the performance degradation over a short time period, try running the service for a while and then attach a profiler to see exactly what is pegging the CPU.
- You don't mention anything about memory usage. Do you have any of this information for the services? It's possible that you're using up most of the RAM and causing the disk to thrash, or some similar problem.
Best of luck!
Answered by Ubiguchi
'Fraid this answer is only going to suggest some directions for you to look in, but having seen similar problems in .NET Windows Services I have a couple of thoughts you might find helpful.
My first suggestion is that your services might have some bugs in either the way they handle memory, or perhaps in the way they handle unmanaged memory. The last time I tracked down a similar issue, it turned out a 3rd-party OSS library we were using stored handles to unmanaged objects in static memory. The longer the service ran, the more handles the service picked up, which caused the process's CPU performance to nose-dive very quickly. The way to try to resolve this sort of issue is to ensure your services store nothing in memory in between the timer invocations, although if your 3rd-party libraries use static memory you might have to do something clever, like create an app domain for the timer invocation and ditch the app domain (and its static memory) once processing is complete.
The other issue I've seen in similar circumstances was with the timer synchronization code being suspect, which in effect allowed more than one thread to be running the processing code at once. When we debugged the code we found the 1st thread was blocking the 2nd, and by the time the 2nd kicked off there was a 3rd being blocked. Over time the blocking was lasting longer and longer and the CPU usage was therefore heading to the top. The solution we used to fix the issue was to implement proper synchronization code so the timer only kicked off another thread if it wouldn't be blocked.
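One way to get the "only kick off if it wouldn't be blocked" behaviour is a non-reentrant guard around the timer callback. The following is a minimal sketch, not the code from the answer: the class name `NonReentrantPoller` and its members are made up for illustration, and it is written against a current .NET runtime rather than .NET 1.1 (on 1.1 the lambda-based parts would need to be rewritten with named delegates, though `Interlocked.CompareExchange` itself is available there).

```csharp
using System;
using System.Threading;

// Hypothetical wrapper: drops a timer tick when the previous tick is still running,
// instead of letting ticks pile up behind each other on separate pool threads.
public sealed class NonReentrantPoller : IDisposable
{
    private readonly Action _work;   // the per-tick work (assumed to be supplied by the service)
    private readonly Timer _timer;
    private int _running;            // 0 = idle, 1 = a tick is in progress
    private int _skipped;            // number of ticks dropped because of overlap

    public NonReentrantPoller(Action work, TimeSpan period)
    {
        _work = work;
        _timer = new Timer(OnTick, null, period, period);
    }

    public int SkippedTicks
    {
        get { return Volatile.Read(ref _skipped); }
    }

    // Public so it can be driven by hand; the timer goes through the same path.
    public bool TryRunOnce()
    {
        // Atomically claim the flag; if it was already 1, another run is in progress.
        if (Interlocked.CompareExchange(ref _running, 1, 0) != 0)
        {
            Interlocked.Increment(ref _skipped);
            return false;
        }
        try
        {
            _work();
            return true;
        }
        finally
        {
            Interlocked.Exchange(ref _running, 0); // always release, even if the work throws
        }
    }

    private void OnTick(object state)
    {
        try { TryRunOnce(); }
        catch (Exception ex) { Console.Error.WriteLine(ex); } // log and keep the timer alive
    }

    public void Dispose()
    {
        _timer.Dispose();
    }
}
```

Logging `SkippedTicks` periodically is a cheap way to confirm or rule out this theory: if it stays at zero while CPU usage climbs, overlapping callbacks are probably not the cause.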
Hope this helps, but apologies up front if both my thoughts are red herrings.
Answered by Ubiguchi
Sounds like a threading issue with the timer. You might have one unit of work blocking another running on different worker threads, causing them to stack up every time the timer fires. Or you might have instances living and working longer than you expect.
I'd suggest refactoring out the timer. Replace it with a single thread that queues up work on the ThreadPool. You can Sleep() the thread to control how often it looks for new work. Make sure this is the only place where your code is multithreaded. All other objects should be instantiated as work is readied for processing and destroyed after that work is completed. STATE IS THE ENEMY in multithreaded code.
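The single-thread design described above might look roughly like the sketch below. It assumes a current .NET runtime, and everything named here (`PollingLoop`, `findWork`, `process`) is hypothetical. It also swaps `Thread.Sleep()` for a timed wait on a stop event, so the thread still sleeps between polls but can be woken immediately when the service stops.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

// Hypothetical replacement for a timer: one dedicated thread polls for work
// and hands each item to the ThreadPool.
public sealed class PollingLoop
{
    private readonly Func<IList<string>> _findWork; // assumed: returns ids of pending items
    private readonly Action<string> _process;       // assumed: handles a single item
    private readonly TimeSpan _idleDelay;
    private readonly ManualResetEvent _stop = new ManualResetEvent(false);
    private Thread _thread;

    public PollingLoop(Func<IList<string>> findWork, Action<string> process, TimeSpan idleDelay)
    {
        _findWork = findWork;
        _process = process;
        _idleDelay = idleDelay;
    }

    public void Start()
    {
        _thread = new Thread(Run);
        _thread.IsBackground = true;
        _thread.Name = "poller";
        _thread.Start();
    }

    // Stops the polling thread; work items already queued may still be running.
    public void Stop()
    {
        _stop.Set();
        _thread.Join();
    }

    private void Run()
    {
        do
        {
            // Only this thread ever looks for work, so polls cannot overlap.
            foreach (string item in _findWork())
            {
                string captured = item; // per-item copy for the closure
                ThreadPool.QueueUserWorkItem(delegate { _process(captured); });
            }
        }
        while (!_stop.WaitOne(_idleDelay)); // sleep between polls; returns true once Stop() is called
    }
}
```

Because `_findWork` is only ever called from the one polling thread, two polls can never run concurrently, which removes the stacking scenario by construction. A production version would also want a try/catch around the poll so one failed query does not end the thread.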
Another area where the design is lacking appears to be that you have multiple services that are polling resources to do something. I'd suggest unifying them under a single service. They might do separate things, but they're working in unison; you're just using the filesystem, database, etc. as a substitute for method calls. Also, 2003? I feel bad for you.
Answered by Andrea Bertani
I suggest hacking the problem into pieces.
First, find a way to reproduce the problem quickly and 100% of the time. Lower the timer interval so that the services fire more frequently (for example, 10 times more often than normal). If the problem arises 10 times quicker, then it's related to the number of iterations and not to elapsed time or to real work done by the services. And you will be able to do the next steps quicker than once a day.
Second, comment out all the real work code, and leave only the services, the timers and the synchronization mechanism. If the problem still shows up, then it will be in that part of the code.
If it doesn't, then start adding back the code you commented out, one piece at a time. Eventually, you should find out what part of the code is causing the problem.
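To make the "does it scale with iteration count?" check concrete, here is a small sketch that runs an iteration back to back and compares the CPU cost of late iterations with early ones. `GrowthProbe` and its methods are hypothetical names, the code targets a current .NET runtime, and process CPU time has coarse resolution on Windows (roughly 15 ms), so very cheap iterations may need to be batched before the numbers mean anything.

```csharp
using System;
using System.Diagnostics;

// Hypothetical helper for checking whether per-iteration cost grows with iteration count.
public static class GrowthProbe
{
    // Runs `iteration` `count` times and returns the process CPU time spent in each run.
    public static TimeSpan[] Measure(Action iteration, int count)
    {
        TimeSpan[] costs = new TimeSpan[count];
        using (Process self = Process.GetCurrentProcess())
        {
            for (int i = 0; i < count; i++)
            {
                self.Refresh(); // discard any cached process data
                TimeSpan before = self.TotalProcessorTime;
                iteration();
                self.Refresh();
                costs[i] = self.TotalProcessorTime - before;
            }
        }
        return costs;
    }

    // Total cost of the last quarter of runs divided by the total for the first quarter.
    // Expects at least one entry; returns NaN if the early runs were too cheap to measure.
    public static double GrowthRatio(TimeSpan[] costs)
    {
        int n = Math.Max(1, costs.Length / 4);
        double first = 0, last = 0;
        for (int i = 0; i < n; i++)
        {
            first += costs[i].Ticks;
            last += costs[costs.Length - 1 - i].Ticks;
        }
        return first == 0 ? double.NaN : last / first;
    }
}
```

A ratio that stays near 1 over a few thousand accelerated iterations would suggest the growth is tied to elapsed time or the environment; a ratio that climbs steadily points at state accumulating per iteration.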
Answered by Ian Nelson
You mentioned that you're using NHibernate. Are you closing your NHibernate sessions at appropriate points (such as the end of each iteration)?
If not, then the size of the object map loaded into memory will be gradually increasing over time, and each session flush will take increasingly more CPU time.
Answered by Joshdan
Good suggestions, but rest assured, we have tried all of the usual troubleshooting. What I'm hoping is that this is a .NET issue that someone might know about, that we can work on solving.
My feeling is that no matter how bizarre the underlying cause, the usual troubleshooting steps are your best bet for locating the issue.
Since this is a performance issue, good measurements are invaluable. The overall process CPU usage is far too broad a measurement. Where is your service spending its time? You could use a profiler to measure this, or just log the start and stop of various sections. If you aren't able to do even that, then use Andrea Bertani's suggestion -- isolate sections by removing others.
Once you've located the general area, then you can make even finer-grained measurements, until you sort out the source of the CPU usage. If it's not obvious how to fix it at that point, you at least have ammunition for a much more specific question.
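Logging section starts and stops can be as light as a disposable timer wrapped around each block. This is a sketch under assumptions: `SectionTimer` and the `report` callback are hypothetical, it targets a current .NET runtime (`Stopwatch` does not exist in .NET 1.1, where `DateTime.Now` differences would have to stand in), and the CPU figure is process-wide, so it also includes whatever other threads did during the section.

```csharp
using System;
using System.Diagnostics;

// Hypothetical helper: measures wall-clock and process CPU time for a section of code.
public sealed class SectionTimer : IDisposable
{
    private readonly string _name;
    private readonly Action<string, TimeSpan, TimeSpan> _report; // (section, wall time, CPU time)
    private readonly Stopwatch _wall;
    private readonly TimeSpan _cpuAtStart;

    public SectionTimer(string name, Action<string, TimeSpan, TimeSpan> report)
    {
        _name = name;
        _report = report;
        _cpuAtStart = CpuNow();
        _wall = Stopwatch.StartNew();
    }

    // Reports when the using-block ends, including when it ends with an exception.
    public void Dispose()
    {
        _wall.Stop();
        _report(_name, _wall.Elapsed, CpuNow() - _cpuAtStart);
    }

    private static TimeSpan CpuNow()
    {
        using (Process self = Process.GetCurrentProcess())
        {
            return self.TotalProcessorTime; // user + kernel time for the whole process
        }
    }
}
```

Usage would be along the lines of `using (new SectionTimer("poll-db", Log)) { ... }`, where `Log` is whatever logging method the service already has. A section whose wall time is tiny but whose CPU delta keeps growing from day to day is the one to dig into.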
If you have in fact already done all this usual troubleshooting, please do let us in on the secret.
Answered by Mark Brackett
Here's where I'd start:
- Get Process Explorer and show %Time in JIT, %Time in GC, CPU Cycles Delta, CPU Time, CPU %, and Threads.
- You'll also want kernel and user time, and a couple of representative stack traces but I think you have to hit Properties to get snapshots.
- Compare before and after shots.
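Since the slowdown arrives on no fixed schedule, it may help to have the service record comparable "before and after" numbers itself. The sketch below is an in-process complement to Process Explorer, not a replacement for it: `ProcessSnapshot` is a hypothetical class, it targets a current .NET runtime, and it covers only user/kernel time, thread and handle counts, and GC collection counts (not the JIT or %Time in GC columns). `HandleCount` may report 0 on non-Windows platforms.

```csharp
using System;
using System.Diagnostics;

// Hypothetical helper: captures a few process-level numbers so two points in time can be compared.
public sealed class ProcessSnapshot
{
    public readonly DateTime TakenUtc;
    public readonly TimeSpan UserTime;
    public readonly TimeSpan KernelTime;
    public readonly int ThreadCount;
    public readonly int HandleCount;
    public readonly int Gen0Collections;
    public readonly int Gen1Collections;
    public readonly int Gen2Collections;

    private ProcessSnapshot()
    {
        TakenUtc = DateTime.UtcNow;
        using (Process self = Process.GetCurrentProcess())
        {
            UserTime = self.UserProcessorTime;
            KernelTime = self.PrivilegedProcessorTime;
            ThreadCount = self.Threads.Count;
            HandleCount = self.HandleCount;
        }
        Gen0Collections = GC.CollectionCount(0);
        Gen1Collections = GC.CollectionCount(1);
        Gen2Collections = GC.CollectionCount(2);
    }

    public static ProcessSnapshot Take()
    {
        return new ProcessSnapshot();
    }

    // One log line describing what changed since an earlier snapshot.
    public string DeltaSince(ProcessSnapshot earlier)
    {
        return string.Format(
            "user +{0:F0} ms, kernel +{1:F0} ms, threads {2} -> {3}, handles {4} -> {5}, GCs gen0/1/2 +{6}/+{7}/+{8}",
            (UserTime - earlier.UserTime).TotalMilliseconds,
            (KernelTime - earlier.KernelTime).TotalMilliseconds,
            earlier.ThreadCount, ThreadCount,
            earlier.HandleCount, HandleCount,
            Gen0Collections - earlier.Gen0Collections,
            Gen1Collections - earlier.Gen1Collections,
            Gen2Collections - earlier.Gen2Collections);
    }
}
```

Taking a snapshot at the start of each timer iteration and logging `DeltaSince` against the previous one gives a day-long record without anyone needing to be at the machine when usage climbs.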
A couple of thoughts on possibilities:
- excessive GC (% Time in GC going up. Also, Perfmon GC and CPU counters would correspond)
- excessive threads and associated context switches (# of threads going up)
- polling (stack traces are consistently caught in a single function)
- excessive kernel time (kernel times are high - Task Manager shows large kernel time numbers when CPU is high)
- exceptions (PE .NET tab Exceptions thrown is high and getting higher. There's also a Perfmon counter)
- virus/rootkit (OK, this is a last ditch scenario - but it is possible to construct a rootkit that hides from TaskManager. I'd suspect that you could then allocate your inevitable CPU usage to another process if you were cunning enough. Besides, if you've ruled out all of the above, I'm out of ideas right now)