C++: heap corruption under Win32; how to locate?
Notice: this page reproduces a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverFlow
Original URL: http://stackoverflow.com/questions/1069/
Heap corruption under Win32; how to locate?
Asked by Josh
I'm working on a multithreaded C++ application that is corrupting the heap. The usual tools to locate this corruption seem to be inapplicable. Old builds (18 months old) of the source code exhibit the same behaviour as the most recent release, so this has been around for a long time and just wasn't noticed; on the downside, source deltas can't be used to identify when the bug was introduced - there are a lot of code changes in the repository.
The prompt for crashing behaviour is to generate throughput in this system - socket transfer of data which is munged into an internal representation. I have a set of test data that will periodically cause the app to throw an exception (various places, various causes - including heap alloc failing, thus: heap corruption).
The behaviour seems related to CPU power or memory bandwidth; the more of each the machine has, the easier it is to crash. Disabling a hyper-threading core or a dual-core core reduces the rate of (but does not eliminate) corruption. This suggests a timing related issue.
Now here's the rub:
When it's run under a lightweight debug environment (say Visual Studio 98 / AKA MSVC6) the heap corruption is reasonably easy to reproduce - ten or fifteen minutes pass before something fails horrendously with an exception, such as a failed alloc; when running under a sophisticated debug environment (Rational Purify, VS2008/MSVC9 or even Microsoft Application Verifier) the system becomes memory-speed bound and doesn't crash (memory-bound: CPU is not getting above 50%, disk light is not on, the program's going as fast as it can, box consuming 1.3G of 2G of RAM). So, I've got a choice between being able to reproduce the problem (but not identify the cause) or being able to identify the cause of a problem I can't reproduce.
My current best guesses as to where to go next are:
- Get an insanely grunty box (to replace the current dev box: 2Gb RAM in an E6550 Core2 Duo); this will make it possible to repro the crash causing mis-behaviour when running under a powerful debug environment; or
- Rewrite operators new and delete to use VirtualAlloc and VirtualProtect to mark memory as read-only as soon as it's done with. Run under MSVC6 and have the OS catch the bad-guy who's writing to freed memory. Yes, this is a sign of desperation: who the hell rewrites new and delete?! I wonder if this is going to make it as slow as under Purify et al.
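For illustration only, the second idea could be sketched roughly like this. The helper names (guarded_alloc, guarded_free, round_to_pages) are made up for this sketch and do not come from the question; the non-Windows branch uses mmap/mprotect purely so the sketch also builds elsewhere. It assumes the caller still knows the block size at free time, which a real operator delete replacement would have to store in a header.

```cpp
#include <cassert>
#include <cstddef>

#ifdef _WIN32
#include <windows.h>
#else
#include <sys/mman.h>
#include <unistd.h>
#endif

// Hypothetical helpers: each allocation gets its own page range, and
// "freeing" revokes write access instead of recycling the memory.

std::size_t page_size() {
#ifdef _WIN32
    SYSTEM_INFO si;
    GetSystemInfo(&si);
    return si.dwPageSize;
#else
    return static_cast<std::size_t>(sysconf(_SC_PAGESIZE));
#endif
}

// Round a request up to a whole number of pages.
std::size_t round_to_pages(std::size_t bytes) {
    const std::size_t page = page_size();
    return ((bytes + page - 1) / page) * page;
}

// Allocate whole pages, readable and writable. Returns nullptr on failure.
void* guarded_alloc(std::size_t bytes) {
    const std::size_t len = round_to_pages(bytes ? bytes : 1);
#ifdef _WIN32
    return VirtualAlloc(nullptr, len, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
#else
    void* p = mmap(nullptr, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? nullptr : p;
#endif
}

// "Free" by making the pages read-only. They are deliberately never reused,
// so a later write through a stale pointer faults at the offending instruction.
bool guarded_free(void* p, std::size_t bytes) {
    const std::size_t len = round_to_pages(bytes ? bytes : 1);
#ifdef _WIN32
    DWORD old_protect = 0;
    return VirtualProtect(p, len, PAGE_READONLY, &old_protect) != 0;
#else
    return mprotect(p, len, PROT_READ) == 0;
#endif
}
```

Every allocation costs at least one page and nothing is ever returned to the system, so this is only plausible for a short, targeted debugging run.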
And, no: Shipping with Purify instrumentation built in is not an option.
A colleague just walked past and asked "Stack Overflow? Are we getting stack overflows now?!?"
And now, the question: How do I locate the heap corruptor?
Update: balancing new[] and delete[] seems to have gotten a long way towards solving the problem. Instead of 15 minutes, the app now goes about two hours before crashing. Not there yet. Any further suggestions? The heap corruption persists.
Update: a release build under Visual Studio 2008 seems dramatically better; current suspicion rests on the STL implementation that ships with VS98.
- Reproduce the problem. Dr Watson will produce a dump that might be helpful in further analysis.
I'll take a note of that, but I'm concerned that Dr Watson will only be tripped up after the fact, not when the heap is getting stomped on.
Another try might be using WinDebug as a debugging tool which is quite powerful being at the same time also lightweight.
Got that going at the moment, again: not much help until something goes wrong. I want to catch the vandal in the act.
Maybe these tools will allow you at least to narrow the problem to certain component.
I don't hold much hope, but desperate times call for...
And are you sure that all the components of the project have correct runtime library settings (C/C++ tab, Code Generation category in VS 6.0 project settings)?
No I'm not, and I'll spend a couple of hours tomorrow going through the workspace (58 projects in it) and checking they're all compiling and linking with the appropriate flags.
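Update: This took 30 seconds. Select all projects in the Settings dialog, then unselect until you find the project(s) that don't have the right settings (they all had the right settings).
Accepted answer by Josh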
My first choice would be a dedicated heap tool such as pageheap.exe.
Rewriting new and delete might be useful, but that doesn't catch the allocs committed by lower-level code. If this is what you want, better to Detour the low-level alloc APIs using Microsoft Detours.
Also sanity checks such as: verify your run-time libraries match (release vs. debug, multi-threaded vs. single-threaded, dll vs. static lib), look for bad deletes (eg, delete where delete [] should have been used), make sure you're not mixing and matching your allocs.
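As a small illustration of the "bad deletes" check, here is a sketch of the array form paired correctly, plus two owner types that pick the matching form automatically. The Tracked type is hypothetical, and std::unique_ptr needs a C++11 compiler, so this would not build on VC6 as-is.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical type that counts live instances, so unbalanced
// construction/destruction shows up as a non-zero counter.
struct Tracked {
    static int live;
    Tracked() { ++live; }
    ~Tracked() { --live; }
};
int Tracked::live = 0;

// Array form paired correctly: new[] with delete[].
void manual_pairing(std::size_t n) {
    Tracked* items = new Tracked[n];
    // ... use items ...
    delete[] items;  // plain `delete items;` here would be undefined behaviour
}

// Owner types choose the matching release form for you.
void owned_pairing(std::size_t n) {
    std::unique_ptr<Tracked[]> a(new Tracked[n]);  // releases with delete[]
    std::vector<Tracked> b(n);                     // manages its own storage
}
```

Grepping for every `new ...[` and checking that its release is `delete[]` is tedious but mechanical; wrapping the allocation in an owner removes the chance of getting it wrong.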
Also try selectively turning off threads and see when/if the problem goes away.
What does the call stack etc look like at the time of the first exception?
Answer by Michal Sznajder
I have the same problems in my work (we also use VC6 sometimes). And there is no easy solution for it. I have only some hints:
- Try with automatic crash dumps on production machine (see Process Dumper). My experience says Dr. Watson is not perfect for dumping.
- Remove all catch(...) from your code. They often hide serious memory exceptions.
- Check Advanced Windows Debugging - there are lots of great tips for problems like yours. I recommend this with all my heart.
- If you use STL, try STLPort and checked builds. Invalid iterators are hell.
Good luck. Problems like yours take us months to solve. Be ready for this...
Answer by Tal
Run the original application with ADplus -crash -pn appnename.exe
When the memory issue pops up you will get a nice big dump.
You can analyze the dump to figure what memory location was corrupted.
If you are lucky, the overwritten memory holds a unique string and you can figure out where it came from. If you are not lucky, you will need to dig into the win32 heap and figure out what the original memory characteristics were. (!heap -x might help)
After you know what was messed up, you can narrow appverifier usage with special heap settings, i.e. you can specify what DLL you monitor, or what allocation size to monitor.
Hopefully this will speed up the monitoring enough to catch the culprit.
In my experience, I never needed full heap verifier mode, but I spent a lot of time analyzing the crash dump(s) and browsing sources.
P.S: You can use DebugDiag to analyze the dumps. It can point out the DLL owning the corrupted heap, and give you other useful details.
Answer by Graeme Perrow
We've had pretty good luck by writing our own malloc and free functions. In production, they just call the standard malloc and free, but in debug, they can do whatever you want. We also have a simple base class that does nothing but override the new and delete operators to use these functions, then any class you write can simply inherit from that class. If you have a ton of code, it may be a big job to replace calls to malloc and free with the new malloc and free (don't forget realloc!), but in the long run it's very helpful.
In Steve Maguire's book Writing Solid Code (highly recommended), there are examples of debug stuff that you can do in these routines, like:
- Keep track of allocations to find leaks
- Allocate more memory than necessary and put markers at the beginning and end of memory -- during the free routine, you can ensure these markers are still there
- memset the memory with a marker on allocation (to find usage of uninitialized memory) and on free (to find usage of free'd memory)
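The marker and memset ideas in the list above might look roughly like this. It is a minimal sketch, not the code from the answer or the book: the dbg namespace, the function names and the fill bytes are all assumptions, and it skips overflow checks on the size arithmetic.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>

// All names and byte patterns here are illustrative assumptions.
namespace dbg {

const unsigned char kGuard = 0xFD;   // written before and after the user block
const unsigned char kFresh = 0xCD;   // fill for newly allocated memory
const unsigned char kDead  = 0xDD;   // fill for memory being freed
const std::size_t kHeaderLen = 16;   // stores the size; padded to keep alignment
const std::size_t kGuardLen  = 16;

// Layout: [size header][front guard][user bytes][back guard]
void* debug_malloc(std::size_t n) {
    unsigned char* raw = static_cast<unsigned char*>(
        std::malloc(kHeaderLen + kGuardLen + n + kGuardLen));
    if (!raw) return nullptr;
    std::memcpy(raw, &n, sizeof n);                     // remember the size
    std::memset(raw + kHeaderLen, kGuard, kGuardLen);   // front guard
    unsigned char* user = raw + kHeaderLen + kGuardLen;
    std::memset(user, kFresh, n);                       // flags uninitialised use
    std::memset(user + n, kGuard, kGuardLen);           // back guard
    return user;
}

// Read back the size stored in the header of a block from debug_malloc.
std::size_t stored_size(const unsigned char* user) {
    std::size_t n;
    std::memcpy(&n, user - kGuardLen - kHeaderLen, sizeof n);
    return n;
}

// Returns true if both guards around the block are untouched.
bool guards_intact(const void* p) {
    const unsigned char* user = static_cast<const unsigned char*>(p);
    const unsigned char* front = user - kGuardLen;
    const unsigned char* back = user + stored_size(user);
    for (std::size_t i = 0; i < kGuardLen; ++i) {
        if (front[i] != kGuard || back[i] != kGuard) return false;
    }
    return true;
}

void debug_free(void* p) {
    if (!p) return;
    assert(guards_intact(p) && "heap block overrun or underrun detected");
    unsigned char* user = static_cast<unsigned char*>(p);
    std::memset(user, kDead, stored_size(user));        // flags use after free
    std::free(user - kGuardLen - kHeaderLen);
}

}  // namespace dbg
```

Note that this catches the damage only when the block is freed (or when guards_intact is called explicitly), so it narrows down which block was stomped rather than who did it.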
Another good idea is to never use things like strcpy, strcat, or sprintf -- always use strncpy, strncat, and snprintf. We've written our own versions of these as well, to make sure we don't write off the end of a buffer, and these have caught lots of problems too.
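A home-grown bounded copy of the kind described might look like the sketch below. The name checked_copy and its return convention are assumptions, not the answer's actual code. One reason such wrappers earn their keep: strncpy does not null-terminate the destination when the source is too long, so the wrapper does it explicitly.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical wrapper: copies at most dst_size - 1 characters, always
// null-terminates, and reports whether the whole of src fitted.
bool checked_copy(char* dst, std::size_t dst_size, const char* src) {
    if (dst == nullptr || src == nullptr || dst_size == 0) return false;
    const std::size_t src_len = std::strlen(src);
    const std::size_t n = src_len < dst_size - 1 ? src_len : dst_size - 1;
    std::memcpy(dst, src, n);
    dst[n] = '\0';
    return src_len < dst_size;  // false means the text was truncated
}
```

In a debug build the truncation case could assert or log instead of silently returning false, which is how a wrapper like this ends up catching problems.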
Answer by Constantin
You should attack this problem with both runtime and static analysis.
For static analysis consider compiling with PREfast (cl.exe /analyze). It detects mismatched delete and delete[], buffer overruns and a host of other problems. Be prepared, though, to wade through many kilobytes of L6 warnings, especially if your project still has L4 not fixed.
PREfast is available with Visual Studio Team System and, apparently, as part of Windows SDK.
Answer by Steve Steiner
Is this in low memory conditions? If so it might be that new is returning NULL rather than throwing std::bad_alloc. Older VC++ compilers didn't properly implement this. There is an article about Legacy memory allocation failures crashing STL apps built with VC6.
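To make the two failure styles concrete, here is a sketch in standard C++ (the function names are hypothetical). Code written for one style and run against the other is the hazard: a caller that expects an exception will happily dereference a null pointer.

```cpp
#include <cassert>
#include <cstddef>
#include <new>

// Style 1: standard behaviour, where a failed new throws std::bad_alloc.
// The exception is translated into a result the caller can check.
char* alloc_throwing(std::size_t n) {
    try {
        return new char[n];
    } catch (const std::bad_alloc&) {
        return nullptr;
    }
}

// Style 2: std::nothrow asks for a null pointer instead of an exception.
char* alloc_nothrow(std::size_t n) {
    return new (std::nothrow) char[n];
}

// Defensive caller: checks the result before touching the memory.
bool fill_buffer(std::size_t n) {
    char* buf = alloc_nothrow(n);
    if (buf == nullptr) return false;  // never use an unchecked result
    for (std::size_t i = 0; i < n; ++i) buf[i] = 0;
    delete[] buf;
    return true;
}
```

On a compiler where plain new may return NULL, library code that assumes the throwing behaviour needs either the null check added or a new-handler that throws.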
Answer by Ignas Limanauskas
The apparent randomness of the memory corruption sounds very much like a thread synchronization issue - a bug is reproduced depending on machine speed. If objects (chunks of memory) are shared among threads and synchronization primitives (critical section, mutex, semaphore, other) are not on a per-object basis, then it is possible to come to a situation where an object (chunk of memory) is deleted / freed while in use, or used after being deleted / freed.
As a test for that, you could add synchronization primitives to each class and method. This will make your code slower because many objects will have to wait for each other, but if this eliminates the heap corruption, your heap-corruption problem will become a code optimization one.
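A sketch of what "a synchronization primitive in each class and method" could mean in practice. SharedBuffer is a hypothetical class, and std::mutex is a C++11 facility; on VC6 the same shape would use a CRITICAL_SECTION member instead.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <vector>

// Hypothetical shared object: every method takes the object's own lock,
// so no thread can observe or modify the buffer in the middle of an update.
class SharedBuffer {
public:
    void append(int value) {
        std::lock_guard<std::mutex> lock(mutex_);
        data_.push_back(value);
    }
    void clear() {
        std::lock_guard<std::mutex> lock(mutex_);
        data_.clear();
    }
    std::size_t size() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return data_.size();
    }

private:
    mutable std::mutex mutex_;  // one lock per object
    std::vector<int> data_;
};
```

This serialises access to each object but does nothing about object lifetime: if one thread can delete a SharedBuffer while another still holds a pointer to it, the lock inside the object cannot help.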
Answer by Mat Noguchi
So from the limited information you have, this can be a combination of one or more things:
- Bad heap usage, i.e., double frees, read after free, write after free, setting the HEAP_NO_SERIALIZE flag with allocs and frees from multiple threads on the same heap
- Out of memory
- Bad code (i.e., buffer overflows, buffer underflows, etc.)
- "Timing" issues
If it's any of the first two but not the last, you should have caught it by now with pageheap.exe.
Which most likely means it is due to how the code is accessing shared memory. Unfortunately, tracking that down is going to be rather painful. Unsynchronized access to shared memory often manifests as weird "timing" issues. Things like not using acquire/release semantics for synchronizing access to shared memory with a flag, not using locks appropriately, etc.
At the very least, it would help to be able to track allocations somehow, as was suggested earlier. At least then you can view what actually happened up until the heap corruption and attempt to diagnose from that.
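One possible shape for such tracking, as a sketch only: a table of live blocks that records where each was allocated and refuses to free anything it does not know about. AllocTracker and its methods are hypothetical names, and the mutex makes the tracker itself safe to call from several threads.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <map>
#include <mutex>

// Hypothetical tracker: remembers every live block and where it came from.
class AllocTracker {
public:
    void* allocate(std::size_t n, const char* where) {
        void* p = std::malloc(n);
        if (p) {
            std::lock_guard<std::mutex> lock(mutex_);
            live_[p] = Record{n, where};
        }
        return p;
    }

    // Returns false (and frees nothing) for a pointer that is not live:
    // a double free, or a pointer this tracker never handed out.
    bool release(void* p) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            std::map<void*, Record>::iterator it = live_.find(p);
            if (it == live_.end()) return false;
            live_.erase(it);
        }
        std::free(p);
        return true;
    }

    // Number of blocks allocated but not yet released (leak candidates).
    std::size_t live_blocks() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return live_.size();
    }

private:
    struct Record {
        std::size_t size;
        const char* where;  // e.g. a string literal naming the call site
    };
    mutable std::mutex mutex_;
    std::map<void*, Record> live_;
};
```

Logging each allocate and release with a thread id would give the history leading up to the corruption, at the cost of changing the timing that triggers it.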
Also, if you can easily redirect allocations to multiple heaps, you might want to try that to see if that either fixes the problem or results in more reproducible buggy behavior.
When you were testing with VS2008, did you run with HeapVerifier with Conserve Memory set to Yes? That might reduce the performance impact of the heap allocator. (Plus, you have to run with it Debug->Start with Application Verifier, but you may already know that.)
You can also try debugging with Windbg and various uses of the !heap command.
MSN
Answer by Piotr Tyburski
My first action would be as follows:
- Build the binaries in "Release" version but with a debug info file created (you will find this option in project settings).
- Use Dr Watson as a default debugger (DrWtsn32 -I) on a machine on which you want to reproduce the problem.
- Reproduce the problem. Dr Watson will produce a dump that might be helpful in further analysis.
Another try might be using WinDebug as a debugging tool which is quite powerful being at the same time also lightweight.
Maybe these tools will allow you at least to narrow the problem to certain component.
And are you sure that all the components of the project have correct runtime library settings (C/C++ tab, Code Generation category in VS 6.0 project settings)?
Answer by Mike Stone
You tried old builds, but is there a reason you can't keep going further back in the repository history and seeing exactly when the bug was introduced?
Otherwise, I would suggest adding simple logging of some kind to help track down the problem, though I am at a loss as to what specifically you might want to log.
If you can find out what exactly CAN cause this problem, via google and documentation of the exceptions you are getting, maybe that will give further insight on what to look for in the code.