C++: Solving random crashes

Question

Asked by speeder

I am getting random crashes in my C++ application. It may not crash for a month and then crash 10 times in an hour; sometimes it crashes on launch, and sometimes after several hours of operation (or not at all).

I use GCC on GNU/Linux and MinGW on Windows, so I can't use the Visual Studio JIT Debug...

I have no idea how to proceed. Reading through the code at random would not work: the code is HUGE (a good part of it is not my work, and it contains a fair amount of legacy code), and I don't have a clue how to reproduce the crash.

EDIT: Lots of people mentioned dumps... how do I make a core dump, minidump, or whatever-dump? This is the first time I have needed post-mortem debugging.

EDIT2: Actually, DrMingw captured a call stack, but no memory info... Unfortunately, the call stack didn't help me much, because near the end it suddenly goes into some library (or something) for which I don't have debug info, leaving only hexadecimal numbers... So I still need a decent dump that gives more information (especially about what was in memory... specifically, what was at the address that caused the "access violation" error).

Also, my application uses Lua and Luabind. Maybe the error is caused by a .lua script, but I have no idea how to debug that.

Answer 1

Answered by Mitch Wheat

Try Valgrind (it's free and open source):

The Valgrind distribution currently includes six production-quality tools: a memory error detector, two thread error detectors, a cache and branch-prediction profiler, a call-graph generating cache profiler, and a heap profiler. It also includes two experimental tools: a heap/stack/global array overrun detector, and a SimPoint basic block vector generator. It runs on the following platforms: X86/Linux, AMD64/Linux, PPC32/Linux, PPC64/Linux, and X86/Darwin (Mac OS X).

Valgrind Frequently Asked Questions

The Memcheck part of the package is probably the place to start:

Memcheck is a memory error detector. It can detect the following problems that are common in C and C++ programs.
Accessing memory you shouldn't, e.g. overrunning and underrunning heap blocks, overrunning the top of the stack, and accessing memory after it has been freed.
Using undefined values, i.e. values that have not been initialised, or that have been derived from other undefined values.
Incorrect freeing of heap memory, such as double-freeing heap blocks, or mismatched use of malloc/new/new[] versus free/delete/delete[]
Overlapping src and dst pointers in memcpy and related functions.
Memory leaks.

Answer 2

Answered by user239558

First, you are lucky that your process crashes multiple times in a short time period. That should make it easy to proceed.

This is how you proceed.

Get a crash dump
Isolate a set of potential suspicious functions
Tighten up state checking
Repeat

Get a crash dump

First, you really need to get a crash dump.

If you don't get crash dumps when it crashes, start with writing a test that produces reliable crash dumps.

Re-compile the binary with debug symbols or make sure that you can analyze the crash dump with debug symbols.

Find suspicious functions

Given that you have a crash dump, look at it in gdb or your favorite debugger and remember to show all threads! It might not be the thread you see in gdb that is buggy.

Looking at where gdb says your binary crashed, isolate some set of functions you think might cause the problem.

Looking at multiple crashes and isolating code sections that are commonly active in all of the crashes is a real time-saver.

Tighten up state checking

A crash usually happens because of some inconsistent state. The best way to proceed is often to tighten the state requirements. You do this in the following way.

For each function you think might cause the problem, document what legal state the input or the object must have on entry to the function. (Do the same for what legal state it must have on exit from the function, but that's not too important).

If the function contains a loop, document the legal state it needs to have at the beginning of each loop iteration.

Add asserts for all of these legal-state conditions.
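To make the steps above concrete, here is a minimal sketch of what entry, exit, and loop asserts could look like. The `RingBuffer` class and its `is_valid()` helper are hypothetical names invented for illustration; they are not from the asker's code, and the real invariants will depend on the functions you suspect.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical example type: a fixed-capacity ring buffer of ints.
class RingBuffer {
public:
    explicit RingBuffer(std::size_t capacity)
        : data_(capacity), head_(0), size_(0) {
        assert(capacity > 0);
        assert(is_valid());
    }

    void push(int value) {
        assert(is_valid());            // legal state on entry
        assert(size_ < data_.size());  // precondition: buffer is not full
        data_[(head_ + size_) % data_.size()] = value;
        ++size_;
        assert(is_valid());            // legal state on exit
    }

    int pop() {
        assert(is_valid());
        assert(size_ > 0);             // precondition: buffer is not empty
        int value = data_[head_];
        head_ = (head_ + 1) % data_.size();
        --size_;
        assert(is_valid());
        return value;
    }

    long sum() const {
        assert(is_valid());
        long total = 0;
        for (std::size_t i = 0; i < size_; ++i) {
            // Loop invariant: every index we read lies inside the storage.
            std::size_t idx = (head_ + i) % data_.size();
            assert(idx < data_.size());
            total += data_[idx];
        }
        return total;
    }

    std::size_t size() const { return size_; }

private:
    // The documented "legal state" of the object, written as one predicate.
    bool is_valid() const {
        return !data_.empty() && head_ < data_.size() && size_ <= data_.size();
    }

    std::vector<int> data_;
    std::size_t head_;
    std::size_t size_;
};
```

Note that `assert` compiles to nothing when `NDEBUG` is defined, so these checks only help in builds where it is not defined.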

Repeat

Then repeat the process. If it still crashes outside of your asserts, tighten the asserts further. At some point the process will crash on an assert and not because of some random crash. At this point you can concentrate on trying to figure out what made your program go from a legal state on entry to the function, to an illegal state at the point where the assert happened.

If you pair the asserts with verbose logging, it should be easier to follow what the program does.
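One possible way to pair a state check with logging is a small macro that prints the failed expression, its location, and some caller-supplied context before aborting. `CHECK_STATE` and `format_check_failure` are hypothetical names made up for this sketch, not part of any library.

```cpp
#include <cstdlib>
#include <iostream>
#include <sstream>
#include <string>

// Hypothetical helper: builds the log line for a failed check.
inline std::string format_check_failure(const char* expr, const char* file,
                                        int line, const std::string& context) {
    std::ostringstream out;
    out << file << ':' << line << ": state check failed: " << expr;
    if (!context.empty()) {
        out << " [" << context << ']';
    }
    return out.str();
}

// Hypothetical macro: like assert, but logs extra context first.
// The context expression is only evaluated when the check fails.
#define CHECK_STATE(cond, context)                                       \
    do {                                                                 \
        if (!(cond)) {                                                   \
            std::cerr << format_check_failure(#cond, __FILE__, __LINE__, \
                                              (context))                 \
                      << std::endl; /* endl flushes before the abort */  \
            std::abort();                                                \
        }                                                                \
    } while (0)
```

A call might look like `CHECK_STATE(index < count, "index out of range in update()")`. Because `std::abort` raises SIGABRT, a failed check should also leave a core dump on systems where core dumps are enabled.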

Answer 3

Answered by Nicholas Knight

If all else fails (particularly if performance under the debugger is unacceptable), use extensive logging. Start with the entry points: is the app transactional? Log each transaction as it comes in. Log all the constructor calls for your key objects. Since the crash is so intermittent, log calls to all the functions that might not get called every day.

You'll at least start narrowing down where the crash could be.

Answer 4

Answered by sharptooth

Start the program under a debugger (I'm sure there is a debugger that works with GCC and MinGW) and wait until it crashes. At the point of the crash you will be able to see which specific action is failing and look into the assembly code, registers, and memory state. This will often help you find the cause of the problem.

Answer 5

Answered by ereOn

Where I work, crashing programs usually generate a core dump file that can be loaded in windbg.

We then have an image of the memory at the time the program crashed. There's not much you can do with it, but at least it gives you the last call stack. Once you know which function crashed, you might be able to track down the problem, or at least reduce it to a more reproducible test case.

Answer 6

Answered by froh42

It sounds like your program is suffering from memory corruption. As already said, your best option on Linux is probably valgrind. But here are two other options:

First of all, use a debug malloc. Nearly all C libraries offer a debug malloc implementation that initializes memory (normal malloc keeps "old" contents in memory), checks the boundaries of an allocated block for corruption, and so on. And if that's not enough, there is a wide choice of third-party implementations.
You might want to have a look at VMware Workstation. I have not set it up that way, but according to their marketing materials they support a rather interesting way of debugging: run the debuggee in a "recording" virtual machine. When memory corruption occurs, set a memory breakpoint at the corrupted address and then turn back time in the VM to exactly the moment when that piece of memory was overwritten. See this PDF on how to set up replay debugging with Linux/gdb. I believe there is a 15- or 30-day demo for Workstation 7, which might be enough to shake those bugs out of your code.
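To illustrate what a debug malloc does, here is a toy sketch of the guard-byte idea: fresh memory is filled with a recognizable pattern, and padding on both sides of the block is checked for overwrites. Everything in the `dbgmem` namespace is hypothetical and written for this example; real debug allocators are considerably more thorough.

```cpp
#include <cstddef>
#include <cstdlib>
#include <cstring>

namespace dbgmem {

// Block layout: [header: user size][front guard][user bytes][rear guard]
const std::size_t kHeaderSize = 16;     // keeps user bytes 16-byte aligned
const std::size_t kGuardSize = 16;      // padding on each side of the block
const unsigned char kGuardByte = 0xFD;  // pattern expected in the padding
const unsigned char kFillByte = 0xCD;   // pattern for fresh user memory

static_assert(sizeof(std::size_t) <= kHeaderSize, "header too small");

// Allocates `size` user bytes surrounded by guard bytes.
inline void* alloc(std::size_t size) {
    unsigned char* raw = static_cast<unsigned char*>(
        std::malloc(kHeaderSize + kGuardSize + size + kGuardSize));
    if (!raw) return nullptr;
    std::memcpy(raw, &size, sizeof(size));  // remember the user size
    unsigned char* user = raw + kHeaderSize + kGuardSize;
    std::memset(raw + kHeaderSize, kGuardByte, kGuardSize);  // front guard
    std::memset(user, kFillByte, size);  // make "uninitialized" reads visible
    std::memset(user + size, kGuardByte, kGuardSize);        // rear guard
    return user;
}

// Returns true if neither guard region has been overwritten.
inline bool guards_intact(const void* user_ptr) {
    const unsigned char* user = static_cast<const unsigned char*>(user_ptr);
    const unsigned char* raw = user - kGuardSize - kHeaderSize;
    std::size_t size = 0;
    std::memcpy(&size, raw, sizeof(size));
    const unsigned char* front = raw + kHeaderSize;
    const unsigned char* rear = user + size;
    for (std::size_t i = 0; i < kGuardSize; ++i) {
        if (front[i] != kGuardByte || rear[i] != kGuardByte) return false;
    }
    return true;
}

// Frees the block; returns false if corruption was detected beforehand.
inline bool release(void* user_ptr) {
    if (!user_ptr) return true;
    bool ok = guards_intact(user_ptr);
    std::free(static_cast<unsigned char*>(user_ptr) - kGuardSize - kHeaderSize);
    return ok;
}

}  // namespace dbgmem
```

A check like this only detects small overruns, and only at the moment the guards are inspected, which is why tools such as valgrind remain the better first choice where they are available.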

Answer 7

Answered by Justin

These sorts of bugs are always tricky. Unless you can reproduce the error, your only option is to change your application so that extra information is logged, and then wait until the error happens again in the wild.

There is an excellent tool called Process Dumper that you can use to obtain a crash dump of a process that experiences an exception or exits unexpectedly. You could ask users to install it and configure rules for your application.

Alternatively, if you don't want to ask users to install other applications, you could have your application monitor for exceptions and create a dump itself by calling MiniDumpWriteDump.
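A minimal sketch of that approach, assuming a Windows build linked against dbghelp (for example with -ldbghelp under MinGW), might look like the following. `dump_file_name`, `write_minidump`, and `install_crash_dump_handler` are hypothetical names chosen for this example; only `SetUnhandledExceptionFilter`, `CreateFileA`, and `MiniDumpWriteDump` are actual Windows API calls. On other platforms the installer simply does nothing.

```cpp
#include <cstring>
#include <sstream>
#include <string>

#ifdef _WIN32
#include <windows.h>
#include <dbghelp.h>  // link against dbghelp, e.g. -ldbghelp with MinGW
#endif

// Hypothetical naming scheme for dump files: <app>.<pid>.dmp
inline std::string dump_file_name(const std::string& app, unsigned long pid) {
    std::ostringstream name;
    name << app << '.' << pid << ".dmp";
    return name.str();
}

#ifdef _WIN32
// The path is prepared up front so the handler does not allocate memory
// while the process is already in a bad state.
static char g_dump_path[MAX_PATH] = "crash.dmp";

static LONG WINAPI write_minidump(EXCEPTION_POINTERS* info) {
    HANDLE file = CreateFileA(g_dump_path, GENERIC_WRITE, 0, NULL,
                              CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
    if (file != INVALID_HANDLE_VALUE) {
        MINIDUMP_EXCEPTION_INFORMATION mei;
        mei.ThreadId = GetCurrentThreadId();
        mei.ExceptionPointers = info;
        mei.ClientPointers = FALSE;
        // MiniDumpNormal keeps the file small; MiniDumpWithFullMemory also
        // captures process memory at the cost of a much larger dump.
        MiniDumpWriteDump(GetCurrentProcess(), GetCurrentProcessId(), file,
                          MiniDumpNormal, &mei, NULL, NULL);
        CloseHandle(file);
    }
    return EXCEPTION_EXECUTE_HANDLER;
}
#endif

// Installs the handler; returns false on platforms where it is a no-op.
inline bool install_crash_dump_handler(const std::string& app) {
#ifdef _WIN32
    const std::string path = dump_file_name(app, GetCurrentProcessId());
    std::strncpy(g_dump_path, path.c_str(), sizeof(g_dump_path) - 1);
    g_dump_path[sizeof(g_dump_path) - 1] = '\0';
    SetUnhandledExceptionFilter(write_minidump);
    return true;
#else
    (void)app;  // nothing to install outside Windows
    return false;
#endif
}
```

Calling `install_crash_dump_handler("myapp")` early in `main` would be the typical usage. Writing the dump from inside the crashing process is the simplest variant but not the most robust one, since the process state may already be damaged; writing it from a separate helper process is generally considered safer.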

The other option is to improve the logging. However, figuring out what information to log (without just logging everything) can be tricky, so it can take several iterations of crash, then change logging, to hunt down the problem.

As I said, these sorts of bugs are always tricky to diagnose. In my experience it generally involves hours and hours of peering through logs and crash dumps until suddenly you get that eureka moment where everything makes sense. The key is collecting the right information.

Answer 8

Answered by Nordic Mainframe

You've already heard how to handle this under Linux: inspect core dumps and run your code under valgrind. So your first step could be to find the errors under Linux and then check whether they vanish under MinGW. Since nobody has mentioned mudflap here, I will: use mudflap if your Linux distribution supplies it. mudflap helps you catch pointer misuse and buffer overflows by tracking where a pointer is actually allowed to point:

http://gcc.gnu.org/wiki/Mudflap_Pointer_Debugging

And for Windows: there is a JIT debugger for MinGW called DrMingw:

http://code.google.com/p/jrfonseca/wiki/DrMingw

Answer 9

Answered by Douglas Leeder

Run the application on Linux under valgrind to look for memory errors. Random crashes are usually down to memory corruption.

Fix every error you find with valgrind's memcheck tool, and then hopefully the crash will go away.

If the whole program takes too long to run under valgrind, split off functionality into unit tests and run those under valgrind. Hopefully you'll find the memory errors that are causing the problems.

If that doesn't help, make sure core dumps are enabled (check with ulimit -a), and then when it crashes you'll be able to find out where with gdb.
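If changing the shell limit is inconvenient (for example when the program is started by a launcher you don't control), the process can raise its own core-file limit at startup on POSIX systems. The sketch below uses the standard `getrlimit`/`setrlimit` calls; the function names `enable_core_dumps` and `core_dumps_enabled` are hypothetical.

```cpp
#include <sys/resource.h>

// Raises the soft core-file size limit to the hard limit for this process.
// Returns true if the limit could be read and updated.
inline bool enable_core_dumps() {
    struct rlimit lim;
    if (getrlimit(RLIMIT_CORE, &lim) != 0) return false;
    lim.rlim_cur = lim.rlim_max;  // soft limit may go up to the hard limit
    return setrlimit(RLIMIT_CORE, &lim) == 0;
}

// Reports whether the current soft limit allows any core file at all.
inline bool core_dumps_enabled() {
    struct rlimit lim;
    if (getrlimit(RLIMIT_CORE, &lim) != 0) return false;
    return lim.rlim_cur != 0;
}
```

This cannot exceed the hard limit set by the administrator, so `core_dumps_enabled()` may still return false afterwards. Where the core file ends up also depends on system configuration (on Linux, the kernel's core_pattern setting).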

Answer 10

Answered by fhd

That sounds like something tricky, such as a race condition.

I'd suggest you create a debug build and use that. You should also make sure that a core dump is created when the program crashes.

The next time the program crashes, you can launch gdb on the core dump and see where the problem lies. It will probably be a follow-on fault rather than the root cause, but this should get you started.
