Original question: http://stackoverflow.com/questions/10616417/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Debugging core files generated on a Customer's box
Asked by Mohamed Bana
We get core files from running our software on a Customer's box. Unfortunately, because we've always compiled with -O2 without debugging symbols, this has led to situations where we could not figure out why it was crashing. We've modified the builds so that they now compile with -g and -O2 together. We then advise the Customer to run a -g binary so it becomes easier to debug.
I have a few questions:
- What happens when a core file is generated from a Linux distro other than the one we are running in Dev? Is the stack trace even meaningful?
- Are there any good books for debugging on Linux, or Solaris? Something example oriented would be great. I am looking for real-life examples of figuring out why a routine crashed and how the author arrived at a solution. Something more on the intermediate to advanced level would be good, as I have been doing this for a while now. Some assembly would be good as well.
Here's an example of a crash that requires us to tell the Customer to run a -g version of the binary:
Program terminated with signal 11, Segmentation fault.
#0 0xffffe410 in __kernel_vsyscall ()
(gdb) where
#0 0xffffe410 in __kernel_vsyscall ()
#1 0x00454ff1 in select () from /lib/libc.so.6
...
<omitted frames>
Ideally I'd like to find out exactly why the app crashed. I suspect it's memory corruption, but I am not 100% sure.
Remote debugging is strictly not allowed.
Thanks
Accepted answer by Employed Russian
What happens when a core file is generated from a Linux distro other than the one we are running in Dev? Is the stack trace even meaningful?
If the executable is dynamically linked, as yours is, the stack GDB produces will (most likely) not be meaningful.
The reason: GDB knows that your executable crashed by calling something in libc.so.6 at address 0x00454ff1, but it doesn't know what code was at that address. So it looks into your copy of libc.so.6 and discovers that this is in select, so it prints that.
But the chances that 0x00454ff1 is also in select in your customer's copy of libc.so.6 are quite small. Most likely the customer had some other procedure at that address, perhaps abort.
You can use disas select, and observe that 0x00454ff1 is either in the middle of an instruction, or that the previous instruction is not a CALL. If either of these holds, your stack trace is meaningless.
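That check can be scripted. Below is a minimal sketch, assuming gdb is on your PATH; check_return_addr is a hypothetical helper name, and its arguments (executable, core, function, address) are placeholders for your own values. It disassembles the function from your copy of the library and prints the line for the address together with the line just before it.

```shell
# Hypothetical helper: show the instruction at a return address, plus the
# instruction just before it, as GDB disassembles the function from YOUR
# copy of the library.
check_return_addr() {
    exe=$1 core=$2 func=$3 addr=$4
    # Drop the 0x prefix and leading zeros so the match does not depend
    # on how GDB pads addresses.
    hex=$(printf '%s' "$addr" | sed 's/^0[xX]0*//')
    gdb -batch -ex "disassemble $func" "$exe" "$core" 2>/dev/null |
        grep -i -B1 "0x0*$hex <"
}

# Example call (paths are placeholders):
# check_return_addr /path/to/binary core select 0x00454ff1
```

If nothing is printed, the address is not an instruction boundary in that function as your library has it. If two lines are printed and the first one is not a call, the frame is equally suspect.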
You can however help yourself: you just need to get a copy of all libraries that are listed in (gdb) info shared from the customer system. Have the customer tar them up with e.g.
cd /
tar cvhzf to-you.tar.gz lib/libc.so.6 lib/ld-linux.so.2 ...   # h: follow symlinks so the real library files are archived
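If the library list is long, typing it by hand gets error-prone. Here is a rough sketch of one way to automate it, assuming you saved the output of (gdb) info shared to a text file and sent it to the customer. The helper names solib_paths and pack_solibs and the file names are made up for illustration. It also assumes GNU tar and library paths without spaces.

```shell
# Hypothetical helper: print the library paths found in saved
# "info shared" output (the last column, when it is an absolute path).
solib_paths() {
    awk '$NF ~ /^\// { print $NF }' "$1" | sort -u
}

# Hypothetical helper: archive those libraries relative to /.
# -h follows symlinks, so the real library files are stored and not
# just links such as libc.so.6. "$2" should be an absolute path.
pack_solibs() {
    list=$1 out=$2
    solib_paths "$list" | sed 's|^/||' | (cd / && tar -chzf "$out" -T -)
}

# Example call on the customer system (file names are placeholders):
# pack_solibs /tmp/solibs.txt /tmp/to-you.tar.gz
```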
Then, on your system:
mkdir /tmp/from-customer
tar xzf to-you.tar.gz -C /tmp/from-customer
gdb /path/to/binary
(gdb) set solib-absolute-prefix /tmp/from-customer
(gdb) core core # Note: very important to set solib-... before loading core
(gdb) where # Get meaningful stack trace!
We then advise the Customer to run a -g binary so it becomes easier to debug.
A much better approach is:
- build with -g -O2 -o myexe.dbg
- strip -g myexe.dbg -o myexe
- distribute myexe to customers
- when a customer gets a core, use myexe.dbg to debug it
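The first two steps could be wrapped up roughly like this. It is only a sketch: build_release is a hypothetical helper, the source file and output name are placeholders, and the CC and STRIP variables are an assumed way to override the default gcc and strip.

```shell
# Hypothetical helper: build an optimized binary with debug info, then
# produce the stripped copy that goes to customers.
build_release() {
    src=$1 name=$2
    # Optimized build that keeps debug info; this file stays in-house.
    "${CC:-gcc}" -g -O2 -o "$name.dbg" "$src" &&
    # Same code with the debug sections removed; this file is shipped.
    "${STRIP:-strip}" -g "$name.dbg" -o "$name"
}

# Example call (file names are placeholders):
# build_release main.c myexe
# Later, with a customer core:  gdb myexe.dbg core
```

Keep the .dbg file for every release you ship, since a core is only useful together with the exact build that produced it.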
You'll have full symbolic info (file/line, local variables), without having to ship a special binary to the customer, and without revealing too many details about your sources.
Answer by saurabh jindal
As far as I remember, you don't need to ask your customer to run the binary built with the -g option. What is needed is that you have a build made with the -g option. With that you can load the core file and it will show the whole stack trace. I remember that a few weeks ago I created core files with a -g build and with a non -g build, and the size of the core was the same.
Answer by Charlie Martin
You can indeed get useful information from a crash dump, even one from an optimized compile (although it's what is called, technically, "a major pain in the ass"). A -g compile is indeed better, and yes, you can do so even when the machine on which the dump happened is another distribution. Basically, with one caveat, all the important information is contained in the executable and ends up in the dump.
When you match the core file with the executable, the debugger will be able to tell you where the crash occurred and show you the stack. That in itself should help a lot. You should also find out as much as you can about the situation in which it happens -- can they reproduce it reliably? If so, can you reproduce it?
Now, here's the caveat: the place where the notion of "everything is there" breaks down is with shared object files, .so files. If it is failing because of a problem with those, you won't have the symbol tables you need; you may only be able to see what library .so it happens in.
There are a number of books about debugging, but I can't think of one I'd recommend.
Answer by Malkocoglu
Inspect the values of the local variables you see when you walk the stack, especially around the select() call. Do this on the customer's box: just load the dump and walk the stack...
Also, check the value of FD_SETSIZE on both your DEV and PROD platforms!