How does the Linux NMI watchdog work?

Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/9865952/

Date: 2020-08-06 05:23:56. Source: igfitidea

Tags: linux, watchdog, apic

Asked by silverbullettt

I have run into a problem with the Linux NMI watchdog. I want to use it to detect and recover from OS hangs, so I added "nmi_watchdog=1" to grub.cfg. Checking /proc/interrupts afterwards showed NMIs being triggered about once per second. But after I loaded a module containing a deadlock (a spinlock acquired twice), the system hung completely and nothing happened (it never panicked!). It looks like the NMI watchdog did not work.
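
One way to confirm that NMIs are actually arriving is to sample the NMI row of /proc/interrupts twice and compare the per-CPU counts. Below is a minimal Python sketch, assuming the usual layout in which that row starts with `NMI:` followed by one count per CPU. The helper names `parse_nmi_counts`, `nmi_deltas`, and `sample` are made up for illustration.

```python
def parse_nmi_counts(text):
    """Return the per-CPU NMI counts found in /proc/interrupts-style text."""
    for line in text.splitlines():
        fields = line.split()
        if fields and fields[0] == "NMI:":
            counts = []
            for field in fields[1:]:
                if not field.isdigit():
                    break  # the trailing description starts here
                counts.append(int(field))
            return counts
    return []  # no NMI row found


def nmi_deltas(before, after):
    """Per-CPU number of NMIs delivered between two samples."""
    return [b - a for a, b in zip(before, after)]


def sample(path="/proc/interrupts"):
    """Read the current per-CPU NMI counts (Linux only)."""
    with open(path) as f:
        return parse_nmi_counts(f.read())
```

Calling `sample()` twice a few seconds apart and passing both results to `nmi_deltas` shows how many NMIs each CPU received in between. A CPU whose count stops growing is not receiving watchdog NMIs.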

Then I read Documentation/nmi_watchdog.txt, which says:

Be aware that when using local APIC, the frequency of NMI interrupts it generates, depends on the system load. The local APIC NMI watchdog, lacking a better source, uses the "cycles unhalted" event.

What's the "cycles unhalted" event?

It adds:

but if your system locks up on anything but the "hlt" processor instruction, the watchdog will trigger very soon as the "cycles unhalted" event will happen every clock tick... If it locks up on "hlt", then you are out of luck -- the event will not happen at all and the watchdog won't trigger.
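
The behaviour described in the quoted documentation can be made concrete with a toy model. The sketch below is a simulation, not kernel code. It assumes a performance counter that counts only unhalted cycles and raises an NMI each time it reaches a threshold, plus an NMI handler that declares a lockup when the timer-interrupt count has not advanced for several consecutive NMIs. The class name, the threshold values, and the exact lockup rule are illustrative assumptions and are not taken from the kernel source.

```python
class ToyNmiWatchdog:
    """Toy model of a perf-counter-driven NMI watchdog (not kernel code)."""

    def __init__(self, cycles_per_nmi=1000, stuck_limit=5):
        self.cycles_per_nmi = cycles_per_nmi  # assumed counter threshold
        self.stuck_limit = stuck_limit        # NMIs tolerated without progress
        self.counter = 0
        self.timer_irqs = 0   # bumped by the simulated timer interrupt
        self.last_seen = 0    # timer_irqs value observed at the previous NMI
        self.stuck_nmis = 0
        self.nmis = 0
        self.lockup_detected = False

    def timer_interrupt(self):
        """A healthy CPU keeps servicing timer interrupts."""
        self.timer_irqs += 1

    def run(self, cycles, halted=False):
        """Advance the CPU by `cycles`; halted cycles are not counted."""
        if halted:
            return  # "cycles unhalted" never fires while the CPU sits in hlt
        self.counter += cycles
        while self.counter >= self.cycles_per_nmi:
            self.counter -= self.cycles_per_nmi
            self._nmi()

    def _nmi(self):
        """NMI handler: check whether timer interrupts made progress."""
        self.nmis += 1
        if self.timer_irqs == self.last_seen:
            self.stuck_nmis += 1
            if self.stuck_nmis >= self.stuck_limit:
                self.lockup_detected = True
        else:
            self.stuck_nmis = 0
            self.last_seen = self.timer_irqs
```

In this model a CPU that spins with interrupts disabled (repeated `run(1000)` with no `timer_interrupt()` calls) is flagged after `stuck_limit` NMIs, while a CPU stuck in `hlt` (`run(1000, halted=True)`) generates no NMIs at all and is never flagged, which is the limitation the documentation warns about.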

It seems the watchdog won't trigger if the processor executes the "hlt" instruction, so I looked up "hlt" in the "Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 2A", which describes it as follows:

Stops instruction execution and places the processor in a HALT state. An enabled interrupt (including NMI and SMI), a debug exception, the BINIT# signal, the INIT# signal, or the RESET# signal will resume execution.

Then I got lost...

My questions are:

  • How does the Linux NMI watchdog work?
  • What triggers the NMI?

My OS is Ubuntu 10.04 LTS with Linux 2.6.32.21, and the CPU is a dual-core Pentium 4 at 3.20 GHz.

I have not read the whole NMI watchdog source code (no time). If I can't understand how the NMI watchdog works, I want to use a performance monitoring counter interrupt and an inter-processor interrupt (both provided by the APIC) to send NMIs instead of relying on the NMI watchdog.

Could anybody help me? Thanks.

Answered by Johnlcf

As far as I know, nmi_watchdog is only triggered for non-interruptible hangs. I found a code example via Google: http://oslearn.blogspot.in/2011/04/use-nmi-watchdog.html

If your deadlock is not of the non-interruptible kind, you can try enabling SysRq to trigger a task trace (Alt-PrintScreen-t) or a crash (Alt-PrintScreen-c) to get more information.
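
When no keyboard is attached, the same SysRq actions can be requested by writing to procfs. Below is a minimal Python sketch, assuming root privileges and a kernel built with magic SysRq support. `/proc/sys/kernel/sysrq` and `/proc/sysrq-trigger` are the standard kernel interfaces; the helper names are hypothetical. This route only helps while the system can still run user-space code.

```python
def enable_sysrq(path="/proc/sys/kernel/sysrq"):
    """Enable all SysRq functions by writing 1 to the sysctl file."""
    with open(path, "w") as f:
        f.write("1\n")


def trigger_sysrq(key, path="/proc/sysrq-trigger"):
    """Request one SysRq action, e.g. 't' (task dump) or 'c' (crash)."""
    if len(key) != 1:
        raise ValueError("SysRq key must be a single character")
    with open(path, "w") as f:
        f.write(key)
```

For example, `enable_sysrq()` followed by `trigger_sysrq("t")` asks the kernel to dump the task list to the kernel log. `trigger_sysrq("c")` deliberately crashes the machine, so it should only be used on a test system.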

Answered by Courtney Schwartz

The answer depends on your hardware.

Non-maskable interrupts (NMI) can be triggered in two ways: 1) when the kernel reaches a halting state that can't be interrupted by another method, and 2) by hardware -- using an NMI button.

On the front of some Dell servers, for example, you will see a small circle with a zig-zag line inside it. This is the NMI symbol. Nearby there is a hole. Insert a pin to trigger the interrupt. If your kernel is built to support it, this will dump a kernel panic trace to the console, then reboot the system.

This can happen very fast. So if you don't have a console attached to save the output to a file, it might look like only a reboot.
