Linux: What is an uninterruptible process?

Notice: this page is a Chinese-English side-by-side translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/223644/

Time: 2020-08-03 16:38:06  Source: igfitidea

What is an uninterruptible process?

linux scheduling preemption

Asked by Jason Baker

Sometimes when I write a program in Linux and it crashes because of a bug of some sort, it becomes an uninterruptible process and keeps running until I restart my computer (even if I log out). My questions are:

  • What causes a process to become uninterruptible?
  • How do I stop that from happening?
  • This is probably a dumb question, but is there any way to interrupt it without restarting my computer?

Accepted answer by ddaa

An uninterruptible process is a process which happens to be in a system call (kernel function) that cannot be interrupted by a signal.

To understand what that means, you need to understand the concept of an interruptible system call. The classic example is read(). This is a system call that can take a long time (seconds) since it can potentially involve spinning up a hard drive, or moving heads. During most of this time, the process will be sleeping, blocking on the hardware.

While the process is sleeping in the system call, it can receive an asynchronous Unix signal (say, SIGTERM). When that happens:

  • The system call exits prematurely, and is set up to return -EINTR to userspace.
  • The signal handler is executed.
  • If the process is still running, it gets the return value from the system call, and it can make the same call again.

Returning early from the system call enables the user space code to immediately alter its behaviour in response to the signal. For example, terminating cleanly in reaction to SIGINT or SIGTERM.
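
The retry step described above can be sketched in user space. This is only an illustration: `retry_on_eintr` and `flaky_read` are hypothetical names, and `flaky_read` merely simulates a call that is interrupted twice. Note that Python 3.5 and later already retries most interrupted system calls automatically (PEP 475), so an explicit loop like this is mainly needed in C or in older code.

```python
import errno


def retry_on_eintr(call, *args):
    """Run call(*args), reissuing it whenever it fails with EINTR."""
    while True:
        try:
            return call(*args)
        except InterruptedError:
            # InterruptedError is how Python reports errno EINTR: the call
            # returned early because a signal arrived during the sleep.
            # Make the same call again, as the answer describes.
            continue


attempts = []


def flaky_read():
    """Hypothetical stand-in for a read() that is interrupted twice."""
    attempts.append(1)
    if len(attempts) < 3:
        raise InterruptedError(errno.EINTR, "Interrupted system call")
    return b"data"


result = retry_on_eintr(flaky_read)
```

A program that wants to shut down on SIGINT or SIGTERM would check a flag set by its signal handler before retrying, instead of looping unconditionally.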

On the other hand, some system calls are not allowed to be interrupted in this way. If the system call stalls for some reason, the process can remain in this unkillable state indefinitely.

LWN ran a nice article that touched on this topic in July.

To answer the original question:

  • How to prevent this from happening: figure out which driver is causing you trouble, and either stop using it, or become a kernel hacker and fix it.

  • How to kill an uninterruptible process without rebooting: somehow make the system call terminate. Frequently the most effective manner to do this without hitting the power switch is to pull the power cord. You can also become a kernel hacker and make the driver use TASK_KILLABLE, as explained in the LWN article.

Answer by ADEpt

If you are talking about a "zombie" process (which is designated as "zombie" in ps output), then this is a harmless record in the process list waiting for someone to collect its return code and it could be safely ignored.
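
To see what such a record looks like, here is a minimal sketch that assumes Linux with /proc mounted. The helper names `proc_state` and `make_and_reap_zombie` are made up for this example. The child exits at once, shows up with state "Z" until the parent collects its return code, and disappears after `waitpid()`.

```python
import os
import time


def proc_state(pid):
    """Return the one-letter state from /proc/<pid>/stat (Linux only)."""
    with open(f"/proc/{pid}/stat") as f:
        # The state is the first field after the parenthesised command name.
        return f.read().rsplit(")", 1)[1].split()[0]


def make_and_reap_zombie():
    pid = os.fork()
    if pid == 0:
        os._exit(7)  # child: exit immediately with return code 7

    # Parent: until waitpid() is called, the dead child stays listed as "Z".
    deadline = time.monotonic() + 5.0
    state = proc_state(pid)
    while state != "Z" and time.monotonic() < deadline:
        time.sleep(0.01)
        state = proc_state(pid)

    # Collect the return code; this removes the zombie entry.
    _, status = os.waitpid(pid, 0)
    return state, os.WEXITSTATUS(status)
```

Calling `make_and_reap_zombie()` returns the state observed before reaping together with the child's exit code.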

Could you please describe what an "uninterruptible process" is for you? Does it survive "kill -9" and happily chug along? If that is the case, then it's stuck on some syscall, which is stuck in some driver, and you are stuck with this process till reboot (and sometimes it's better to reboot soon) or until the relevant driver is unloaded (which is unlikely to happen). You could try to use "strace" to find out where your process is stuck and avoid it in the future.
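
Besides strace, /proc can give a rough hint about where a task is sleeping. The sketch below assumes Linux; `read_proc_file` and `where_is_it_stuck` are hypothetical helper names. The `wchan` and `syscall` files are documented in proc(5), but they may be unreadable without sufficient privileges or absent on some kernel configurations, in which case the sketch reports None.

```python
import os


def read_proc_file(pid, name):
    """Return the stripped contents of /proc/<pid>/<name>, or None if unreadable."""
    try:
        with open(f"/proc/{pid}/{name}") as f:
            return f.read().strip()
    except OSError:
        return None


def where_is_it_stuck(pid):
    """Collect a few hints about what a process is currently waiting on."""
    stat = read_proc_file(pid, "stat")
    # The state letter is the first field after the parenthesised command name.
    state = stat.rsplit(")", 1)[1].split()[0] if stat else None
    return {
        "state": state,                             # "D" = uninterruptible sleep
        "wchan": read_proc_file(pid, "wchan"),      # kernel symbol it sleeps in
        "syscall": read_proc_file(pid, "syscall"),  # syscall number and arguments
    }
```

For example, `where_is_it_stuck(1234)` (with 1234 standing in for the PID of the stuck process) returns a dictionary with the three hints, each possibly None.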

Answer by CesarB

When a process is in user mode, it can be interrupted at any time (switching to kernel mode). When the kernel returns to user mode, it checks if there are any signals pending (including the ones which are used to kill the process, such as SIGTERM and SIGKILL). This means a process can be killed only on return to user mode.
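
The idea of a pending signal can be made visible from user space. The sketch below is only an analogy, not the kernel path described above: it keeps SIGUSR1 pending by blocking it explicitly, and the function name `pending_then_delivered` is made up. It assumes a Unix system and must run in the main thread, because Python only installs and runs signal handlers there.

```python
import signal
import threading


def pending_then_delivered():
    """Return (was_pending, handler_ran_while_blocked, handler_ran_after_unblock)."""
    received = []
    previous = signal.signal(signal.SIGUSR1,
                             lambda signum, frame: received.append(signum))
    old_mask = signal.pthread_sigmask(signal.SIG_BLOCK, {signal.SIGUSR1})
    try:
        # Send the signal to this thread. It is blocked, so it stays pending.
        signal.pthread_kill(threading.get_ident(), signal.SIGUSR1)
        was_pending = signal.SIGUSR1 in signal.sigpending()
        ran_while_blocked = bool(received)
    finally:
        # Restoring the old mask lets the pending signal be delivered.
        signal.pthread_sigmask(signal.SIG_SETMASK, old_mask)
    ran_after_unblock = bool(received)
    if previous is not None:
        signal.signal(signal.SIGUSR1, previous)  # put the old handler back
    return was_pending, ran_while_blocked, ran_after_unblock
```

The signal is recorded first and acted on later, which is the same two-step pattern the kernel follows when it defers delivery until the return to user mode.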

The reason a process cannot be killed in kernel mode is that it could potentially corrupt the kernel structures used by all the other processes in the same machine (the same way killing a thread can potentially corrupt data structures used by other threads in the same process).

When the kernel needs to do something which could take a long time (waiting on a pipe written by another process or waiting for the hardware to do something, for instance), it sleeps by marking itself as sleeping and calling the scheduler to switch to another process (if there is no non-sleeping process, it switches to a "dummy" process which tells the cpu to slow down a bit and sits in a loop — the idle loop).

If a signal is sent to a sleeping process, it has to be woken up before it will return to user space and thus process the pending signal. Here we have the difference between the two main types of sleep:

  • TASK_INTERRUPTIBLE, the interruptible sleep. If a task is marked with this flag, it is sleeping, but can be woken by signals. This means the code which marked the task as sleeping is expecting a possible signal, and after it wakes up will check for it and return from the system call. After the signal is handled, the system call can potentially be automatically restarted (and I won't go into details on how that works).
  • TASK_UNINTERRUPTIBLE, the uninterruptible sleep. If a task is marked with this flag, it is not expecting to be woken up by anything other than whatever it is waiting for, either because it cannot easily be restarted, or because programs are expecting the system call to be atomic. This can also be used for sleeps known to be very short.

TASK_KILLABLE (mentioned in the LWN article linked to by ddaa's answer) is a new variant.

This answers your first question. As to your second question: you can't avoid uninterruptible sleeps, they are a normal thing (it happens, for instance, every time a process reads/writes from/to the disk); however, they should last only a fraction of a second. If they last much longer, it usually means a hardware problem (or a device driver problem, which looks the same to the kernel), where the device driver is waiting for the hardware to do something which will never happen. It can also mean you are using NFS and the NFS server is down (it is waiting for the server to recover; you can also use the "intr" option to avoid the problem).
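
Since short uninterruptible sleeps are normal, a useful check is whether a task stays in that state. The following sketch assumes Linux with /proc; `parse_stat`, `tasks_in_d_state`, and `stuck_tasks` are hypothetical names. It samples the process list twice and reports processes seen in state "D" both times. A process caught twice may simply have been in two separate short sleeps, and some kernel threads sit in this state legitimately, so treat the result as a hint rather than a diagnosis.

```python
import os
import time


def parse_stat(text):
    """Split a /proc/<pid>/stat line into (pid, command, state)."""
    head, tail = text.rsplit(")", 1)  # the command may contain spaces or ")"
    pid, comm = head.split(" (", 1)
    return int(pid), comm, tail.split()[0]


def tasks_in_d_state():
    """Return {pid: command} for every process currently in state "D"."""
    found = {}
    if not os.path.isdir("/proc"):
        return found
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/stat") as f:
                pid, comm, state = parse_stat(f.read())
        except (OSError, ValueError):
            continue  # the process exited while we were looking
        if state == "D":
            found[pid] = comm
    return found


def stuck_tasks(interval=1.0):
    """Report processes seen in state "D" both before and after `interval` seconds."""
    before = tasks_in_d_state()
    time.sleep(interval)
    after = tasks_in_d_state()
    return {pid: comm for pid, comm in after.items() if pid in before}
```

Running `stuck_tasks(interval=5.0)` on a healthy machine usually returns an empty dictionary.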

Finally, the reason you cannot recover is the same reason the kernel waits until return to user mode to deliver a signal or kill the process: it would potentially corrupt the kernel's data structures (code waiting on an interruptible sleep can receive an error which tells it to return to user space, where the process can be killed; code waiting on an uninterruptible sleep is not expecting any error).

Answer by MarkR

Uninterruptible processes are USUALLY waiting for I/O following a page fault.

Consider this:

  • The thread tries to access a page which is not in core (either an executable which is demand-loaded, a page of anonymous memory which has been swapped out, or a mmap()'d file which is demand loaded, which are much the same thing)
  • The kernel is now (trying to) load it in
  • The process can't continue until the page is available.

The process/task cannot be interrupted in this state, because it can't handle any signals; if it did, another page fault would happen and it would be back where it was.

When I say "process", I really mean "task", which under Linux (2.6) roughly translates to "thread", which may or may not have an individual "thread group" entry in /proc.
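
The per-thread view can be inspected under /proc/<pid>/task, where every thread has its own entry and its own state letter. A minimal sketch, assuming Linux and Python 3.8 or later (for `threading.get_native_id`); `task_states` and `demo` are hypothetical names.

```python
import os
import threading


def task_states(pid="self"):
    """Map each task (thread) ID of a process to its one-letter state."""
    states = {}
    for tid in os.listdir(f"/proc/{pid}/task"):
        try:
            with open(f"/proc/{pid}/task/{tid}/stat") as f:
                states[int(tid)] = f.read().rsplit(")", 1)[1].split()[0]
        except OSError:
            continue  # the thread exited while we were looking
    return states


def demo():
    """Start two idle threads and list the tasks of the current process."""
    stop = threading.Event()
    ready = threading.Barrier(3)  # two workers plus the calling thread
    tids = []

    def worker():
        tids.append(threading.get_native_id())  # kernel task ID of this thread
        ready.wait()
        stop.wait()

    workers = [threading.Thread(target=worker) for _ in range(2)]
    for t in workers:
        t.start()
    ready.wait()  # both workers have recorded their task IDs
    states = task_states()
    stop.set()
    for t in workers:
        t.join()
    return tids, states
```

Each worker's task ID appears as its own key in the result, so one thread of a process can be in uninterruptible sleep while the others keep running.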

In some cases, it may be waiting for a long time. A typical example of this would be where the executable or mmap'd file is on a network filesystem where the server has failed. If the I/O eventually succeeds, the task will continue. If it eventually fails, the task will generally get a SIGBUS or something.

Answer by Ron Granger

To your 3rd question: I think you can kill the uninterruptible processes by running sudo kill -HUP 1. It will restart init without ending the running processes, and after running it, my uninterruptible processes were gone.
