Bash Pipe Handling
Notice: this page reproduces a popular StackOverflow question under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverFlow
Original URL: http://stackoverflow.com/questions/19122/
Asked by num1
Does anyone know how bash handles sending data through pipes?
cat file.txt | tail -20
Does this command print all the contents of file.txt into a buffer, which is then read by tail? Or does this command, say, print the contents of file.txt line by line, and then pause at each line for tail to process, and then ask for more data?
The reason I ask is that I'm writing a program on an embedded device that basically performs a sequence of operations on some chunk of data, where the output of one operation is sent off as the input of the next operation. I would like to know how Linux (bash) handles this, so please give me a general answer, not specifically what happens when I run "cat file.txt | tail -20".
EDIT: Shog9 pointed out a relevant Wikipedia article. This didn't lead me directly to the article, but it helped me find this: http://en.wikipedia.org/wiki/Pipeline_%28Unix%29#Implementation which did have the information I was looking for.
I'm sorry for not making myself clear. Of course you're using a pipe and of course you're using stdin and stdout of the respective parts of the command. I had assumed that was too obvious to state.
What I'm asking is how this is handled/implemented. Since both programs cannot run at once, how is data sent from stdin to stdout? What happens if the first program generates data significantly faster than the second program? Does the system just run the first command until either it's terminated or its stdout buffer is full, and then move on to the next program, and so on in a loop until no more data is left to be processed, or is there a more complicated mechanism?
Answered by postfuturist
I decided to write a slightly more detailed explanation.
The "magic" here lies in the operating system. Both programs start at roughly the same time and run concurrently, just like every other process running on your computer (including the terminal application); the operating system assigns each of them slices of processor time. So, before any data gets passed, each process does whatever initialization it needs. In your example, tail parses the '-20' argument, and cat parses the 'file.txt' argument and opens the file.
At some point tail will need input, and it will tell the operating system that it is waiting for input. At some other point (before or after, it doesn't matter) cat will start passing data to the operating system through stdout. This data goes into a buffer in the operating system. The next time tail gets a time slice after cat has put some data into the buffer, it retrieves some (or all) of that data, which removes it from the operating system's buffer. When the buffer is empty, tail has to wait for cat to output more data. If cat outputs data much faster than tail handles it, the buffer fills up; since the buffer has a fixed maximum size, cat then has to wait until tail has read some of the data.
cat will eventually finish outputting data while tail is still processing, so cat exits and tail processes whatever remains in the buffer. The operating system signals tail with an EOF when there is no more incoming data. In this case, tail is probably just receiving all the data into a circular buffer of 20 lines, and when the operating system tells it that there is no more incoming data, it dumps the last twenty lines to its own stdout, which is displayed in the terminal. Since tail is a much simpler program than cat, it will likely spend most of its time waiting for cat to put data into the buffer.
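The flow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how cat or tail are actually written: it assumes a POSIX system (for os.fork), and the function name tail_from_pipe and the "line N" test data are made up for the example. The child process plays the role of cat, the parent plays the role of tail with a 20-line circular buffer, and the loop ends when the kernel reports EOF.

```python
import os
from collections import deque


def tail_from_pipe(total_lines=1000, keep=20):
    """Hypothetical helper: a child writes lines into a pipe, the parent keeps the last `keep`."""
    read_fd, write_fd = os.pipe()  # the buffer itself lives in the kernel

    pid = os.fork()
    if pid == 0:
        # Child: stands in for cat. It only writes.
        os.close(read_fd)
        with os.fdopen(write_fd, "w") as out:
            for i in range(total_lines):
                out.write(f"line {i}\n")  # blocks whenever the pipe is full
        os._exit(0)  # leaving the with-block closed the write end

    # Parent: stands in for tail. It only reads.
    os.close(write_fd)  # otherwise the pipe never reports EOF to us
    last = deque(maxlen=keep)  # circular buffer of the most recent lines
    with os.fdopen(read_fd) as inp:
        for line in inp:  # blocks while the pipe is empty, stops at EOF
            last.append(line.rstrip("\n"))
    os.waitpid(pid, 0)
    return list(last)


result = tail_from_pipe()
print(result[0], "...", result[-1])
```

Neither side contains any scheduling logic: the waiting happens inside the ordinary read and write calls.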
On a system with multiple processors, the two programs will not just be sharing alternating time slices on the same processor core, but likely running at the same time on separate cores.
To get into a little more detail, if you open a process monitor (these are operating-system specific) such as 'top' on Linux, you will see a whole list of running processes, most of which are using effectively 0% of the processor. Most applications, unless they are crunching data, spend most of their time doing nothing. This is good, because it leaves the processor free for other processes according to their needs. A process ends up off the processor in basically three ways. First, it can make a sleep(n)-style call, which tells the kernel not to give it another time slice until the interval n has passed. Second, and most commonly, a program needs to wait for something from another program, like 'tail' waiting for more data to enter the buffer; in this case the operating system wakes the process up when more data is available. Lastly, the kernel can preempt a process in the middle of execution and give processor time slices to other processes.
'cat' and 'tail' are simple programs. In this example, tail spends most of its time waiting for more data in the buffer, and cat spends most of its time waiting for the operating system to retrieve data from the hard drive. The bottleneck is the speed (or slowness) of the physical medium that the file is stored on. The perceptible delay you might notice when you run this command for the first time is the time it takes the read heads of the disk drive to seek to the position where 'file.txt' is stored. If you run the command a second time, the operating system will likely have the contents of file.txt cached in memory, and you will probably not see any perceptible delay (unless file.txt is very large or the file is no longer cached).
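The claim that a waiting process uses almost no processor time can be checked with a small experiment. The sketch below is an assumption-laden illustration: it uses a thread instead of a second process to keep it short, and the 0.5 second delay and the name slow_writer are arbitrary. It compares wall-clock time with CPU time while a read on an empty pipe is blocked.

```python
import os
import threading
import time

read_fd, write_fd = os.pipe()


def slow_writer():
    """Hypothetical producer that takes a while before it has any data."""
    time.sleep(0.5)
    os.write(write_fd, b"data\n")
    os.close(write_fd)


writer = threading.Thread(target=slow_writer)
writer.start()

wall_start = time.monotonic()
cpu_start = time.process_time()
data = os.read(read_fd, 1024)  # blocks; the kernel wakes us when data arrives
wall_elapsed = time.monotonic() - wall_start
cpu_elapsed = time.process_time() - cpu_start

writer.join()
os.close(read_fd)
print(f"waited {wall_elapsed:.2f}s of wall time, used {cpu_elapsed:.4f}s of CPU time")
```

The wall-clock wait should be close to the writer's delay, while the CPU time should stay near zero, which is the "0% of the processor" behaviour that top shows for idle processes.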
Most operations you do on your computer are I/O bound, which is to say that you are usually waiting for data to come from your hard drive, from a network device, etc.
Answered by David Schlosnagle
Shog9 already referenced the Wikipedia article, but the implementation section has the details you want. The basic implementation is a bounded buffer.
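One way to see that the buffer is bounded is to fill a pipe that nobody reads from. The sketch below is a rough probe, not a specification: the function name pipe_capacity is made up, the 1024-byte chunk size is arbitrary, and the number it reports depends on the operating system and its configuration. The write end is switched to non-blocking mode so that a full pipe raises an error where it would normally put the writer to sleep.

```python
import os


def pipe_capacity(chunk_size=1024):
    """Hypothetical probe: count how many bytes a pipe accepts with no reader draining it."""
    read_fd, write_fd = os.pipe()
    os.set_blocking(write_fd, False)  # a full pipe now raises instead of blocking
    chunk = b"x" * chunk_size
    written = 0
    try:
        while True:
            written += os.write(write_fd, chunk)
    except BlockingIOError:
        pass  # the kernel buffer is full
    finally:
        os.close(read_fd)
        os.close(write_fd)
    return written


capacity = pipe_capacity()
print(f"the pipe accepted {capacity} bytes before it was full")
```

In the default blocking mode the same write would simply wait, which is how a fast producer gets slowed down to the pace of a slow consumer.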
Answered by Mike Stone
cat will just print the data to standard out, which happens to be redirected to the standard in of tail. This can be seen in the man page of bash.
In other words, there is no explicit pausing going on: tail just reads from standard in, and cat just writes to standard out.
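The redirection described here can be reproduced from a program without going through a shell. The sketch below follows the pattern from the Python subprocess documentation for replacing a shell pipeline. It assumes a POSIX system where cat and tail are on the PATH; the function name cat_tail and the "row N" test file are invented for the example.

```python
import os
import subprocess
import tempfile


def cat_tail(path, count=20):
    """Hypothetical helper: roughly what `cat path | tail -n count` sets up."""
    cat = subprocess.Popen(["cat", path], stdout=subprocess.PIPE)
    tail = subprocess.Popen(
        ["tail", "-n", str(count)],
        stdin=cat.stdout,  # tail's standard in is the read end of cat's pipe
        stdout=subprocess.PIPE,
    )
    cat.stdout.close()  # drop our copy so cat gets SIGPIPE if tail exits early
    output, _ = tail.communicate()
    cat.wait()
    return output.decode().splitlines()


# Build a throwaway 100-line input file for the demonstration.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.writelines(f"row {i}\n" for i in range(100))
    path = tmp.name

lines = cat_tail(path)
os.unlink(path)
print(lines[0], "...", lines[-1])
```

Both child processes run unmodified; only their standard file descriptors are connected differently before they start.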

