Why is subprocess.run output different from shell output of same command?

Disclaimer: This page is a translation of a popular Stack Overflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/37734470/

python, c++, python-3.x, subprocess, io-redirection

Asked by user2346536

I am using subprocess.run() for some automated testing. Mostly to automate doing:

dummy.exe < file.txt > foo.txt
diff file.txt foo.txt

If you execute the above redirection in a shell, the two files are always identical. But whenever file.txt is too long, the Python code below does not return the correct result.

This is the Python code:

import subprocess
import sys


def main(argv):

    exe_path = r'dummy.exe'
    file_path = r'file.txt'

    with open(file_path, 'r') as test_file:
        stdin = test_file.read().strip()
        p = subprocess.run([exe_path], input=stdin, stdout=subprocess.PIPE, universal_newlines=True)
        out = p.stdout.strip()
        err = p.stderr
        if stdin == out:
            print('OK')
        else:
            print('failed: ' + out)

if __name__ == "__main__":
    main(sys.argv[1:])

Here is the C++ code in dummy.cc:

#include <iostream>


int main()
{
    int size, count, a, b;
    std::cin >> size;
    std::cin >> count;

    std::cout << size << " " << count << std::endl;


    for (int i = 0; i < count; ++i)
    {
        std::cin >> a >> b;
        std::cout << a << " " << b << std::endl;
    }
}

file.txt can be anything like this:

1 100000
0 417
0 842
0 919
...

The second integer on the first line is the number of lines following, hence here file.txt will be 100,001 lines long.
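
For reference, here is a small sketch that generates an input file in that format (the value ranges and the fixed first integer are assumptions for illustration only):

import random

def write_test_input(path, count=100000):
    # Header line: "<size> <count>", followed by `count` lines of two integers.
    with open(path, 'w') as f:
        f.write('1 {}\n'.format(count))
        for _ in range(count):
            f.write('{} {}\n'.format(random.randint(0, 1000),
                                      random.randint(0, 1000)))

write_test_input('file.txt')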

Question: Am I misusing subprocess.run()?

Edit

My exact Python code, after the comments (newlines, rb) were taken into account:

import subprocess
import sys
import os


def main(argv):

    base_dir = os.path.dirname(__file__)
    exe_path = os.path.join(base_dir, 'dummy.exe')
    file_path = os.path.join(base_dir, 'infile.txt')
    out_path = os.path.join(base_dir, 'outfile.txt')

    with open(file_path, 'rb') as test_file:
        stdin = test_file.read().strip()
        p = subprocess.run([exe_path], input=stdin, stdout=subprocess.PIPE)
        out = p.stdout.strip()
        if stdin == out:
            print('OK')
        else:
            with open(out_path, "wb") as text_file:
                text_file.write(out)

if __name__ == "__main__":
    main(sys.argv[1:])

Here is the first diff:

[screenshot: diff between the input file and the program's output]

Here is the input file: https://drive.google.com/open?id=0B--mU_EsNUGTR3VKaktvQVNtLTQ

Accepted answer by jfs

To reproduce the shell command:

subprocess.run("dummy.exe < file.txt > foo.txt", shell=True, check=True)

without the shell in Python:

with open('file.txt', 'rb', 0) as input_file, \
     open('foo.txt', 'wb', 0) as output_file:
    subprocess.run(["dummy.exe"], stdin=input_file, stdout=output_file, check=True)

It works with arbitrarily large files.
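
To cover the diff step of the original workflow as well, here is a minimal sketch, assuming a plain byte-for-byte comparison of the two files is what you want:

import filecmp

# Compare file contents, not just os.stat() metadata.
if filecmp.cmp('file.txt', 'foo.txt', shallow=False):
    print('OK')
else:
    print('failed: file.txt and foo.txt differ')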

You could use subprocess.check_call() in this case (available since Python 2) instead of subprocess.run(), which is available only in Python 3.5+.
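
For example, a sketch of the same redirection using subprocess.check_call(), so it also runs on Pythons older than 3.5:

import subprocess

with open('file.txt', 'rb', 0) as input_file, \
     open('foo.txt', 'wb', 0) as output_file:
    # check_call raises CalledProcessError on a non-zero exit code,
    # just like check=True does with subprocess.run().
    subprocess.check_call(["dummy.exe"], stdin=input_file, stdout=output_file)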

Works very well, thanks. But then why was the original failing? Pipe buffer size, as in Kevin's answer?

It has nothing to do with OS pipe buffers. The warning from the subprocess docs that @Kevin J. Chase cites is unrelated to subprocess.run(). You should care about OS pipe buffers only if you use process = Popen() and manually read()/write() via multiple pipe streams (process.stdin/.stdout/.stderr).
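
For contrast, here is a rough sketch of the kind of code that warning does target, driving both pipe ends by hand; it is illustrative only and can deadlock once the data no longer fits in the OS pipe buffers:

import subprocess

with open('file.txt', 'rb') as f:
    data = f.read()

p = subprocess.Popen(["dummy.exe"],
                     stdin=subprocess.PIPE, stdout=subprocess.PIPE)
# Danger: once the child fills its stdout pipe it stops reading stdin,
# and this write then blocks forever because nothing is draining stdout.
p.stdin.write(data)
p.stdin.close()
out = p.stdout.read()
p.wait()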

It turns out that the observed behavior is due to a Windows bug in the Universal CRT. Here's the same issue reproduced without Python: Why would redirection work where piping fails?

As stated in the bug description, to work around it:

  • "use a binary pipe and do text mode CRLF => LF translation manually on the reader side" or use ReadFile() directly instead of std::cin
  • or wait for Windows 10 update this summer (where the bug should be fixed)
  • or use a different C++ compiler; e.g., there is no issue if you use g++ on Windows

The bug affects only text pipes, i.e., code that uses < and > should be fine (stdin=input_file, stdout=output_file should still work, or it is some other bug).

Answer by Kevin J. Chase

I'll start with a disclaimer: I don't have Python 3.5 (so I can't use the run function), and I wasn't able to reproduce your problem on Windows (Python 3.4.4) or Linux (3.1.6). That said...

Problems with subprocess.PIPE and Family

The subprocess.run docs say that it's just a front-end for the old subprocess.Popen-and-communicate() technique. The subprocess.Popen.communicate docs warn that:

The data read is buffered in memory, so do not use this method if the data size is large or unlimited.

This sure sounds like your problem. Unfortunately, the docs don't say how much data is "large", nor what will happen after "too much" data is read. Just "don't do that, then".

The docs for subprocess.call go into a little more detail (emphasis mine)...

Do not use stdout=PIPE or stderr=PIPE with this function. The child process will block if it generates enough output to a pipe to fill up the OS pipe buffer as the pipes are not being read from.

...as do the docs for subprocess.Popen.wait:

This will deadlock when using stdout=PIPE or stderr=PIPE and the child process generates enough output to a pipe such that it blocks waiting for the OS pipe buffer to accept more data. Use Popen.communicate() when using pipes to avoid that.

That sure sounds like Popen.communicate is the solution to this problem, but communicate's own docs say "do not use this method if the data size is large" --- exactly the situation where the wait docs tell you to use communicate. (Maybe it "avoid(s) that" by silently dropping data on the floor?)

Frustratingly, I don't see any way to use subprocess.PIPE safely, unless you're sure you can read from it faster than your child process writes to it.

On that note...

Alternative: tempfile.TemporaryFile

You're holding all your data in memory... twice, in fact. That can't be efficient, especially if it's already in a file.

If you're allowed to use a temporary file, you can compare the two files very easily, one line at a time. This avoids all the subprocess.PIPE mess, and it's much faster, because it only uses a little bit of RAM at a time. (The IO from your subprocess might be faster, too, depending on how your operating system handles output redirection.)

Again, I can't test run, so here's a slightly older Popen-and-communicate solution (minus main and the rest of your setup):

import io
import subprocess
import tempfile

def are_text_files_equal(file0, file1):
    '''
    Both files must be opened in "update" mode ('+' character), so
    they can be rewound to their beginnings.  Both files will be read
    until just past the first differing line, or to the end of the
    files if no differences were encountered.
    '''
    file0.seek(io.SEEK_SET)
    file1.seek(io.SEEK_SET)
    for line0, line1 in zip(file0, file1):
        if line0 != line1:
            return False
    # Both files were identical to this point.  See if either file
    # has more data.
    next0 = next(file0, '')
    next1 = next(file1, '')
    if next0 or next1:
        return False
    return True

def compare_subprocess_output(exe_path, input_path):
    with tempfile.TemporaryFile(mode='w+t', encoding='utf8') as temp_file:
        with open(input_path, 'r+t') as input_file:
            p = subprocess.Popen(
              [exe_path],
              stdin=input_file,
              stdout=temp_file,  # No more PIPE.
              stderr=subprocess.PIPE,  # <sigh>
              universal_newlines=True,
              )
            err = p.communicate()[1]  # No need to store output.
            # Compare input and output files...  This must be inside
            # the `with` block, or the TemporaryFile will close before
            # we can use it.
            if are_text_files_equal(temp_file, input_file):
                print('OK')
            else:
                print('Failed: ' + str(err))
    return

Unfortunately, since I can't reproduce your problem, even with a million-line input, I can't tell if this works. If nothing else, it ought to give you wrong answers faster.

Variant: Regular File

If you want to keep the output of your test run in foo.txt (from your command-line example), then you would direct your subprocess' output to a normal file instead of a TemporaryFile. This is the solution recommended in J.F. Sebastian's answer.

I can't tell from your question if you wanted foo.txt, or if it was just a side effect of the two-step test-then-diff --- your command-line example saves test output to a file, while your Python script doesn't. Saving the output would be handy if you ever want to investigate a test failure, but it requires coming up with a unique filename for each test you run, so they don't overwrite each other's output (one way to do that is sketched below).
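
A rough sketch of that variant, writing each test's output to its own regular file; the naming scheme here is just one hypothetical choice:

import os
import subprocess

def run_one_test(exe_path, input_path):
    # Hypothetical naming scheme: derive the output name from the input name,
    # so different tests don't clobber each other's output.
    output_path = os.path.splitext(input_path)[0] + '.out.txt'
    with open(input_path, 'rb', 0) as input_file, \
         open(output_path, 'wb', 0) as output_file:
        # Python 3.5+; use subprocess.check_call() on older versions.
        subprocess.run([exe_path], stdin=input_file, stdout=output_file,
                       check=True)
    return output_path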
