
Warning: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same license and attribute it to the original authors (not me): StackOverflow, original question: http://stackoverflow.com/questions/3991104/

Date: 2020-08-18 13:43:16 · Source: igfitidea

Very large input and piping using subprocess.Popen

Tags: python, subprocess, popen

Asked by seandavi

I have a pretty simple problem. I have a large file that goes through three steps: a decoding step using an external program, some processing in Python, and then re-encoding using another external program. I have been using subprocess.Popen() to try to do this in Python rather than forming unix pipes. However, all the data are buffered in memory. Is there a pythonic way of doing this task, or am I best dropping back to a simple Python script that reads from stdin and writes to stdout, with unix pipes on either side?


import os, sys, subprocess

def main(infile,reflist):
    print infile,reflist
    samtoolsin = subprocess.Popen(["samtools","view",infile],
                                  stdout=subprocess.PIPE,bufsize=1)
    samtoolsout = subprocess.Popen(["samtools","import",reflist,"-",
                                    infile+".tmp"],stdin=subprocess.PIPE,bufsize=1)
    for line in samtoolsin.stdout.read():  # note: .read() returns the whole output as one string, so this iterates character by character
        if(line.startswith("@")):
            samtoolsout.stdin.write(line)
        else:
            linesplit = line.split("\t")
            if(linesplit[10]=="*"):
                linesplit[9]="*"
            samtoolsout.stdin.write("\t".join(linesplit))

Accepted answer by user470379

Popen has a bufsize parameter that will limit the size of the buffer in memory. If you don't want the files in memory at all, you can pass file objects as the stdin and stdout parameters. From the subprocess docs:


bufsize, if given, has the same meaning as the corresponding argument to the built-in open() function: 0 means unbuffered, 1 means line buffered, any other positive value means use a buffer of (approximately) that size. A negative bufsize means to use the system default, which usually means fully buffered. The default value for bufsize is 0 (unbuffered).

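For illustration, here is a minimal sketch of that file-object approach (the filenames are hypothetical); the operating system moves the bytes between the files and the child process, so nothing is buffered in Python at all:

import subprocess

# Both ends are plain file objects rather than PIPE, so the data streams
# through the child process without ever passing through Python.
with open("input.bam", "rb") as fin, open("output.sam", "wb") as fout:
    p = subprocess.Popen(["samtools", "view", "-"], stdin=fin, stdout=fout)
    p.wait()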

Answered by André Caron

However, all the data are buffered to memory ...


Are you using subprocess.Popen.communicate()? By design, this function will wait for the process to finish, all the while accumulating the data in a buffer, and then return it to you. As you've pointed out, this is problematic if dealing with very large files.


If you want to process the data while it is generated, you will need to write a loop using the poll() and .stdout.read() methods, then write that output to another socket/file/etc.

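A minimal sketch of such a loop (the input filename, the 4 KB chunk size, and the process() handler are all placeholders):

import subprocess

def process(chunk):
    pass  # placeholder: handle each chunk of output here

p = subprocess.Popen(["samtools", "view", "input.bam"],
                     stdout=subprocess.PIPE)
while True:
    chunk = p.stdout.read(4096)  # blocks until data arrives or EOF
    if not chunk:                # an empty read means the pipe has closed
        break
    process(chunk)
p.wait()                         # reap the child; poll() checks without blocking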

Do be sure to notice the warnings in the documentation against doing this, as it is easy to end up with a deadlock (the parent process waits for the child process to generate data, while the child is in turn waiting for the parent process to empty the pipe buffer).

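One common way around that deadlock, sketched below with sort purely as a stand-in command, is to feed the child's stdin from a background thread so the parent is free to keep draining the child's stdout at the same time:

import subprocess
import threading

def feed(pipe, lines):
    for line in lines:
        pipe.write(line)
    pipe.close()  # signal EOF so the child can finish

p = subprocess.Popen(["sort"], stdin=subprocess.PIPE,
                     stdout=subprocess.PIPE)
writer = threading.Thread(target=feed, args=(p.stdin, [b"b\n", b"a\n"]))
writer.start()
results = []
for line in p.stdout:  # the parent drains stdout while the thread writes stdin
    results.append(line)
writer.join()
p.wait()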

Answered by anijhaw

Try making this small change and see if the efficiency is better.


    for line in samtoolsin.stdout:
        if(line.startswith("@")):
            samtoolsout.stdin.write(line)
        else:
            linesplit = line.split("\t")
            if(linesplit[10]=="*"):
                linesplit[9]="*"
            samtoolsout.stdin.write("\t".join(linesplit))

Answered by seandavi

I was using the .read() method on the stdout stream, which returns the entire output as a single string (so the loop above was actually iterating character by character). Instead, I simply needed to iterate over the stream directly in the for loop. The corrected code does what I expected.


#!/usr/bin/env python
import os
import sys
import subprocess

def main(infile,reflist):
    print infile,reflist
    samtoolsin = subprocess.Popen(["samtools","view",infile],
                                  stdout=subprocess.PIPE,bufsize=1)
    samtoolsout = subprocess.Popen(["samtools","import",reflist,"-",
                                    infile+".tmp"],stdin=subprocess.PIPE,bufsize=1)
    for line in samtoolsin.stdout:
        if(line.startswith("@")):
            samtoolsout.stdin.write(line)
        else:
            linesplit = line.split("\t")
            if(linesplit[10]=="*"):
                linesplit[9]="*"
            samtoolsout.stdin.write("\t".join(linesplit))

Answered by mauricio777

Trying to do some basic shell piping with very large input in Python:


svnadmin load /var/repo < r0-100.dump

I found the simplest way to get this working even with large (2-5GB) files was:


subprocess.check_output('svnadmin load %s < %s' % (repo, fname), shell=True)

I like this method because it's simple and you can do standard shell redirection.

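If you would rather avoid shell=True, a roughly equivalent sketch (using the same repo and fname variables as above) is to pass the dump file object directly as stdin, which gives the same redirection without invoking a shell:

import subprocess

# The open file object stands in for the shell's "< r0-100.dump" redirection.
with open(fname, "rb") as dump:
    subprocess.check_call(["svnadmin", "load", repo], stdin=dump)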

I tried going the Popen route to run a redirect:


cmd = 'svnadmin load %s' % repo
p = Popen(cmd, stdin=PIPE, stdout=PIPE, shell=True)
with open(fname) as inline:
    for line in inline:
        # communicate() writes once, closes stdin, and waits for the
        # process to exit, so the second iteration fails
        p.communicate(input=line)

But that broke with large files. Using:


p.stdin.write() 

Also broke with very large files.

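For completeness, here is a chunked-write sketch that avoids both failure modes: stdout goes straight to a file, so there is no second pipe for the parent to drain, and stdin is fed in bounded pieces (the 1 MB chunk size and the log filename are arbitrary):

import subprocess

with open(fname, "rb") as dump, open("svnadmin.log", "wb") as log:
    p = subprocess.Popen(["svnadmin", "load", repo],
                         stdin=subprocess.PIPE, stdout=log)
    while True:
        chunk = dump.read(1024 * 1024)  # feed stdin 1 MB at a time
        if not chunk:
            break
        p.stdin.write(chunk)
    p.stdin.close()  # EOF lets svnadmin finish the load
    p.wait()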