Download large file in python with requests

Disclaimer: this page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/16694907/

Date: 2020-08-18 23:23:25  Source: igfitidea


python, download, stream, python-requests

Asked by Roman Podlinov

Requests is a really nice library. I'd like to use it to download big files (>1GB). The problem is that it's not possible to keep the whole file in memory; I need to read it in chunks. And this is the problem with the following code:


import requests

def DownloadFile(url):
    local_filename = url.split('/')[-1]
    r = requests.get(url)
    f = open(local_filename, 'wb')
    for chunk in r.iter_content(chunk_size=512 * 1024): 
        if chunk: # filter out keep-alive new chunks
            f.write(chunk)
    f.close()
    return 

For some reason it doesn't work this way: it still loads the whole response into memory before saving it to a file.


UPDATE


If you need a small client (Python 2.x/3.x) that can download big files from FTP, you can find it here. It supports multithreading and reconnects (it monitors connections), and it also tunes socket parameters for the download task.

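The client itself isn't reproduced here, but a rough sketch of the core idea, resuming a chunked FTP download after a dropped connection using nothing but the standard-library ftplib, could look like this (the ftp_download helper and its parameters are illustrative, not part of the client mentioned above):

import ftplib
import os

def ftp_download(host, remote_path, local_path, user='anonymous', passwd='', retries=3):
    # Resume-capable FTP download: after a dropped connection, reconnect and
    # continue from the current local file size via the REST offset.
    for attempt in range(retries):
        offset = os.path.getsize(local_path) if os.path.exists(local_path) else 0
        try:
            with ftplib.FTP(host) as ftp, open(local_path, 'ab') as f:
                ftp.login(user, passwd)
                ftp.retrbinary('RETR ' + remote_path,
                               f.write,
                               blocksize=64 * 1024,
                               rest=offset or None)
            return local_path
        except (ftplib.error_temp, OSError):
            if attempt == retries - 1:
                raise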

Accepted answer by Roman Podlinov

With the following streaming code, the Python memory usage is restricted regardless of the size of the downloaded file:


import requests

def download_file(url):
    local_filename = url.split('/')[-1]
    # NOTE the stream=True parameter below
    with requests.get(url, stream=True) as r:
        r.raise_for_status()
        with open(local_filename, 'wb') as f:
            for chunk in r.iter_content(chunk_size=8192): 
                # If the response is chunk-encoded, uncomment the if below
                # and set chunk_size to None.
                #if chunk: 
                f.write(chunk)
    return local_filename

Note that the number of bytes returned using iter_content is not exactly the chunk_size; it's expected to be a random number that is often far bigger, and is expected to be different in every iteration.

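If you want to track progress, count the bytes you actually receive instead of assuming every chunk is exactly chunk_size. A minimal sketch (the Content-Length handling is an assumption; not every server sends that header):

import requests

def download_with_progress(url, local_filename):
    with requests.get(url, stream=True) as r:
        r.raise_for_status()
        total = int(r.headers.get('Content-Length', 0))  # 0 if the header is missing
        received = 0
        with open(local_filename, 'wb') as f:
            for chunk in r.iter_content(chunk_size=8192):
                f.write(chunk)
                received += len(chunk)  # the real chunk length, not chunk_size
                if total:
                    print('\r%d / %d bytes' % (received, total), end='')
    print()
    return local_filename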

See https://requests.readthedocs.io/en/latest/user/advanced/#body-content-workflow and https://requests.readthedocs.io/en/latest/api/#requests.Response.iter_content for further reference.


Answer by danodonovan

Your chunk size could be too large; have you tried dropping that to, say, 1024 bytes at a time? (Also, you could use with to tidy up the syntax.)


def DownloadFile(url):
    local_filename = url.split('/')[-1]
    r = requests.get(url)
    with open(local_filename, 'wb') as f:
        for chunk in r.iter_content(chunk_size=1024): 
            if chunk: # filter out keep-alive new chunks
                f.write(chunk)
    return 

Incidentally, how are you deducing that the response has been loaded into memory?


It sounds as if Python isn't flushing the data to the file. Based on other SO questions, you could try f.flush() and os.fsync() to force the file write and free memory:


    # note: os.fsync requires `import os`
    with open(local_filename, 'wb') as f:
        for chunk in r.iter_content(chunk_size=1024): 
            if chunk: # filter out keep-alive new chunks
                f.write(chunk)
                f.flush()
                os.fsync(f.fileno())
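
To actually check whether the whole response is being held in memory, one option on Linux is the peak resident set size from the standard-library resource module; a hedged sketch (note that ru_maxrss is reported in kilobytes on Linux but in bytes on macOS):

import resource

def peak_rss_mib():
    # Peak resident set size of the current process (Linux reports KiB).
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.0

print('peak RSS: %.1f MiB' % peak_rss_mib())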

Answer by John Zwinck

It's much easier if you use Response.raw and shutil.copyfileobj():


import requests
import shutil

def download_file(url):
    local_filename = url.split('/')[-1]
    with requests.get(url, stream=True) as r:
        with open(local_filename, 'wb') as f:
            shutil.copyfileobj(r.raw, f)

    return local_filename

This streams the file to disk without using excessive memory, and the code is simple.

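If you want larger reads per copy step, shutil.copyfileobj also accepts a buffer-size argument; a small sketch (the 1 MiB value and the function name are just illustrative):

import requests
import shutil

def download_file_buffered(url, buffer_size=1024 * 1024):
    local_filename = url.split('/')[-1]
    with requests.get(url, stream=True) as r:
        r.raise_for_status()
        with open(local_filename, 'wb') as f:
            # The third argument is the copy buffer size in bytes.
            shutil.copyfileobj(r.raw, f, buffer_size)
    return local_filename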

Answer by x-yuri

Not exactly what OP was asking, but... it's ridiculously easy to do that with urllib:


from urllib.request import urlretrieve
url = 'http://mirror.pnl.gov/releases/16.04.2/ubuntu-16.04.2-desktop-amd64.iso'
dst = 'ubuntu-16.04.2-desktop-amd64.iso'
urlretrieve(url, dst)

Or this way, if you want to save it to a temporary file:


from urllib.request import urlopen
from shutil import copyfileobj
from tempfile import NamedTemporaryFile
url = 'http://mirror.pnl.gov/releases/16.04.2/ubuntu-16.04.2-desktop-amd64.iso'
with urlopen(url) as fsrc, NamedTemporaryFile(delete=False) as fdst:
    copyfileobj(fsrc, fdst)

I watched the process:


watch 'ps -p 18647 -o pid,ppid,pmem,rsz,vsz,comm,args; ls -al *.iso'

And I saw the file growing, but memory usage stayed at 17 MB. Am I missing something?

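For completeness, urlretrieve also accepts a reporthook callback, so progress can be watched from inside Python rather than with ps; a minimal sketch (the progress function is mine):

from urllib.request import urlretrieve

def progress(block_num, block_size, total_size):
    # total_size may be -1 when the server doesn't report a length.
    if total_size > 0:
        done = min(block_num * block_size, total_size)
        print('\r%d / %d bytes' % (done, total_size), end='')

url = 'http://mirror.pnl.gov/releases/16.04.2/ubuntu-16.04.2-desktop-amd64.iso'
urlretrieve(url, 'ubuntu-16.04.2-desktop-amd64.iso', progress)
print()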