
Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/4357073/

Date: 2020-08-25 12:47:38  Source: igfitidea

LAMP: How to create .Zip of large files for the user on the fly, without disk/CPU thrashing

Tags: php, bash, zip, pipe, lamp

Asked by Benji XVI

Often a web service needs to zip up several large files for download by the client. The most obvious way to do this is to create a temporary zip file, then either echo it to the user, or save it to disk and redirect (deleting it some time in the future).


However, doing things that way has drawbacks:


  • an initial phase of intensive CPU and disk thrashing, resulting in...
  • a considerable initial delay to the user while the archive is prepared
  • very high memory footprint per request
  • use of substantial temporary disk space
  • if the user cancels the download half way through, all resources used in the initial phase (CPU, memory, disk) will have been wasted

Solutions like ZipStream-PHP improve on this by shovelling the data into Apache file by file. However, the result is still high memory usage (files are loaded entirely into memory), along with large, thrashy spikes in disk and CPU usage.


In contrast, consider the following bash snippet:


ls -1 | zip -@ - | cat > file.zip
  # Note -@ is not supported on MacOS

Here, zip operates in streaming mode, resulting in a low memory footprint. A pipe has an integral buffer -- when the buffer is full, the OS suspends the writing program (the program on the left of the pipe). This ensures that zip works only as fast as its output can be written by cat.

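
The bounded-buffer behaviour described above is easy to check with a quick shell experiment (assuming standard GNU or BSD coreutils): pushing 10 MB through a pipe never requires 10 MB of memory, because the kernel suspends the writer whenever the pipe buffer (typically 64 KiB on Linux) fills up.

```shell
# head writes 10 MB into the pipe; the kernel blocks it each time the
# small pipe buffer fills, until wc drains it. The whole transfer
# completes with only a buffer's worth of data in flight at any moment.
head -c 10000000 /dev/zero | wc -c
# -> 10000000
```

The same mechanism is what throttles zip in the snippet above: it can never get ahead of the consumer by more than one pipe buffer.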

The optimal way, then, would be to do the same: replace cat with a web server process, streaming the zip file to the user as it is created on the fly. This would create little overhead compared to just streaming the files, and would have an unproblematic, non-spiky resource profile.


How can you achieve this on a LAMP stack?


Answered by Lee

You can use popen() (docs) or proc_open() (docs) to execute a unix command (e.g. zip or gzip), and get back stdout as a PHP stream. flush() (docs) will do its very best to push the contents of PHP's output buffer to the browser.


Combining all of this will give you what you want (provided that nothing else gets in the way -- see esp. the caveats on the docs page for flush()).


(Note: don't use flush(). See the update below for details.)


Something like the following can do the trick:

像下面这样的东西可以解决问题:

<?php
// make sure to send all headers first
// Content-Type is the most important one (probably)
//
header('Content-Type: application/x-gzip');

// use popen to execute a unix command pipeline
// and grab the stdout as a php stream
// (you can use proc_open instead if you need to 
// control the input of the pipeline too)
//
$fp = popen('tar cf - file1 file2 file3 | gzip -c', 'r');

// pick a bufsize that makes you happy (64k may be a bit too big).
$bufsize = 65535;
$buff = '';
while( !feof($fp) ) {
   $buff = fread($fp, $bufsize);
   echo $buff;
}
pclose($fp);
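
The pipeline that the PHP example drives via popen() can be exercised directly in the shell (file names here are illustrative): tar streams an archive to stdout and gzip compresses it on the fly, with no temporary archive staged on disk.

```shell
# Build two small files and stream them through tar | gzip, exactly
# as the popen() call above does. Only the final redirection touches
# the disk; the archive itself is never materialised in memory.
workdir=$(mktemp -d)
cd "$workdir"
printf 'hello' > file1
printf 'world' > file2
tar cf - file1 file2 | gzip -c > bundle.tar.gz
tar tzf bundle.tar.gz   # lists file1 and file2
```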


You asked about "other technologies": to which I'll say, "anything that supports non-blocking i/o for the entire lifecycle of the request". You could build such a component as a stand-alone server in Java or C/C++ (or any of many other available languages), if you were willing to get into the "down and dirty" of non-blocking file access and whatnot.


If you want a non-blocking implementation, but you would rather avoid the "down and dirty", the easiest path (IMHO) would be to use nodeJS. There is plenty of support for all the features you need in the existing release of nodejs: use the http module (of course) for the http server, and use the child_process module to spawn the tar/zip/whatever pipeline.


Finally, if (and only if) you're running a multi-processor (or multi-core) server, and you want the most from nodejs, you can use Spark2 to run multiple instances on the same port. Don't run more than one nodejs instance per processor core.




Update (from Benji's excellent feedback in the comments section on this answer)


1. The docs for fread() indicate that the function will read only up to 8192 bytes of data at a time from anything that is not a regular file. Therefore, 8192 may be a good choice of buffer size.


[editorial note] 8192 is almost certainly a platform-dependent value -- on most platforms, fread() will read data until the operating system's internal buffer is empty, at which point it will return, allowing the OS to fill the buffer again asynchronously. 8192 is the size of the default buffer on many popular operating systems.


There are other circumstances that can cause fread to return even less than 8192 bytes -- for example, when the "remote" client (or process) is slow to fill the buffer. In most cases, fread() will return the contents of the input buffer as-is, without waiting for it to get full. This could mean anywhere from 0 to os_buffer_size bytes get returned.


The moral is: the value you pass to fread() as buffsize should be considered a "maximum" size -- never assume that you've received the number of bytes you asked for (or any other number, for that matter).

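
The same short-read behaviour can be observed at the OS level (assuming a GNU or BSD dd): a read() on a pipe returns whatever happens to be buffered, not the full count requested. dd makes this visible by reporting partial records -- the count after the '+' in its summary.

```shell
# Ask dd to read in 8192-byte blocks, but only ever feed it 8 bytes.
# Every record it sees is therefore a *partial* record (the "0+N"
# in its summary), just as fread() may return fewer bytes than the
# buffsize it was given.
printf 'abcdefgh' | dd bs=8192 of=/dev/null 2>&1 | head -n 1
```

The first summary line starts with "0+": zero full 8192-byte records were ever read.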

2. According to comments on the fread docs, a few caveats: magic quotes may interfere and must be turned off.


3. Setting mb_http_output('pass') (docs) may be a good idea. Though 'pass' is already the default setting, you may need to specify it explicitly if your code or config has previously changed it to something else.


4. If you're creating a zip (as opposed to gzip), you'd want to use this content type header:


Content-type: application/zip

or... 'application/octet-stream' can be used instead (it's a generic content type used for binary downloads of all different kinds):


Content-type: application/octet-stream

and if you want the user to be prompted to download and save the file to disk (rather than potentially having the browser try to display the file as text), then you'll need the Content-Disposition header (where filename indicates the name that should be suggested in the save dialog):


Content-disposition: attachment; filename="file.zip"

One should also send the Content-Length header, but this is hard with this technique, as you don't know the zip's exact size in advance. Is there a header that can be set to indicate that the content is "streaming" or is of unknown length? Does anybody know?




Finally, here's a revised example that uses all of @Benji's suggestions (and that creates a ZIP file instead of a TAR.GZIP file):


<?php
// make sure to send all headers first
// Content-Type is the most important one (probably)
//
header('Content-Type: application/octet-stream');
header('Content-disposition: attachment; filename="file.zip"');

// use popen to execute a unix command pipeline
// and grab the stdout as a php stream
// (you can use proc_open instead if you need to 
// control the input of the pipeline too)
//
$fp = popen('zip -r - file1 file2 file3', 'r');

// pick a bufsize that makes you happy (8192 has been suggested).
$bufsize = 8192;
$buff = '';
while( !feof($fp) ) {
   $buff = fread($fp, $bufsize);
   echo $buff;
}
pclose($fp);


Update: (2012-11-23) I have discovered that calling flush() within the read/echo loop can cause problems when working with very large files and/or very slow networks. At least, this is true when running PHP as cgi/fastcgi behind Apache, and it seems likely that the same problem would occur in other configurations too. The problem appears when PHP flushes output to Apache faster than Apache can actually send it over the socket. For very large files (or slow connections), this eventually causes an overrun of Apache's internal output buffer. This causes Apache to kill the PHP process, which of course causes the download to hang, or complete prematurely, with only a partial transfer having taken place.


The solution is not to call flush() at all. I have updated the code examples above to reflect this, and I placed a note in the text at the top of the answer.


Answered by Emiller

Another solution is my mod_zip module for Nginx, written specifically for this purpose:


https://github.com/evanmiller/mod_zip


It is extremely lightweight and does not invoke a separate "zip" process or communicate via pipes. You simply point to a script that lists the locations of files to be included, and mod_zip does the rest.


Answered by Rico Sonntag

While trying to implement a dynamically generated download of many files of different sizes, I came across this solution, but I ran into various memory errors like "Allowed memory size of 134217728 bytes exhausted at ...".


After adding ob_flush(); right before flush();, the memory errors disappeared.


Together with sending the headers, my final solution looks like this (it just stores the files inside the zip without a directory structure):


<?php

// Sending headers
header('Content-Type: application/zip');
header('Content-Disposition: attachment; filename="download.zip"');
header('Content-Transfer-Encoding: binary');
ob_clean();
flush();

// On the fly zip creation
$fp = popen('zip -0 -j -q -r - file1 file2 file3', 'r');

while (!feof($fp)) {
    echo fread($fp, 8192);
    ob_flush();
    flush();
}

pclose($fp);

Answered by user3665185

I wrote this S3 streaming file zipper microservice last weekend -- might be useful: http://engineroom.teamwork.com/how-to-securely-provide-a-zip-download-of-a-s3-file-bundle/


Answered by Josh Davis

According to the PHP manual, the ZIP extension provides a zip:// stream wrapper.


I have never used it and I don't know its internals, but logically it should be able to do what you're looking for, assuming that ZIP archives can be streamed, which I'm not entirely sure of.


As for your question about the "LAMP stack", it shouldn't be a problem as long as PHP is not configured to buffer output.




Edit: I'm trying to put a proof-of-concept together, but it seems non-trivial. If you're not experienced with PHP's streams, it might prove too complicated, if it's even possible.




Edit (2): rereading your question after taking a look at ZipStream, I found what's going to be your main problem here, when you say (emphasis added):


the operative Zipping should operate in streaming mode, ie processing files and providing data at the rate of the download.


That part will be extremely hard to implement because I don't think PHP provides a way to determine how full Apache's buffer is. So, the answer to your question is no, you probably won't be able to do that in PHP.


Answered by Hermann

It seems you can eliminate any output-buffer related problems by using fpassthru(). I also use -0 to save CPU time, since my data is compact already. I use this code to serve a whole folder, zipped on the fly:


chdir($folder);
$fp = popen('zip -0 -r - .', 'r');
header('Content-Type: application/octet-stream');
header('Content-disposition: attachment; filename="'.basename($folder).'.zip"');
fpassthru($fp);
pclose($fp);