Generating ZIP files with PHP + Apache on-the-fly at high speed?

Disclaimer: this page is based on a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must follow the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/990462/


Tags: php, apache, zip

Asked by Vilx-

To quote some famous words:

“Programmers… often take refuge in an understandable, but disastrous, inclination towards complexity and ingenuity in their work. Forbidden to design anything larger than a program, they respond by making that program intricate enough to challenge their professional skill.”

While solving some mundane problem at work I came up with this idea, which I'm not quite sure how to solve. I know I won't be implementing this, but I'm very curious as to what the best solution is. :)



Suppose you have a big collection of JPG files and a few odd SWF files. By "big" I mean "a couple of thousand". Every JPG file is around 200KB, and the SWFs can be up to a few MB in size. Every day there are a few new JPG files. The total size of all the stuff is thus around 1 GB, and it is slowly but steadily increasing. Files are VERY rarely changed or deleted.

The users can view each of the files individually on the webpage. However, there is also a wish to allow them to download a whole bunch of them at once. The files have some metadata attached to them (date, category, etc.) that the user can filter the collection by.

The ultimate implementation would then be to allow the user to specify some filter criteria and then download the corresponding files as a single ZIP file.

Since the number of possible criteria combinations is large, I cannot pre-generate all the possible ZIP files and must do it on-the-fly. Another problem is that the download can be quite large, and for users with slow connections it's quite likely to take an hour or more. Support for "resume" is therefore a must-have.

On the bright side, however, the ZIP doesn't need to compress anything - the files are mostly JPEGs anyway. Thus the whole process shouldn't be more CPU-intensive than a simple file download.

The problems I have identified are thus:

  • PHP has execution timeout for scripts. While it can be changed by the script itself, will there be no problems by removing it completely?
  • With the resume option, there is the possibility of the filter results changing for different HTTP requests. This might be mitigated by sorting the results chronologically, as the collection is only getting bigger. The request URL would then also include a date when it was originally created and the script would not consider files younger than that. Will this be enough?
  • Will passing large amounts of file data through PHP not be a performance hit in itself?

How would you implement this? Is PHP up to the task at all?



Added:

By now two people have suggested storing the requested ZIP files in a temporary folder and serving them from there like ordinary files. While this is indeed an obvious solution, several practical considerations make it infeasible.

The ZIP files will usually be pretty large, ranging from a few tens of megabytes to hundreds of megabytes. It's also completely normal for a user to request "everything", meaning that the ZIP file will be over a gigabyte in size. Also, there are many possible filter combinations, and many of them are likely to be selected by the users.

As a result, the ZIP files will be pretty slow to generate (due to sheer volume of data and disk speed), and will contain the whole collection many times over. I don't see how this solution would work without some mega-expensive SCSI RAID array.

Answer by Hugh Bothwell

This may be what you need: http://pablotron.org/software/zipstream-php/

This lib allows you to build a dynamic streaming zip file without swapping to disk.

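For a feel of the API, a minimal sketch of streaming an archive this way might look like the following (method names are taken from the library's documentation of the time and may differ between versions; $files is an assumed map of entry names to file paths):

// Stream a ZIP straight to the browser; nothing is written to disk.
$zip = new ZipStream('photos.zip');           // archive name offered to the browser

foreach ($files as $name => $path) {
    $zip->add_file_from_path($name, $path);   // reads and streams each file out
}

$zip->finish();                               // writes the central directory, ends the stream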

Answer by Hugh Bothwell

I have a download page, and made a zip class that is very similar to your ideas. My downloads are very big files that can't be zipped properly with the zip classes out there.

I had similar ideas to yours. The approach of giving up compression is very good: not only do you need fewer CPU resources, you also save memory, because you don't have to touch the input files and can just pass them through. You can also calculate everything, like the ZIP headers and the final file size, very easily, and you can jump to any position and generate from that point to implement resume.
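
(As an aside, here is why the final size is easy to know in advance for an uncompressed archive: every byte count is fixed or known up front. A sketch, assuming plain STORE entries with no extra fields, data descriptors or zip64:)

// Size of a ZIP that only STOREs files (no compression), assuming
// no extra fields, no data descriptors and no zip64 records.
function zip_store_size(array $files) {       // $files: entry name => disk path
    $total = 22;                              // end-of-central-directory record
    foreach ($files as $name => $path) {
        $n = strlen($name);
        $total += 30 + $n + filesize($path);  // local file header + file data
        $total += 46 + $n;                    // central directory entry
    }
    return $total;
}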

I go even further: I generate one checksum from all the input files' CRCs, and use it as an ETag for the generated file to support caching, and as part of the filename. If you have already downloaded the generated ZIP file, the browser gets it from the local cache instead of from the server. You can also throttle the download rate (for example to 300KB/s). You can add zip comments. You can choose which files get added and which don't (for example thumbs.db).
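
A minimal sketch of that ETag idea, assuming $crcs already holds the per-file CRC values (the variable names here are made up):

// Combine the per-file CRCs into one value and use it as the ETag.
$etag = sprintf('"%08x"', crc32(implode(',', $crcs)));
header('ETag: ' . $etag);

// If the browser already has this exact archive, skip the transfer.
if (isset($_SERVER['HTTP_IF_NONE_MATCH']) &&
    trim($_SERVER['HTTP_IF_NONE_MATCH']) === $etag) {
    header('HTTP/1.1 304 Not Modified');
    exit;
}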

But there's one problem that you can't completely overcome with the ZIP format: the generation of the CRC values. Even if you use hash_file() to overcome the memory problem, or use hash_update() to generate the CRC incrementally, it will use too much CPU. Not much for one person, but not recommended for professional use. I solved this with an extra CRC value table that I generate with an extra script. I pass these CRC values to the zip class as a parameter. With this, the class is ultra fast, like a regular download script, as you mentioned.
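
The "extra script" could be a sketch along these lines: walk the collection once and cache every file's CRC-32 so nothing needs hashing at request time. (The author used a MySQL table; the JSON file and paths below are illustrative assumptions.)

$crcs = array();
foreach (glob('/var/www/collection/*.jpg') as $path) {
    // hash_file() streams the file from disk, so memory use stays flat
    $crcs[basename($path)] = hash_file('crc32b', $path);
}
file_put_contents('/var/www/crc-table.json', json_encode($crcs));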

My zip class is a work in progress; you can have a look at it here: http://www.ranma.tv/zip-class.txt

I hope I can help someone with that :)

But I will discontinue this approach; I will reprogram my class into a tar class. With tar I don't need to generate CRC values from the files; tar only needs some checksums for the headers, that's all. And I don't need an extra MySQL table any more. I think it makes the class easier to use if you don't have to create an extra CRC table for it. It's not so hard, because the tar file structure is simpler than the ZIP structure.
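
To illustrate why tar is lighter: the only checksum a tar entry needs is a byte sum over its own 512-byte header, with the checksum field itself counted as spaces, so the file contents are never read for checksumming. A sketch:

// Checksum for a 512-byte tar header: sum all header bytes, treating
// the 8-byte checksum field (offsets 148-155) as spaces.
function tar_header_checksum($header) {
    $sum = 0;
    for ($i = 0; $i < 512; $i++) {
        $sum += ($i >= 148 && $i < 156) ? ord(' ') : ord($header[$i]);
    }
    return sprintf("%06o\0 ", $sum);  // stored as 6 octal digits, NUL, space
}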

PHP has execution timeout for scripts. While it can be changed by the script itself, will there be no problems by removing it completely?

If your script is safe and it closes on user abort, then you can remove it completely. But it would be safer if you just renew the timeout on every file that you pass through :)
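
In PHP that renewal can be done with set_time_limit(), which restarts the timeout counter on every call; a sketch of such a loop ($files is a placeholder):

foreach ($files as $name => $path) {
    set_time_limit(30);          // restart the execution timeout for each file
    if (connection_aborted()) {  // stop cleanly if the user gave up
        exit;
    }
    readfile($path);             // pass the file straight through
    flush();
}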

With the resume option, there is the possibility of the filter results changing for different HTTP requests. This might be mitigated by sorting the results chronologically, as the collection is only getting bigger. The request URL would then also include a date when it was originally created and the script would not consider files younger than that. Will this be enough?

Yes, that would work. I generated a checksum from the input files' CRCs. I used this as an ETag and as part of the zip filename. If something changed, the user can't resume the generated zip, because the ETag and filename changed together with the content.

Will passing large amounts of file data through PHP not be a performance hit in itself?

No, if you only pass it through, it will not use much more than a regular download. Maybe 0.01% more, I don't know; it's not much :) I assume that's because PHP doesn't do much with the data :)

Answer by jitter

Use e.g. the PhpConcept Library Zip library.

Resuming must be supported by your webserver, except in the case where you don't make the zipfiles directly accessible. If you have a PHP script as a mediator, then pay attention to sending the right headers to support resuming.
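
At a minimum that means advertising range support and sending an exact length; a sketch (the filename and $total_size are placeholders):

header('Content-Type: application/zip');
header('Content-Disposition: attachment; filename="collection.zip"');
header('Accept-Ranges: bytes');            // tells the client that resuming is possible
header('Content-Length: ' . $total_size);  // exact size, so clients can verify the download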

The script creating the files shouldn't ever time out; just make sure the users can't select thousands of files at once. And keep something in place to remove "old zipfiles", and watch out that some malicious user doesn't use up your disk space by requesting many different file collections.
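
The cleanup can be a small cron-driven sketch like this (the cache directory and the one-day retention are assumptions):

// Delete cached zipfiles that haven't been touched for a day.
foreach (glob('/tmp/zip-cache/*.zip') as $path) {
    if (filemtime($path) < time() - 86400) {
        unlink($path);
    }
}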

Answer by linead

You're going to have to store the generated zip file if you want users to be able to resume downloads.

Basically you generate the zip file and chuck it in a /tmp directory with a repeatable filename (a hash of the search filters, maybe). Then you send the correct headers to the user and echo file_get_contents() to the user.
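
A sketch of that generate-and-cache step using PHP's ZipArchive extension ($filters and $files are placeholders for the search filters and the matching files):

// Repeatable cache path derived from the filter criteria.
$cache = '/tmp/zip-cache/' . sha1(serialize($filters)) . '.zip';

if (!file_exists($cache)) {
    $zip = new ZipArchive();
    $zip->open($cache, ZipArchive::CREATE);
    foreach ($files as $name => $path) {
        $zip->addFile($path, $name);  // queues the file; written on close()
    }
    $zip->close();
}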

To support resuming you need to check the $_SERVER['HTTP_RANGE'] value; its format is detailed here, and once you've parsed that, you'll need to run something like this.

$size = filesize($zip_file);

if (isset($_SERVER['HTTP_RANGE'])) {
    // Parse a header of the form "bytes=500-999" (or "bytes=500-");
    // multiple ranges are not handled here.
    list(, $seek_range) = explode('=', $_SERVER['HTTP_RANGE'], 2);
    $range = explode('-', $seek_range);
    $start = (int) $range[0];
    $end   = ($range[1] !== '') ? (int) $range[1] : $size - 1;
    $new_length = $end - $start + 1;  // byte ranges are inclusive

    header("HTTP/1.1 206 Partial Content");
    header("Content-Length: $new_length");
    header("Content-Range: bytes $start-$end/$size");
    echo file_get_contents($zip_file, false, null, $start, $new_length);
} else {
    header("Content-Length: $size");
    echo file_get_contents($zip_file);
}

This is very sketchy code; you'll probably need to play around with the headers and the handling of the HTTP_RANGE value a bit. You can use fopen() and fread() rather than file_get_contents() if you wish, and just fseek() to the right place.
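
That variant might look like this sketch, reusing $zip_file, $start and $new_length from the snippet above and streaming in chunks instead of buffering the whole range in memory:

$fp = fopen($zip_file, 'rb');
fseek($fp, $start);                 // jump to the resume position
$remaining = $new_length;
while ($remaining > 0 && !feof($fp) && !connection_aborted()) {
    $chunk = fread($fp, min(8192, $remaining));
    echo $chunk;
    flush();
    $remaining -= strlen($chunk);
}
fclose($fp);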

Now to your questions:

  • PHP has execution timeout for scripts. While it can be changed by the script itself, will there be no problems by removing it completely?

You can remove it if you want to. However, if something goes pear-shaped and your code gets stuck in an infinite loop, it can lead to interesting problems: should that infinite loop be logging an error somewhere and you don't notice, a rather grumpy sys-admin will eventually wonder why their server ran out of hard disk space ;)

  • With the resume option, there is the possibility of the filter results changing for different HTTP requests. This might be mitigated by sorting the results chronologically, as the collection is only getting bigger. The request URL would then also include a date when it was originally created and the script would not consider files younger than that. Will this be enough?

Caching the file to the hard disk means you won't have this problem.

  • Will passing large amounts of file data through PHP not be a performance hit in itself?

Yes, it won't be as fast as a regular download from the webserver, but it shouldn't be too slow.

Answer by Frosty Z

You can use ZipStream or PHPZip, which will send zipped files on the fly to the browser, divided into chunks, instead of loading the entire content in PHP and then sending the zip file.

Both libraries are nice and useful pieces of code. A few details:

  • ZipStream "works" only in memory, but cannot easily be ported to PHP 4 if necessary (it uses hash_file())
  • PHPZip writes temporary files on disk (it consumes as much disk space as the biggest file to add to the zip), but can easily be adapted for PHP 4 if necessary.