在java servlet中流式传输大文件
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/55709/
Warning: these are provided under the CC BY-SA 4.0 license. You are free to use/share them, but you must attribute them to the original authors (not me):
StackOverFlow
Streaming large files in a java servlet
提问by Steve Buikhuizen
I am building a java server that needs to scale. One of the servlets will be serving images stored in Amazon S3.
我正在构建一个需要扩展的 Java 服务器。其中一个 servlet 将提供存储在 Amazon S3 中的图像。
Recently under load, I ran out of memory in my VM and it was after I added the code to serve the images so I'm pretty sure that streaming larger servlet responses is causing my troubles.
最近在负载下,我的 VM 内存不足,而这发生在我添加了提供图像的代码之后,所以我很确定是流式传输较大的 servlet 响应造成了问题。
My question is : is there any best practice in how to code a java servlet to stream a large (>200k) response back to a browser when read from a database or other cloud storage?
我的问题是:在编写 java servlet、把从数据库或其他云存储读取的大型(>200k)响应流式传输回浏览器时,有没有什么最佳实践?
I've considered writing the file to a local temp drive and then spawning another thread to handle the streaming so that the tomcat servlet thread can be re-used. This seems like it would be io heavy.
我已经考虑过把文件先写到本地临时磁盘,然后再生成另一个线程来处理流式传输,以便 tomcat 的 servlet 线程可以被重新使用。但这似乎会带来很重的 IO 开销。
Any thoughts would be appreciated. Thanks.
任何想法将不胜感激。谢谢。
采纳答案by John Vasileff
When possible, you should not store the entire contents of a file to be served in memory. Instead, acquire an InputStream for the data, and copy the data to the Servlet OutputStream in pieces. For example:
如果可能,您不应将要提供的文件的全部内容存储在内存中。相反,获取数据的 InputStream,并将数据分片复制到 Servlet OutputStream。例如:
ServletOutputStream out = null;
InputStream in = null;
try {
    String mimeType = [ code to get mimetype of data to be served ];
    response.setContentType(mimeType);

    in = [ code to get source input stream ];
    out = response.getOutputStream();

    byte[] bytes = new byte[FILEBUFFERSIZE]; // e.g. 8 * 1024
    int bytesRead;
    while ((bytesRead = in.read(bytes)) != -1) {
        out.write(bytes, 0, bytesRead);
    }
} finally {
    // close the streams in a finally block so they are released even if copying fails
    if (in != null) in.close();
    if (out != null) out.close();
}
I do agree with toby: you should instead "point them to the S3 url."
我同意 toby 的看法,你应该改为“将它们指向 S3 url”。
As for the OOM exception, are you sure it has to do with serving the image data? Let's say your JVM has 256MB of "extra" memory to use for serving image data. With Google's help, "256MB / 200KB" = 1310. For 2GB "extra" memory (these days a very reasonable amount) over 10,000 simultaneous clients could be supported. Even so, 1300 simultaneous clients is a pretty large number. Is this the type of load you experienced? If not, you may need to look elsewhere for the cause of the OOM exception.
至于OOM异常,你确定它与提供图像数据有关吗?假设您的 JVM 有 256MB 的“额外”内存用于提供图像数据。在 Google 的帮助下,“256MB / 200KB” = 1310。对于 2GB 的“额外”内存(现在是非常合理的数量),可以支持超过 10,000 个并发客户端。即便如此,1300 个并发客户端也是一个相当大的数字。这是您所经历的负载类型吗?如果没有,您可能需要在别处寻找 OOM 异常的原因。
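A quick back-of-the-envelope check of the arithmetic above, as a hedged illustration only (the 256MB and 200KB figures are the answer's assumptions, not measurements):
long extraHeapBytes = 256L * 1024 * 1024;   // the 256MB of "extra" heap assumed above
long perResponseBytes = 200L * 1024;        // one fully buffered 200KB image
System.out.println(extraHeapBytes / perResponseBytes); // prints 1310, matching the estimate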
Edit - Regarding:
编辑 - 关于:
In this use case the images can contain sensitive data...
在这个用例中,图像可能包含敏感数据......
When I read through the S3 documentation a few weeks ago, I noticed that you can generate time-expiring keys that can be attached to S3 URLs. So, you would not have to open up the files on S3 to the public. My understanding of the technique is:
几周前,当我通读 S3 文档时,我注意到您可以生成带有过期时间的密钥,并把它附加到 S3 URL 上。这样,您就不必向公众开放 S3 上的文件。我对这个技巧的理解是:
- Initial HTML page has download links to your webapp
- User clicks on a download link
- Your webapp generates an S3 URL that includes a key that expires in, let's say, 5 minutes (a rough sketch of this step follows the list below).
- Send an HTTP redirect to the client with the URL from step 3.
- The user downloads the file from S3. This works even if the download takes more than 5 minutes - once a download starts it can continue through completion.
- 初始 HTML 页面具有指向您的 web 应用程序的下载链接
- 用户点击下载链接
- 您的 web 应用程序生成一个 S3 URL,其中包含一个在(比如说)5 分钟后过期的密钥(这个列表后面附有一个粗略的示例)。
- 使用步骤 3 中的 URL 向客户端发送 HTTP 重定向。
- 用户从 S3 下载文件。即使下载时间超过 5 分钟,这也有效 - 一旦下载开始,它可以继续完成。
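Here is a rough sketch of steps 3 and 4 using the AWS SDK for Java (v1); the bucket name, object key, and client setup are placeholders of my own, not code from the original answer:
import java.util.Date;
import com.amazonaws.HttpMethod;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.GeneratePresignedUrlRequest;

AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
// Step 3: sign a URL that stops working 5 minutes from now
Date expiration = new Date(System.currentTimeMillis() + 5 * 60 * 1000);
GeneratePresignedUrlRequest presignRequest =
        new GeneratePresignedUrlRequest("my-private-bucket", "images/example.jpg") // placeholder bucket/key
                .withMethod(HttpMethod.GET)
                .withExpiration(expiration);
// Step 4: redirect the client to the signed, time-limited S3 URL
response.sendRedirect(s3.generatePresignedUrl(presignRequest).toString());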
回答by airportyh
Why wouldn't you just point them to the S3 url? Taking an artifact from S3 and then streaming it through your own server, to me, defeats the purpose of using S3, which is to offload the bandwidth and processing of serving the images to Amazon.
为什么不直接将它们指向 S3 网址?在我看来,从 S3 取出对象再通过你自己的服务器把它流式传输出去,违背了使用 S3 的初衷,即把提供图像所需的带宽和处理工作卸载给 Amazon。
回答by Marcio Aguiar
You have to check two things:
你必须检查两件事:
- Are you closing the stream? Very important
- Maybe you're giving out stream connections "for free". Each stream is not large, but many, many streams at the same time can steal all your memory. Create a pool so that you cannot have more than a certain number of streams running at the same time (a sketch follows this list)
- 你有没有关闭流?这非常重要
- 也许您正在“无限制地”提供流连接。单个流并不大,但同时存在的大量流会耗尽您所有的内存。创建一个池,使同时运行的流不超过一定数量(见此列表后的示例)
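As one hedged way to implement such a pool (the permit count and the copyImageToResponse helper are illustrative assumptions, not code from the answer), a java.util.concurrent.Semaphore inside the servlet class can cap how many responses stream at once:
private static final java.util.concurrent.Semaphore streamPermits =
        new java.util.concurrent.Semaphore(50); // example cap of 50 concurrent streams

protected void doGet(HttpServletRequest request, HttpServletResponse response)
        throws ServletException, IOException {
    if (!streamPermits.tryAcquire()) {
        // pool exhausted: refuse the request rather than let unbounded streams eat the heap
        response.sendError(HttpServletResponse.SC_SERVICE_UNAVAILABLE);
        return;
    }
    try {
        copyImageToResponse(request, response); // stands in for the streaming code shown earlier
    } finally {
        streamPermits.release(); // always return the permit, even if copying fails
    }
}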
回答by Tony BenBrahim
toby is right, you should be pointing straight to S3, if you can. If you cannot, the question is a little too vague to give an accurate response:
How big is your java heap? How many streams are open concurrently when you run out of memory?
How big is your read/write buffer (8K is good)?
You are reading 8K from the stream, then writing 8k to the output, right? You are not trying to read the whole image from S3, buffer it in memory, then sending the whole thing at once?
toby 是对的,如果可以的话,您应该直接指向 S3。如果不行,这个问题就有点模糊,无法给出准确的回答:你的 Java 堆有多大?当内存不足时,有多少流同时打开?
您的读/写缓冲区有多大(8K 比较合适)?
您正在从流中读取 8K,然后将 8k 写入输出,对吗?您不是要从 S3 读取整个图像,将其缓冲在内存中,然后立即发送整个图像吗?
If you use 8K buffers, you could have 1000 concurrent streams going in ~8Megs of heap space, so you are definitely doing something wrong....
如果您使用 8K 缓冲区,那么 1000 个并发流也只需要大约 8MB 的堆空间,所以您肯定是哪里做错了……
BTW, I did not pick 8K out of thin air, it is the default size for socket buffers, send more data, say 1Meg, and you will be blocking on the tcp/ip stack holding a large amount of memory.
顺便说一句,我不是凭空选择 8K 的,它是套接字缓冲区的默认大小;如果发送更大的数据块,比如 1Meg,你就会在 tcp/ip 协议栈上阻塞,同时占用大量内存。
回答by Johannes Passing
In addition to what John suggested, you should repeatedly flush the output stream. Depending on your web container, it is possible that it caches parts or even all of your output and flushes it at once (for example, to calculate the Content-Length header). That would burn quite a bit of memory.
除了 John 建议的内容之外,您还应该反复刷新(flush)输出流。根据您使用的 Web 容器,它可能会缓存部分甚至全部输出,并在最后一次性刷新(例如,为了计算 Content-Length 标头)。那会消耗相当多的内存。
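A minimal sketch of flushing as you copy, assuming in and out are the streams from John's example and an 8K buffer is used:
byte[] buffer = new byte[8 * 1024];
int bytesRead;
while ((bytesRead = in.read(buffer)) != -1) {
    out.write(buffer, 0, bytesRead);
    out.flush(); // push each chunk to the client instead of letting the container buffer the whole response
}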
回答by Stu Thompson
I agree strongly with both toby and John Vasileff: S3 is great for offloading large media objects if you can tolerate the associated issues. (An instance of my own app does that for 10-1000MB FLVs and MP4s.) E.g.: no partial requests (byte range header), though; one has to handle that "manually", plus occasional down time, etc.
我非常同意 toby 和 John Vasileff 的观点:如果你能容忍相关的问题,S3 非常适合用来卸载大型媒体对象。(我自己应用的一个实例就是这样处理 10-1000MB 的 FLV 和 MP4 的。)例如:不支持部分请求(字节范围标头),这类问题必须“手动”处理,另外还有偶尔的停机时间等。
If that is not an option, John's code looks good. I have found that a byte buffer of 2k FILEBUFFERSIZE is the most efficient in microbenchmarks. Another option might be a shared FileChannel. (FileChannels are thread-safe.)
如果这不是一个选项,John 的代码看起来不错。我发现把 FILEBUFFERSIZE 设为 2k 的字节缓冲区在微基准测试中效率最高。另一种选择可能是共享的 FileChannel。(FileChannel 是线程安全的。)
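A minimal sketch of the FileChannel alternative, assuming the image has already been written to a local file (the path is a placeholder); because transferTo takes an explicit position, a single channel could in principle be shared across request threads:
import java.io.FileInputStream;
import java.nio.channels.Channels;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;

FileChannel channel = new FileInputStream("/var/cache/images/example.jpg").getChannel(); // placeholder path
WritableByteChannel target = Channels.newChannel(response.getOutputStream());
long position = 0;
long size = channel.size();
while (position < size) {
    // transferTo may move fewer bytes than requested, so loop until the whole file is written
    position += channel.transferTo(position, size - position, target);
}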
That said, I'd also add that guessing at what caused an out of memory error is a classic optimization mistake. You would improve your chances of success by working with hard metrics.
话虽如此,我还要补充一点:靠猜测来判断内存不足错误的原因是一种典型的优化错误。依据实际的度量数据来分析,会提高你成功的几率。
- Place -XX:+HeapDumpOnOutOfMemoryError into your JVM startup parameters, just in case
- Use jmap on the running JVM (jmap -histo <pid>) under load
- Analyze the metrics (the jmap -histo output, or have jhat look at your heap dump). It may very well be that your out-of-memory error is coming from somewhere unexpected.
- 将 -XX:+HeapDumpOnOutOfMemoryError 放入 JVM 启动参数中,以防万一
- 在负载下对正在运行的 JVM 使用 jmap(jmap -histo <pid>)
- 分析这些度量数据(jmap -histo 的输出,或者让 jhat 查看您的堆转储)。很可能您的内存不足错误来自意想不到的地方。
There are of course other tools out there, but jmap & jhat come with Java 5+ 'out of the box'
当然还有其他工具,但是 jmap 和 jhat 随 Java 5+ 一起“开箱即用”
I've considered writing the file to a local temp drive and then spawning another thread to handle the streaming so that the tomcat servlet thread can be re-used. This seems like it would be io heavy.
我已经考虑过把文件先写到本地临时磁盘,然后再生成另一个线程来处理流式传输,以便 tomcat 的 servlet 线程可以被重新使用。但这似乎会带来很重的 IO 开销。
Ah, I don't think you can do that. And even if you could, it sounds dubious. The tomcat thread that is managing the connection needs to be in control. If you are experiencing thread starvation then increase the number of available threads in ./conf/server.xml. Again, metrics are the way to detect this; don't just guess.
啊,我认为你不能那样做。即使可以,这听起来也很可疑。管理连接的 tomcat 线程需要保持控制权。如果您遇到线程饥饿问题,请增加 ./conf/server.xml 中的可用线程数。同样,应该用度量数据来检测这一点,而不要只是猜测。
Question: Are you also running on EC2? What are your tomcat's JVM start up parameters?
问题:您是否也在 EC2 上运行?你的 tomcat 的 JVM 启动参数是什么?
回答by Emil Sit
If you can structure your files so that the static files are separate and in their own bucket, the fastest performance today can likely be achieved by using the Amazon S3 CDN, CloudFront.
如果您可以对文件进行组织,把静态文件分离出来并放在它们自己的存储桶中,那么目前最快的性能很可能可以通过使用 Amazon S3 的 CDN,即 CloudFront 来实现。
回答by blast_hardcheese
I've seen a lot of code like john-vasilef's (currently accepted) answer, a tight while loop reading chunks from one stream and writing them to the other stream.
我见过很多像 john-vasilef(目前被采纳)的答案那样的代码:用一个紧凑的 while 循环从一个流中读取数据块并写入另一个流。
The argument I'd make is against needless code duplication, in favor of using Apache's IOUtils. If you are already using it elsewhere, or if another library or framework you're using is already depending on it, it's a single line that is known and well-tested.
我要提出的论点是反对不必要的代码重复,赞成使用 Apache 的 IOUtils。如果您已经在其他地方使用它,或者您正在使用的另一个库或框架已经依赖于它,那么这只是一行众所周知且经过充分测试的代码。
In the following code, I'm streaming an object from Amazon S3 to the client in a servlet.
在以下代码中,我将一个对象从 Amazon S3 流式传输到 servlet 中的客户端。
import java.io.InputStream;
import java.io.OutputStream;
import org.apache.commons.io.IOUtils;
InputStream in = null;
OutputStream out = null;
try {
in = object.getObjectContent();
out = response.getOutputStream();
IOUtils.copy(in, out);
} finally {
IOUtils.closeQuietly(in);
IOUtils.closeQuietly(out);
}
6 lines of a well-defined pattern with proper stream closing seems pretty solid.
6 行定义明确、并且正确关闭流的模式代码,看起来相当可靠。