Java heap overwhelmed by unreachable objects

Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must do so under the same license, link to the original, and attribute it to the original authors (not me): StackOverFlow
Original: http://stackoverflow.com/questions/14370738/
Asked by Pawel Veselov
We have started having some serious problems with our Java EE application. Specifically, the application fills up to 99% of the old-generation heap within minutes of starting up. No OOMs are thrown, but the JVM is effectively unresponsive. jstat shows that the old generation does not decrease in size at all, no garbage collection is going on, and kill -3 says:
Heap
PSYoungGen total 682688K, used 506415K [0xc1840000, 0xf3840000, 0xf3840000)
eden space 546176K, 92% used [0xc1840000,0xe06cd020,0xe2da0000)
from space 136512K, 0% used [0xe2da0000,0xe2da0000,0xeb2f0000)
to space 136512K, 0% used [0xeb2f0000,0xeb2f0000,0xf3840000)
PSOldGen total 1536000K, used 1535999K [0x63c40000, 0xc1840000, 0xc1840000)
object space 1536000K, 99% used [0x63c40000,0xc183fff8,0xc1840000)
The VM options are:
-Xmx2300m -Xms2300m -XX:NewSize=800m -XX:MaxNewSize=800m -XX:SurvivorRatio=4 -XX:PermSize=256m -XX:MaxPermSize=256m -XX:+UseParallelGC -XX:ParallelGCThreads=4
(I changed it from 2300m heap / 1800m new gen as an attempt to resolve the problem)
I took a heap dump of the JVM once it got to the "out of memory" state (which took forever) and ran Eclipse Memory Analyzer on it.
The results are quite funny. About 200 MB is occupied by objects of all kinds (some own more than others), but the remaining 1.9 GB is all unreachable (it may be worth noting that the majority is occupied by GSON objects, but I don't think that indicates anything, other than that we churn through a lot of GSON objects during server operation).
Any explanation as to why the VM has so many unreachable objects, and is incapable of collecting them at all?
JVM:
$ /0/bin/java -version
java version "1.6.0_37"
Java(TM) SE Runtime Environment (build 1.6.0_37-b06)
Java HotSpot(TM) Server VM (build 20.12-b01, mixed mode)
When the system gets into this stall, here is what verbose GC keeps printing out:
922.485: [GC [1 CMS-initial-mark: 511999K(512000K)] 1952308K(2048000K), 3.9069700 secs] [Times: user=3.91 sys=0.00, real=3.91 secs]
926.392: [CMS-concurrent-mark-start]
927.401: [Full GC 927.401: [CMS927.779: [CMS-concurrent-mark: 1.215/1.386 secs] [Times: user=5.84 sys=0.13, real=1.38 secs] (concurrent mode failure): 511999K->511999K(512000K), 9.4827600 secs] 2047999K->1957315K(2048000K), [CMS Perm : 115315K->115301K(262144K)], 9.4829860 secs] [Times: user=9.78 sys=0.01, real=9.49 secs]
937.746: [Full GC 937.746: [CMS: 512000K->511999K(512000K), 8.8891390 secs] 2047999K->1962252K(2048000K), [CMS Perm : 115302K->115302K(262144K)], 8.8893810 secs] [Times: user=8.89 sys=0.01, real=8.89 secs]
SOLVED
As Paul Bellora suggested, this was caused by too many objects being created within the JVM in too short a period of time. Debugging becomes quite tedious at this point. What I ended up doing was instrumenting the classes using a custom JVM agent. The instrumentation counted method and constructor invocations, and the counts were then examined. I found that a single inconspicuous operation would create about 2 million objects and trigger certain individual methods about 1.5 million times (no, there were no loops). The operation itself was identified by being slow compared to the others. You can use any hotspot profiler as well (something like VisualVM), but I had all kinds of trouble with those, so I ended up writing my own.
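A minimal sketch of the counting side such an agent might use; the names here are illustrative, not MethodCountAgent's actual API, and the bytecode-rewriting ClassFileTransformer that would inject the hit() calls at the start of every instrumented method and constructor is omitted:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical counting core for a method-counting JVM agent: the agent's
// ClassFileTransformer would inject MethodCounter.hit("Cls.method") into
// every method and constructor it instruments.
final class MethodCounter {
    private static final Map<String, LongAdder> COUNTS = new ConcurrentHashMap<>();

    // Called (by injected bytecode) once per invocation of a method.
    public static void hit(String method) {
        COUNTS.computeIfAbsent(method, k -> new LongAdder()).increment();
    }

    public static long count(String method) {
        LongAdder a = COUNTS.get(method);
        return a == null ? 0 : a.sum();
    }

    // Dump counters sorted highest-first, to spot the inconspicuous
    // operation that fires millions of times.
    public static void dump() {
        COUNTS.entrySet().stream()
              .sorted((a, b) -> Long.compare(b.getValue().sum(), a.getValue().sum()))
              .forEach(e -> System.out.println(e.getValue().sum() + "\t" + e.getKey()));
    }
}
```

LongAdder (rather than AtomicLong) keeps the counters cheap under heavy concurrent updates, which matters when the instrumented methods are hit millions of times.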
I still think the behavior of the JVM is a mystery. It looks like the garbage collector comes to a stall and will not clean any more memory, yet the memory allocator expects it to (and thus no OOMs are thrown). Instead, I would have expected it to clear out all of that unreachable memory. But the application's behavior wouldn't have been much better off either way, as the majority of the time would have been spent garbage collecting anyway.
The agent that I used for help can be found here: https://github.com/veselov/MethodCountAgent. It's far from being a polished piece of software.
Accepted answer by Paul Bellora
Any explanation as to why the VM is having so many unreachable objects, and is uncapable of collecting them at all?
(Based on our exchange in the comments) it sounds like this is not a traditional memory leak, but some piece of logic that is continuously spamming new objects such that the GC struggles to keep up under the current architecture.
The culprit could be, for example, some API request that is being made many, many times, or else is "stuck" in some erroneous state like the infinite pagination scenario I described. Either situation boils down to millions of response gson objects (which point to Strings (which point to char[]s)) being instantiated and then becoming eligible for GC.
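The stuck-pagination scenario can be sketched as follows. Everything here is a hypothetical stand-in (the real app's response objects came from Gson), and the loop is capped so the sketch terminates; the real bug would spin forever:

```java
// Stand-in for a parsed JSON response: in the real app this would be a
// Gson DTO full of Strings (which point to char[]s).
final class Page {
    final String nextToken;
    final String[] items;
    Page(String nextToken, String[] items) { this.nextToken = nextToken; this.items = items; }
}

class PaginationChurn {
    // Simulated buggy API: it always returns the same next-page token,
    // so the client never sees the end of the results.
    static Page fetch(String token) {
        String[] items = new String[100];
        for (int i = 0; i < items.length; i++) items[i] = "item-" + token + "-" + i;
        return new Page("page-1", items);   // bug: the token never advances
    }

    public static void main(String[] args) {
        String token = "page-0";
        long pages = 0;
        // Real code would loop while (token != null) — i.e. forever here.
        // Capped at 10,000 iterations so the sketch terminates.
        while (pages < 10_000) {
            Page p = fetch(token);
            token = p.nextToken;            // never changes -> endless re-fetching
            pages++;
        }
        // Every Page and its 100 Strings are garbage after each iteration
        // (10,000 pages x 100 items = 1,000,000 short-lived Strings), yet
        // none of it is a leak: it is pure allocation churn.
        System.out.println("pages fetched: " + pages);
    }
}
```

Nothing in this loop is retained, so a heap dump would classify almost all of it as unreachable, matching what Eclipse Memory Analyzer reported.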
As I said, you should try to isolate the problem request(s), then debug and take measurements to decide whether this is a bug or a scalability issue on the part of your application or one of its libraries.
Answer by John Vint
Based on the stats you listed, I find it hard to believe you have 1.9G of unreachable data. It looks more like a GC Overhead Limit Reached.
Consider
937.746: [Full GC 937.746: [CMS: 512000K->511999K(512000K), 8.8891390 secs] 2047999K->1962252K(2048000K), [CMS Perm : 115302K->115302K(262144K)], 8.8893810 secs] [Times: user=8.89 sys=0.01, real=8.89 secs]
If this is true, then a Full GC releases only about 85 MB (2047999K -> 1962252K). If you really had 1.9G of unreachable data, you would see something like 2047999K -> ~300000K.
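For reference, the arithmetic behind that reading, with the values hand-copied from the Full GC log line:

```java
class GcLogMath {
    // Heap freed by one Full GC, in KB, from the before/after log figures.
    static long freedK(long beforeK, long afterK) {
        return beforeK - afterK;
    }

    public static void main(String[] args) {
        long beforeK = 2_047_999;   // "2047999K ->" in the log
        long afterK  = 1_962_252;   // "-> 1962252K(2048000K)"
        long totalK  = 2_048_000;

        long freed = freedK(beforeK, afterK);           // 85,747 KB, roughly 84 MB
        System.out.printf("freed %d KB (%.1f%% of the heap)%n",
                freed, 100.0 * freed / totalK);
        // If 1.9 GB of the heap were truly unreachable, the Full GC would
        // have freed ~1,900,000 KB, leaving the "after" figure near 300,000K.
    }
}
```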
Also
object space 1536000K, 99%
implies something was created and stored in such a way that it escaped its method and is now probably living forever.
I would need to see more evidence that you have 1.9G of unreachable data, other than simply being told so.