Java 中如何分析内存碎片?
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/1253388/
Warning: these are provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverFlow
How to analyze memory fragmentation in java?
提问by Vitaly
We experience lags of several minutes in our server. They are probably triggered by "stop the world" garbage collections. But we use the concurrent mark and sweep GC (-XX:+UseConcMarkSweepGC), so I think these pauses are triggered by memory fragmentation of the old generation.
我们的服务器会出现几分钟的延迟。它们可能是由“停止世界”垃圾收集触发的。但是我们使用并发标记清除 GC(-XX:+UseConcMarkSweepGC),所以我认为这些暂停是由老年代的内存碎片引起的。
How can memory fragmentation of old generation be analyzed? Are there any tools for it?
如何分析old generation的内存碎片?有什么工具吗?
Lags happen every hour. Most of the time they are about 20 seconds, but sometimes they last several minutes.
延迟每小时都会发生。大多数时候大约是 20 秒,但有时会持续几分钟。
采纳答案by Stephen C
Look at your Java documentation for the "java -X..." options for turning on GC logging. That will tell you whether you are collecting old or new generation, and how long the collections are taking.
查看您的 Java 文档,了解用于打开 GC 日志记录的“java -X...”选项。这将告诉您收集的是老年代还是新生代,以及每次收集需要多长时间。
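The answer does not spell out the exact flags; for the HotSpot JVMs of that era (before unified logging in Java 9), a typical combination looks roughly like the following, where gc.log and MyApp are just placeholders:

```
java -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:gc.log MyApp
```

The resulting log shows, for each collection, whether it was a young-generation or full collection and how long it paused the application.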
A pause of "several minutes" sounds extraordinary. Are you sure that you aren't just running with a heap size that is too small, or on a machine with not enough physical memory?
“几分钟”的停顿听起来非同寻常。您确定您不只是在堆大小太小的情况下运行,或者在物理内存不足的机器上运行吗?
If your heap is too close to full, the GC will be triggered again and again, resulting in your server spending most of its CPU time in the GC. This will show up in the GC logs.
If you use a large heap on a machine with not enough physical memory, a full GC is liable to cause your machine to "thrash", spending most of its time madly moving virtual memory pages to and from disc. You can observe this using system monitoring tools; e.g. by watching the console output from "vmstat 5" on a typical UNIX/Linux system.
如果你的堆接近占满,GC 将被一次又一次地触发,导致你的服务器把大部分 CPU 时间花在 GC 上。这将显示在 GC 日志中。
如果你在一台没有足够物理内存的机器上使用一个大堆,一个完整的 GC 很可能导致你的机器“颠簸”,大部分时间都在疯狂地将虚拟内存页面移入和移出磁盘。您可以使用系统监控工具观察这一点;例如,通过在典型的 UNIX/Linux 系统上观察“vmstat 5”的控制台输出。
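Not mentioned in the answer, but if attaching external tools is awkward, the standard java.lang.management API gives a rough programmatic view of the same thing. A minimal sketch (the GcTimeProbe class name is made up):

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Prints the cumulative collection count and accumulated GC time per collector.
// Sampling this twice, a minute apart, shows how much of that minute went to GC.
public class GcTimeProbe {
    public static void main(String[] args) {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%s: %d collections, %d ms total%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
    }
}
```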
FOLLOWUP
跟进
Contrary to the OP's belief, turning on GC logging is unlikely to make a noticeable difference to performance.
与 OP 的看法相反,打开 GC 日志记录不太可能对性能产生明显的影响。
The Understanding Concurrent Mark Sweep Garbage Collector Logs page on the Oracle site should be helpful in interpreting GC logs.
Oracle 站点上的“了解并发标记清除垃圾收集器日志”页面应该有助于解释 GC 日志。
Finally, the OP's conclusion that this is a "fragmentation" problem is unlikely to be correct, and (IMO) is unsupported by the snippets of evidence that he has provided. It is most likely something else.
最后,OP 认为这是“碎片化”问题的结论不太可能成立,而且(IMO)也没有得到他提供的证据片段的支持。问题很可能出在别处。
回答by Vladimir Ralev
For low-level monitoring you will want to use -XX:PrintFLSStatistics=1 (or set it to 2 for more detail, at a higher blocking cost). It's undocumented and occasionally gives you some stats. Unfortunately it's not very useful in most applications for various reasons, but it's at least ballpark-useful.
对于低级监控,您会需要使用 -XX:PrintFLSStatistics=1(或将其设为 2 以获得更多信息,但阻塞开销更大)。它没有正式文档,偶尔会给你一些统计数据。不幸的是,出于各种原因,它在大多数应用程序中并不是很有用,但至少能给出一个大致的参考。
You should be able to see for example
例如,您应该能够看到
Max Chunk Size: 215599441
and compare it to this
并将其与此进行比较
Total Free Space: 219955840
and then judge the fragmentation based on the average block sizes and number of blocks.
然后根据平均块大小和块数判断碎片。
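The answer leaves the judgement to the reader; one commonly used heuristic (my addition, not from the answer) is to treat any free space that is not part of the largest contiguous chunk as fragmented:

```java
// Values come straight out of the -XX:PrintFLSStatistics=1 output (in heap words).
// A result near 0 means the free space is mostly one big chunk; near 1 means it is
// scattered across many small chunks, so large allocations may force a full GC.
static double fragmentationRatio(long totalFreeSpace, long maxChunkSize) {
    if (totalFreeSpace == 0) return 0.0;
    return 1.0 - (double) maxChunkSize / totalFreeSpace;
}
// With the figures above: 1 - 215599441.0 / 219955840 ≈ 0.02, i.e. hardly fragmented.
```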
回答by Alex Punnen
This is a bit of a hard problem to figure out. Since I had spent some time in a system finding this out and proving it, let me list the scenario in which it happened:
这是一个很难查明的问题。由于我曾花了一些时间在一个系统中查明并证明这一点,让我列出发生这种情况的场景:
- We were stuck using Java 6, which did not have any compacting garbage collector
- Our application was doing too much GC, mostly young generation collection and some big old generation collections
- Our heap size was pretty big - the main problem (we reduced it, but our application was guzzling too many strings and collections)
- 我们被迫使用 Java 6,它没有任何压缩式垃圾收集器
- 我们的应用程序做了太多的 GC,主要是年轻代收集和一些大的老年代收集
- 我们的堆非常大 - 这是主要问题(我们后来减小了,但我们的应用在太多的字符串和集合上大吃大喝)
The problem that manifested was that only one particular algorithm in our system was running slowly; the rest, all running at the same time, were running quite normally. This ruled out a full GC. Also, we were using jstat and other j** tools to check GC, plus thread dumps and tailing the GC logs.
表现出来的问题是我们系统中只有一种特定算法运行缓慢;其余同时运行的部分都非常正常。这排除了 Full GC。我们还使用 jstat 和其他 j** 工具来检查 GC,并配合线程转储和跟踪 GC 日志。
From jstack thread dumps taken over some time, we could get an idea of which code block was really slowing down. So suspicion fell on heap fragmentation.
从一段时间内采集的 jstack 线程转储中,我们可以了解到哪个代码块真正变慢了。所以怀疑就落在了堆碎片上。
To test that, I wrote a simple program that initialized two lists, an ArrayList and a LinkedList, and did add operations causing resizes. I could execute this test via a REST handle. Normally there is not much difference, but inside a fragmented heap there is a clear difference in timing: a big collection resize with an ArrayList becomes much slower than with a LinkedList. These timings were logged, and there was no other explanation for this than a fragmented heap.
为了测试这一点,我编写了一个简单的程序,初始化两个 List(一个 ArrayList 和一个 LinkedList),并执行会导致扩容的 add 操作。这个测试可以通过 REST 接口执行。通常两者没有太大区别,但是在碎片化的堆中,时间上有明显的差异:使用 ArrayList 对大集合扩容会变得比使用 LinkedList 慢得多。这些时间都被记录了下来,除了堆碎片化之外没有其他解释。
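The original program is not shown; below is a rough reconstruction of that probe under my own assumptions (the poster exposed his via a REST handle, and the class name and sizes here are made up). The point is that ArrayList.add() periodically needs one large contiguous backing array, which is exactly the allocation that suffers in a fragmented old generation, while LinkedList.add() only allocates small node objects:

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

public class FragmentationProbe {
    // Times n add() calls; ArrayList triggers periodic backing-array resizes,
    // LinkedList only allocates one small node per element.
    static long timeAdds(List<Integer> list, int n) {
        long start = System.nanoTime();
        for (int i = 0; i < n; i++) {
            list.add(i);
        }
        return (System.nanoTime() - start) / 1000000; // milliseconds
    }

    public static void main(String[] args) {
        int n = 5000000;
        System.out.println("ArrayList:  " + timeAdds(new ArrayList<Integer>(), n) + " ms");
        System.out.println("LinkedList: " + timeAdds(new LinkedList<Integer>(), n) + " ms");
    }
}
```

Comparable timings on a fresh heap, versus noticeably slower ArrayList timings after the server has been running for a while, would point the same way the poster's test did.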
With Java 7 we shifted to G1GC, along with a lot of work on GC tuning and improving the application. There, heap compaction is much better and it can handle bigger heaps, though I guess anything over a 16 GB heap will land you in places you don't really want to go - GC pain :)
到了 Java 7,我们切换到 G1GC,同时在 GC 调优和改进应用程序方面做了大量工作。G1 的堆压缩要好得多,也可以处理更大的堆,但我猜任何超过 16 GB 的堆都会把你带到你并不想去的地方 - GC 之痛 :)
回答by Alex Punnen
Vitaly, there is a fragmentation problem. My observation: if there are many small objects which are updated frequently, a lot of garbage is generated. Though CMS collects the memory occupied by these objects, this memory is fragmented. Then the Mark-Sweep-Compact thread comes into the picture (stop the world) and tries to compact this fragmented memory, causing a long pause.
Vitaly,确实存在碎片问题。我的观察:如果有大量频繁更新的小对象,就会产生大量垃圾。虽然 CMS 会回收这些对象占用的内存,但这些内存是碎片化的。这时 Mark-Sweep-Compact 线程介入(停止世界),尝试压缩这些碎片化的内存,从而导致长时间停顿。
Conversely, if the objects are bigger, the memory is less fragmented and Mark-Sweep-Compact takes less time to compact it. This may reduce throughput, but it will help you reduce the long pauses caused by GC compaction.
与此相反,如果对象较大,产生的内存碎片较少,Mark-Sweep-Compact 压缩这些内存所需的时间也更短。这可能会降低吞吐量,但会帮助你减少 GC 压缩引起的长时间停顿。
回答by Eric J.
To see how Vitaly probably handled this, see Understanding Concurrent Mark Sweep Garbage Collector Logs.
要了解 Vitaly 可能如何处理此问题,请参阅了解并发标记清除垃圾收集器日志。
回答by Aaron Digulla
There is no memory fragmentation in Java; during the GC run, memory areas are compacted.
Java 中没有内存碎片;在 GC 运行期间,内存区域被压缩。
Since you don't see a high CPU utilization, there is no GC running, either. So something else must be the cause of your problems. Here are a few ideas:
由于您没有看到高 CPU 利用率,因此也没有运行 GC。因此,您的问题一定是由其他原因引起的。这里有一些想法:
If the database of your application is on a different server, there may be network problems
If you run Windows and you have mapped network drives, one of the drives may lock up your computer (again network problems). The same is true for NFS drives on Unix. Check the system log for network errors.
Is the computer swapping lots of data to disk? Since CPU util is low, the cause of the problem could be that the app was swapped to disk and the GC run forced it back into RAM. This will take a long time if your server doesn't have enough real RAM to keep the whole Java app in memory.
如果您的应用程序的数据库在不同的服务器上,则可能存在网络问题
如果您运行 Windows 并且映射了网络驱动器,其中一个驱动器可能会锁定您的计算机(同样是网络问题)。Unix 上的 NFS 驱动器也是如此。检查系统日志是否有网络错误。
计算机是否将大量数据交换到磁盘?由于 CPU 利用率低,问题的原因可能是应用程序被交换到磁盘并且 GC 运行迫使它回到 RAM。如果您的服务器没有足够的实际 RAM 来将整个 Java 应用程序保存在 RAM 中,这将需要很长时间。
Also, other processes can force the app out of RAM. Check the real memory utilization and your swap space usage.
此外,其他进程可能会把应用程序挤出 RAM。检查实际内存使用情况和交换空间使用情况。
To understand the output of the GC log, this post might help.
要了解 GC 日志的输出,这篇文章可能会有所帮助。
[EDIT] I still can't get my head around "low CPU" and "GC stalls". Those two usually contradict each other. If the GC is stalling, you must see 100% CPU usage. If the CPU is idle, then something else is blocking the GC. Do you have objects which override finalize()? If a finalize blocks, the GC can take forever.
[编辑] 我仍然无法理解“低 CPU”和“GC 停顿”同时出现。这两者通常是相互矛盾的。如果 GC 停顿,你一定会看到 100% 的 CPU 使用率。如果 CPU 空闲,那么是其他东西在阻塞 GC。你的对象有没有重写 finalize()?如果 finalize 阻塞,GC 可能会一直等下去。
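To illustrate that last point (my example, not from the post): the JVM runs finalizers on a single finalizer thread, so one blocking finalize() backs up the whole finalization queue and the objects waiting in it cannot be reclaimed.

```java
public class BlockingFinalizer {
    @Override
    protected void finalize() throws Throwable {
        try {
            // Stand-in for a blocking call (slow I/O, a lock, a remote service, ...).
            // While this sleeps, the single finalizer thread is stuck and every other
            // finalizable object piles up behind it, unreclaimable.
            Thread.sleep(60000);
        } finally {
            super.finalize();
        }
    }
}
```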

