Hung JVM consuming 100% CPU
Note: this content comes from a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me) on StackOverflow.
Original question: http://stackoverflow.com/questions/4884630/
Asked by Bogdan
We have a Java server running on Sun JRE 6u20 on 32-bit Linux (CentOS). We use the Server HotSpot VM with the CMS collector and the following options (only the relevant ones are shown):
-Xmx896m -Xss128k -XX:NewSize=384M -XX:MaxPermSize=96m -XX:+UseParNewGC -XX:+UseConcMarkSweepGC
Sometimes, after running for a while, the JVM seems to slip into a hung state: even though we don't make any requests to the application, the CPU continues to spin at 100% (we have 8 logical CPUs, and it looks like only one CPU is spinning). In this state the JVM doesn't respond to SIGQUIT (kill -3) and we can't connect to it normally with jstack. We CAN connect with "jstack -F", but the output is dodgy (we see lots of NullPointerExceptions from jstack, apparently because it wasn't able to 'walk' some of the stacks). So the "jstack -F" output seems to be useless.
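For reference, the diagnostics described above amount to commands along these lines (<pid> is a placeholder for the Java process ID):

kill -3 <pid>     # request a thread dump via SIGQUIT; the hung JVM ignores it
jstack <pid>      # normal attach; fails in this state
jstack -F <pid>   # forced attach via the serviceability agent; partial, unreliable output
top -H -p <pid>   # per-thread view, to find which LWP is burning a CPU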
We did run a stack dump from gdb, though, and we were able to match the thread ID that spins the CPU (found using top's per-thread view, the "H" option) with a thread stack that appears in the gdb output. This is what it looks like:
Thread 443 (Thread 0x7e5b90 (LWP 26310)):
#0 0x0115ebd3 in CompactibleFreeListSpace::block_size(HeapWord const*) const () from /usr/java/jdk1.6.0_20/jre/lib/i386/server/libjvm.so
#1 0x01160ff9 in CompactibleFreeListSpace::prepare_for_compaction(CompactPoint*) () from /usr/java/jdk1.6.0_20/jre/lib/i386/server/libjvm.so
#2 0x0123456c in Generation::prepare_for_compaction(CompactPoint*) () from /usr/java/jdk1.6.0_20/jre/lib/i386/server/libjvm.so
#3 0x01229b2c in GenCollectedHeap::prepare_for_compaction() () from /usr/java/jdk1.6.0_20/jre/lib/i386/server/libjvm.so
#4 0x0122a7fc in GenMarkSweep::invoke_at_safepoint(int, ReferenceProcessor*, bool) () from /usr/java/jdk1.6.0_20/jre/lib/i386/server/libjvm.so
#5 0x01186024 in CMSCollector::do_compaction_work(bool) () from /usr/java/jdk1.6.0_20/jre/lib/i386/server/libjvm.so
#6 0x011859ee in CMSCollector::acquire_control_and_collect(bool, bool) () from /usr/java/jdk1.6.0_20/jre/lib/i386/server/libjvm.so
#7 0x01185705 in ConcurrentMarkSweepGeneration::collect(bool, bool, unsigned int, bool) () from /usr/java/jdk1.6.0_20/jre/lib/i386/server/libjvm.so
#8 0x01227f53 in GenCollectedHeap::do_collection(bool, bool, unsigned int, bool, int) () from /usr/java/jdk1.6.0_20/jre/lib/i386/server/libjvm.so
#9 0x0115c7b5 in GenCollectorPolicy::satisfy_failed_allocation(unsigned int, bool) () from /usr/java/jdk1.6.0_20/jre/lib/i386/server/libjvm.so
#10 0x0122859c in GenCollectedHeap::satisfy_failed_allocation(unsigned int, bool) () from /usr/java/jdk1.6.0_20/jre/lib/i386/server/libjvm.so
#11 0x0158a8ce in VM_GenCollectForAllocation::doit() () from /usr/java/jdk1.6.0_20/jre/lib/i386/server/libjvm.so
#12 0x015987e6 in VM_Operation::evaluate() () from /usr/java/jdk1.6.0_20/jre/lib/i386/server/libjvm.so
#13 0x01597c93 in VMThread::evaluate_operation(VM_Operation*) () from /usr/java/jdk1.6.0_20/jre/lib/i386/server/libjvm.so
#14 0x01597f0f in VMThread::loop() () from /usr/java/jdk1.6.0_20/jre/lib/i386/server/libjvm.so
#15 0x015979f0 in VMThread::run() () from /usr/java/jdk1.6.0_20/jre/lib/i386/server/libjvm.so
#16 0x0145c24e in java_start(Thread*) () from /usr/java/jdk1.6.0_20/jre/lib/i386/server/libjvm.so
#17 0x00ccd46b in start_thread () from /lib/libpthread.so.0
#18 0x00bc2dbe in clone () from /lib/libc.so.6
It seems that a JVM thread is spinning while doing some CMS-related work. We have checked the memory usage on the box: there seems to be enough memory available and the system is not swapping. Has anyone come across such a situation? Does it look like a JVM bug?
UPDATE
I've obtained some more information about this problem (it happened again on a server that had been running for more than 7 days). When the JVM entered the "hung" state it stayed like that for 2 hours, until the server was manually restarted. We have obtained a core dump of the process and the gc log. We tried to get a heap dump as well, but "jmap" failed. We tried "jmap -F", but only a 4Mb file was written before the program aborted with an exception (something about a memory location not being accessible).
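In principle a heap dump can also be extracted offline by pointing jmap from the matching JDK at the core file; a sketch (the core file name is illustrative):

jmap -dump:format=b,file=heap.hprof /usr/java/jdk1.6.0_20/bin/java core.26310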
So far I think the most interesting information comes from the gc log. It seems that the GC logging stopped as well (possibly around the time the VM thread went into the long loop):
657501.199: [Full GC (System) 657501.199: [CMS: 400352K->313412K(524288K), 2.4024120 secs] 660634K->313412K(878208K), [CMS Perm : 29455K->29320K(68568K)], 2.4026470 secs] [Times: user=2.39 sys=0.01, real=2.40 secs]
657513.941: [GC 657513.941: [ParNew: 314624K->13999K(353920K), 0.0228180 secs] 628036K->327412K(878208K), 0.0230510 secs] [Times: user=0.08 sys=0.00, real=0.02 secs]
657523.772: [GC 657523.772: [ParNew: 328623K->17110K(353920K), 0.0244910 secs] 642036K->330523K(878208K), 0.0247140 secs] [Times: user=0.08 sys=0.00, real=0.02 secs]
657535.473: [GC 657535.473: [ParNew: 331734K->20282K(353920K), 0.0259480 secs] 645147K->333695K(878208K), 0.0261670 secs] [Times: user=0.11 sys=0.00, real=0.02 secs]
....
....
688346.765: [GC [1 CMS-initial-mark: 485248K(524288K)] 515694K(878208K), 0.0343730 secs] [Times: user=0.03 sys=0.00, real=0.04 secs]
688346.800: [CMS-concurrent-mark-start]
688347.964: [CMS-concurrent-mark: 1.083/1.164 secs] [Times: user=2.52 sys=0.09, real=1.16 secs]
688347.964: [CMS-concurrent-preclean-start]
688347.969: [CMS-concurrent-preclean: 0.004/0.005 secs] [Times: user=0.00 sys=0.01, real=0.01 secs]
688347.969: [CMS-concurrent-abortable-preclean-start]
CMS: abort preclean due to time 688352.986: [CMS-concurrent-abortable-preclean: 2.351/5.017 secs] [Times: user=3.83 sys=0.38, real=5.01 secs]
688352.987: [GC[YG occupancy: 297806 K (353920 K)]688352.987: [Rescan (parallel) , 0.1815250 secs]688353.169: [weak refs processing, 0.0312660 secs] [1 CMS-remark: 485248K(524288K)] 783055K(878208K), 0.2131580 secs] [Times: user=1.13 sys=0.00, real=0.22 secs]
688353.201: [CMS-concurrent-sweep-start]
688353.903: [CMS-concurrent-sweep: 0.660/0.702 secs] [Times: user=0.91 sys=0.07, real=0.70 secs]
688353.903: [CMS-concurrent-reset-start]
688353.912: [CMS-concurrent-reset: 0.008/0.008 secs] [Times: user=0.01 sys=0.00, real=0.01 secs]
688354.243: [GC 688354.243: [ParNew: 344928K->30151K(353920K), 0.0305020 secs] 681955K->368044K(878208K), 0.0308880 secs] [Times: user=0.15 sys=0.00, real=0.03 secs]
....
....
688943.029: [GC 688943.029: [ParNew: 336531K->17143K(353920K), 0.0237360 secs] 813250K->494327K(878208K), 0.0241260 secs] [Times: user=0.10 sys=0.00, real=0.03 secs]
688950.620: [GC 688950.620: [ParNew: 331767K->22442K(353920K), 0.0344110 secs] 808951K->499996K(878208K), 0.0347690 secs] [Times: user=0.11 sys=0.00, real=0.04 secs]
688956.596: [GC 688956.596: [ParNew: 337064K->37809K(353920K), 0.0488170 secs] 814618K->515896K(878208K), 0.0491550 secs] [Times: user=0.18 sys=0.04, real=0.05 secs]
688961.470: [GC 688961.471: [ParNew (promotion failed): 352433K->332183K(353920K), 0.1862520 secs]688961.657: [CMS
I suspect this problem has something to do with the last line in the log (I've added some "...." to skip lines that were not interesting). The fact that the server stayed in the hung state for 2 hours (probably trying to GC and compact the old generation) seems quite strange to me. Also, the gc log stops suddenly with that message and nothing else gets printed any more, probably because the VM thread gets into some sort of infinite loop (or something that takes 2+ hours).
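If a ParNew promotion failure is what forces CMS into this compacting full collection, one commonly suggested mitigation (not verified against this particular hang) is to start the concurrent cycle earlier, so the old generation keeps more contiguous free space; for example, with an illustrative threshold:

-XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly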
Answered by Stephen C
It looks like a GC tuning problem, possibly triggered by a memory leak. I suggest that you add the JVM parameter that turns on GC logging, and see if that gives you any insights.
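For a HotSpot 6 JVM, a minimal set of such flags would look something like this (the log path is illustrative):

-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/var/log/myapp/gc.log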
Does it look like a JVM bug?
Not to me.
When you run out of heap space (for example, as a result of a storage leak) it is common to see the GC running more and more frequently and using a larger and larger percentage of the CPU. There are some JVM parameters that you can set to avoid this, but they have the effect of causing your JVM to keel over sooner with an OOME. (This may sound strange, but in many cases it is a good thing, because it reduces the time until your system is restarted and able to run normally again.)
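The parameters alluded to here are presumably HotSpot's GC overhead limit settings; shown below with their documented default values, they make the JVM throw an OutOfMemoryError once more than 98% of total time is spent in GC while less than 2% of the heap is recovered:

-XX:+UseGCOverheadLimit -XX:GCTimeLimit=98 -XX:GCHeapFreeLimit=2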
Answered by Sunil
I am also having this problem, on 64-bit Linux (Fedora 15). I am running Tomcat via jsvc on port 80, and sometimes the JVM hangs, consuming 100% CPU. But when I run
jmap -F -histo [tomcat-pID]
everything goes back to normal.
I read http://mail-archives.apache.org/mod_mbox/tomcat-users/200305.mbox/%[email protected]%3E
But this issue started happening all of a sudden without any change in the codebase.