Java: debugging JBoss 100% CPU usage

Notice: this page is based on a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/2449776/

Time: 2020-08-13 07:51:09  Source: igfitidea

debugging JBoss 100% CPU usage

Tags: java, debugging, jboss, web-applications, cpu-usage

Asked by NateS

Originally posted on Server Fault, where it was suggested this question might be better asked here.

We are using JBoss to run two of our WARs. One is our web app, the other is our web service. The web app accesses a database on another machine and makes requests to the web service. The web service makes JMS requests to other machines, aggregates the data, and returns it.

At our biggest client, about once a month the JBoss Java process takes 100% of all CPUs. The machine running JBoss has 8 CPUs. Our web app is still accessible during this time; however, pages take about 3 minutes to load. Restarting JBoss restores everything to normal.

The database machine and all the other machines are fine, only the machine running JBoss is affected. Memory usage is normal. Network utilization is normal. There are no suspect error messages in the JBoss logs.

I have set up a test environment as close as possible to the client's production environment and I've done load testing with as much as 2x the number of concurrent users. I have not gotten my test environment to replicate the problem.

Where do we go from here? How can we narrow down the problem?

Currently the only plan we have is to wait until the problem occurs in production on its own, then do some debugging to determine the cause. So far people have just restarted JBoss when the problem occurred, to minimize downtime. Next time it happens they will get a developer to take a look. The question is, next time it happens, what can be done to determine the cause?

We could set up a separate JBoss instance on the same box and install the web app separately from the web service. This way when the problem next occurs we will know which WAR has the problem (assuming it is our code). This doesn't narrow it down much though.

Should I enable JMX remote? This way the next time the problem occurs I can connect with VisualVM and see which threads are taking the CPU and what the hell they are doing. However, is there a significant downside to enabling JMX remote in a production environment?
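
For what it's worth, VisualVM is not the only client once remote JMX is on. Below is a minimal Java sketch that connects over the standard RMI JMX URL and prints the stack of the thread with the most accumulated CPU time. It assumes the target JVM was started with the usual com.sun.management.jmxremote.* system properties; the class name RemoteThreadProbe, the host, and the port are placeholders, not anything from the setup described above. On the downside question: an open JMX port without authentication and SSL exposes management operations to anyone who can reach it, so it would need to be locked down.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.net.MalformedURLException;
import javax.management.MBeanServerConnection;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Hypothetical helper: name, host and port are illustrative assumptions.
final class RemoteThreadProbe {

    // Builds the conventional RMI-based JMX service URL for host:port.
    static JMXServiceURL serviceUrl(String host, int port) {
        try {
            return new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://" + host + ":" + port + "/jmxrmi");
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Returns the name and stack of the thread with the most accumulated CPU time.
    static String busiestThread(MBeanServerConnection connection) {
        try {
            ThreadMXBean threads = ManagementFactory.newPlatformMXBeanProxy(
                    connection, ManagementFactory.THREAD_MXBEAN_NAME, ThreadMXBean.class);
            if (!threads.isThreadCpuTimeSupported() || !threads.isThreadCpuTimeEnabled()) {
                return "per-thread CPU time is not available on the target JVM";
            }
            long busiestId = -1;
            long busiestCpu = -1;
            for (long id : threads.getAllThreadIds()) {
                long cpu = threads.getThreadCpuTime(id); // nanoseconds, -1 if thread is gone
                if (cpu > busiestCpu) {
                    busiestCpu = cpu;
                    busiestId = id;
                }
            }
            ThreadInfo info = busiestId < 0
                    ? null
                    : threads.getThreadInfo(busiestId, Integer.MAX_VALUE);
            if (info == null) {
                return "no live thread with CPU data found";
            }
            StringBuilder out = new StringBuilder(info.getThreadName())
                    .append(" (").append(busiestCpu / 1_000_000).append(" ms CPU)\n");
            for (StackTraceElement frame : info.getStackTrace()) {
                out.append("    at ").append(frame).append('\n');
            }
            return out.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            // No host/port given: inspect this JVM, just to show the output format.
            System.out.println(busiestThread(ManagementFactory.getPlatformMBeanServer()));
            return;
        }
        JMXServiceURL url = serviceUrl(args[0], Integer.parseInt(args[1]));
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            System.out.println(busiestThread(connector.getMBeanServerConnection()));
        }
    }
}
```

Run with a host and port to probe a remote JVM, or with no arguments to try it against the local one. Accumulated CPU time favors long-lived threads, so treat the result as a first hint rather than a verdict.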

Is there another way to see what threads are eating the CPU and to get a stacktrace to see what they are doing?

Any other ideas?

Thanks!

Accepted answer by Alexander Torstling

I think you should definitely try to set up a test environment with some load testing in order to reproduce your issue. Profiling would definitely help in order to pinpoint the problem.

A quick fix would be to hit JBoss with kill -3 next time, in order to get a thread dump to analyze (the signal makes the JVM print the dump; it does not terminate the process). The second thing I would check is that you are running with the -server flag and that your GC settings are sane. You could also just run dstat to see what the process is doing during the lockup. But again, it is probably safer to just set up a load-testing environment (via EC2 or so) to reproduce this.
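
As a rough illustration of the "check your flags and GC settings" step, here is a small Java sketch that reads the same information from inside the JVM through the standard java.lang.management beans. The class name JvmSettingsReport is made up for this example, and it would have to run inside the JBoss JVM (for instance from a diagnostic page) to report on that process. A collector whose total time keeps climbing relative to uptime would point at GC thrashing rather than application code.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.RuntimeMXBean;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: prints the VM type, startup flags and GC totals.
final class JvmSettingsReport {

    static List<String> describe() {
        List<String> lines = new ArrayList<>();
        RuntimeMXBean runtime = ManagementFactory.getRuntimeMXBean();
        // On HotSpot the VM name usually says whether the server or client VM is in use.
        lines.add("VM: " + runtime.getVmName());
        // Flags passed on the command line, e.g. heap sizes and GC options.
        lines.add("JVM arguments: " + runtime.getInputArguments());
        long uptimeMs = runtime.getUptime();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // Both values are -1 when the collector does not report them.
            lines.add(String.format("GC %s: %d collections, %d ms total (uptime %d ms)",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime(), uptimeMs));
        }
        return lines;
    }

    public static void main(String[] args) {
        describe().forEach(System.out::println);
    }
}
```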

Answer by krosenvold

This typically happens with runaway code or unsafe thread access to hashmaps. A simple thread dump (kill -3, as @disown says, or Ctrl-Break in a Windows console) will reveal this problem.
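
To make the hashmap case concrete: a widely reported failure mode in older JDKs is that unsynchronized concurrent writes to a java.util.HashMap corrupt a bucket during a resize, after which a get() or put() spins forever on one CPU. The sketch below shows the usual remedy, a ConcurrentHashMap with atomic merge(), for a counter shared by several threads. SharedCounterDemo and the "page-N" keys are invented for illustration and are not taken from the application in the question.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical example of a map that is safe to update from many threads.
final class SharedCounterDemo {

    static Map<String, Integer> countHits(int threads, int hitsPerThread) {
        // A plain HashMap here would be a data race; ConcurrentHashMap is not.
        Map<String, Integer> hits = new ConcurrentHashMap<>();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            pool.execute(() -> {
                for (int i = 0; i < hitsPerThread; i++) {
                    // merge() is atomic on ConcurrentHashMap, so no update is lost.
                    hits.merge("page-" + (i % 10), 1, Integer::sum);
                }
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
                throw new IllegalStateException("workers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(countHits(4, 1000));
    }
}
```

In a thread dump, the broken variant tends to show up as one or more threads stuck inside HashMap methods across several consecutive dumps.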

Since you're unable to reproduce it using tests, I think it smells like a concurrency issue; it's usually hard to make test scripts behave randomly enough to catch issues of this type.

I normally try to make it standard operating procedure to take thread dumps of any JVM that is restarted due to operational anomalies, and it's really a requirement for catching those once-a-month things.

Answer by skaffman

There's a quick and dirty way of identifying which threads are using up the CPU time on JBoss. Go to the JMX Console with a browser (usually at http://localhost:8080/jmx-console, but it may be different for you) and look for a bean called ServerInfo. It has an operation called listThreadCpuUtilization, which dumps the actual CPU time used by each active thread in a nice tabular format. If there's one misbehaving, it usually sticks out like a sore thumb.
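
If the JMX Console is not reachable, a similar table can be approximated with the standard ThreadMXBean. The sketch below is not JBoss's implementation of listThreadCpuUtilization; it is a stand-in, with an invented class name ThreadCpuSampler, that measures how much CPU each thread of the current JVM burns during a short window and lists the busiest first.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a per-thread CPU table; samples the current JVM only.
final class ThreadCpuSampler {

    // Returns "cpu-ms  thread-name" rows for the sampling window, busiest first.
    static List<String> sample(long intervalMs) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        List<String> rows = new ArrayList<>();
        if (!threads.isThreadCpuTimeSupported() || !threads.isThreadCpuTimeEnabled()) {
            return rows; // this JVM cannot report per-thread CPU time
        }
        Map<Long, Long> before = new HashMap<>();
        for (long id : threads.getAllThreadIds()) {
            before.put(id, threads.getThreadCpuTime(id)); // nanoseconds, -1 if gone
        }
        try {
            Thread.sleep(intervalMs);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return rows;
        }
        List<Map.Entry<String, Long>> usage = new ArrayList<>();
        for (long id : threads.getAllThreadIds()) {
            Long start = before.get(id);
            long end = threads.getThreadCpuTime(id);
            ThreadInfo info = threads.getThreadInfo(id);
            if (start == null || start < 0 || end < 0 || info == null) {
                continue; // thread started or died during the window
            }
            long usedMs = (end - start) / 1_000_000;
            usage.add(new AbstractMap.SimpleEntry<String, Long>(info.getThreadName(), usedMs));
        }
        usage.sort((a, b) -> Long.compare(b.getValue(), a.getValue()));
        for (Map.Entry<String, Long> e : usage) {
            rows.add(String.format("%6d ms  %s", e.getValue(), e.getKey()));
        }
        return rows;
    }

    public static void main(String[] args) {
        // A deliberately busy thread so the table has an obvious outlier.
        Thread spinner = new Thread(() -> {
            while (!Thread.currentThread().isInterrupted()) {
                // burn CPU
            }
        }, "demo-spinner");
        spinner.setDaemon(true);
        spinner.start();
        sample(500).forEach(System.out::println);
        spinner.interrupt();
    }
}
```

Sampling a delta rather than reading cumulative CPU time keeps long-running but idle threads from dominating the list.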

There's also the listThreadDump operation, which dumps the stack for every thread to the browser.
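
The same kind of dump can be produced from plain Java with Thread.getAllStackTraces(). This is only a sketch of the idea, not what listThreadDump does internally, and the ThreadDumper name is invented; it covers the JVM it runs in, so it would have to be called from code deployed inside JBoss.

```java
import java.util.Map;

// Hypothetical helper: renders every live thread's stack as text.
final class ThreadDumper {

    static String dump() {
        StringBuilder out = new StringBuilder();
        for (Map.Entry<Thread, StackTraceElement[]> entry : Thread.getAllStackTraces().entrySet()) {
            Thread thread = entry.getKey();
            out.append('"').append(thread.getName()).append('"')
               .append(" state=").append(thread.getState()).append('\n');
            for (StackTraceElement frame : entry.getValue()) {
                out.append("    at ").append(frame).append('\n');
            }
            out.append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(dump());
    }
}
```

Taking two or three dumps a few seconds apart makes it easier to tell a thread that is stuck in a loop from one that merely happened to be running.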

Not as good as a profiler, but a much easier way to get the basic information. For production servers, where it's often bad news to connect a profiler, it's very handy.

Answer by Sabahat Theem

If you are using JBoss 5.1.0 EAP, there is a known bug in JBoss, and a fix is available. Here is the URL: https://issues.jboss.org/browse/JBPAPP-5193
