Note: this page reproduces a popular Stack Overflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original Stack Overflow question: http://stackoverflow.com/questions/3058198/

Time: 2020-08-13 16:04:22  Source: igfitidea

Can the JVM recover from an OutOfMemoryError without a restart

Tags: java, jvm, out-of-memory

Asked by sengs

  1. Can the JVM recover from an OutOfMemoryError without a restart if it gets a chance to run the GC before more object allocation requests come in?

  2. Do the various JVM implementations differ in this aspect?

My question is about the JVM recovering and not the user program trying to recover by catching the error. In other words, if an OOME is thrown in an application server (jboss/websphere/..), do I have to restart it? Or can I let it run if further requests seem to work without a problem?

Accepted answer by Stephen C

It may work, but it is generally a bad idea. There is no guarantee that your application will succeed in recovering, or that it will know if it has not succeeded. For example:

  • There really may be not enough memory to do the requested tasks, even after taking recovery steps like releasing a block of reserved memory. In this situation, your application may get stuck in a loop where it repeatedly appears to recover and then runs out of memory again.

  • The OOME may be thrown on any thread. If an application thread or library is not designed to cope with it, this might leave some long-lived data structure in an incomplete or inconsistent state.

  • If threads die as a result of the OOME, the application may need to restart them as part of the OOME recovery. At the very least, this makes the application more complicated.

  • Suppose that a thread synchronizes with other threads using notify/wait or some higher level mechanism. If that thread dies from an OOME, other threads may be left waiting for notifications (etc.) that never come. Designing for this could make the application significantly more complicated.

In summary, designing, implementing and testing an application to recover from OOMEs can be difficult, especially if the application (or the framework in which it runs, or any of the libraries it uses) is multi-threaded. It is a better idea to treat OOME as a fatal error.
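
As an illustration (not part of the original answer), one way to treat an OOME as fatal is a default uncaught-exception handler that escalates it. This is a minimal sketch: the class name FatalOomeHandler, the injected fatalAction, and the exit status 3 are all assumptions made for the example.

```java
/** Sketch: escalate any uncaught OutOfMemoryError to a caller-supplied fatal action. */
final class FatalOomeHandler implements Thread.UncaughtExceptionHandler {
    private final Runnable fatalAction;

    FatalOomeHandler(Runnable fatalAction) {
        this.fatalAction = fatalAction;
    }

    @Override
    public void uncaughtException(Thread thread, Throwable error) {
        if (error instanceof OutOfMemoryError) {
            // Do as little as possible here: the heap may still be exhausted.
            fatalAction.run();
        } else {
            System.err.println("Uncaught exception in " + thread.getName() + ": " + error);
        }
    }

    public static void main(String[] args) {
        // Assumption: exit status 3 is an arbitrary choice, and some external
        // supervisor is expected to restart the service after the JVM halts.
        Thread.setDefaultUncaughtExceptionHandler(
                new FatalOomeHandler(() -> Runtime.getRuntime().halt(3)));
        System.out.println("Fatal OOME handler installed");
    }
}
```

Note that this only sees OOMEs that propagate out of a thread uncaught. An OOME swallowed by a broad catch block elsewhere in the application or its libraries never reaches the handler.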

See also my answer to a related question:

EDIT - in response to this follow-up question:

In other words, if an OOME is thrown in an application server (jboss/websphere/..), do I have to restart it?

No, you don't have to restart. But it is probably wise to, especially if you don't have a good / automated way of checking that the service is running correctly.

The JVM will recover just fine. But the application server and the application itself may or may not recover, depending on how well they are designed to cope with this situation. (My experience is that some app servers are not designed to cope with this, and that designing and implementing a complicated application to recover from OOMEs is hard, and testing it properly is even harder.)

EDIT 2

In response to this comment:

"other threads may be left waiting for notifies (etc) that never come" Really? Wouldn't the killed thread unwind its stacks, releasing resources as it goes, including held locks?

Yes really! Consider this:

Thread #1 runs this:

    synchronized (lock) {
        while (!someCondition) {
            lock.wait();
        }
    }
    // ...

Thread #2 runs this:

    synchronized (lock) {
        // do stuff
        lock.notify();
    }

If Thread #1 is waiting on the notify, and Thread #2 gets an OOME in the // do stuff section, then Thread #2 won't make the notify() call, and Thread #1 may get stuck forever waiting for a notification that won't ever occur. Sure, Thread #2 is guaranteed to release the mutex on the lock object ... but that is not sufficient!
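
The scenario can be reproduced with a small, self-contained sketch (not part of the original answer). Two things here are assumptions made for the demonstration: the OOME is simulated by throwing it explicitly in Thread #2, and Thread #1 uses a timed wait purely so that the program terminates. With the untimed lock.wait() from the snippet above, Thread #1 would simply hang.

```java
import java.util.concurrent.TimeUnit;

final class LostNotifyDemo {
    private static final Object lock = new Object();
    private static boolean someCondition;   // guarded by lock

    /** Returns true if Thread #1 observed the condition, false if it gave up waiting. */
    static boolean waiterSawCondition(boolean notifierFails, long timeoutMillis) {
        synchronized (lock) {
            someCondition = false;
        }
        final boolean[] sawCondition = {false};

        Thread waiter = new Thread(() -> {
            long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
            synchronized (lock) {
                try {
                    while (!someCondition) {
                        long remainingMillis =
                                TimeUnit.NANOSECONDS.toMillis(deadline - System.nanoTime());
                        if (remainingMillis <= 0) {
                            return;             // the notification never arrived
                        }
                        lock.wait(remainingMillis);
                    }
                    sawCondition[0] = true;
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }, "thread-1");

        Thread notifier = new Thread(() -> {
            synchronized (lock) {
                if (notifierFails) {
                    // Simulated failure in the "do stuff" section.
                    throw new OutOfMemoryError("simulated OOME before notify()");
                }
                someCondition = true;
                lock.notify();
            }
            // The monitor is released either way, but notify() only runs on success.
        }, "thread-2");
        // Swallow the simulated error so the demo output stays readable.
        notifier.setUncaughtExceptionHandler((thread, error) -> { });

        waiter.start();
        notifier.start();
        try {
            waiter.join();
            notifier.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return sawCondition[0];
    }

    public static void main(String[] args) {
        System.out.println("notify() sent:    waiter saw condition = "
                + waiterSawCondition(false, 2000));
        System.out.println("OOME in thread 2: waiter saw condition = "
                + waiterSawCondition(true, 200));
    }
}
```

In the failing run the waiter acquires the lock without trouble, which shows that releasing the mutex is not the problem. The missing state change and notification are.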

If not, the code run by the thread is not exception safe, which is a more general problem.

"Exception safe" is not a term I've heard of (though I know what you mean). Java programs are not normally designed to be resilient to unexpected exceptions. Indeed, in a scenario like the above, it is likely to be somewhere between hard and impossible to make the application exception safe.

You'd need some mechanism whereby the failure of Thread #2 (due to the OOME) gets turned into an inter-thread communication failure notification to Thread #1. Erlang does this ... but not Java. The reason they can do this in Erlang is that Erlang processes communicate using strict CSP-like primitives; i.e. there is no sharing of data structures!

(Note that you could get the above problem for just about any unexpected exception ... not just Error exceptions. There are certain kinds of Java code where attempting to recover from an unexpected exception is likely to end badly.)

Answer by BalusC

The JVM will run the GC when it is on the edge of an OutOfMemoryError. If the GC doesn't help at all, then the JVM will throw an OOME.

You can, however, catch it and, if necessary, take an alternative path. Any allocations made inside the try block become eligible for GC once they are no longer reachable.

Since the OOME is "just" an Error which you could just catch, I would expect the different JVM implementations to behave the same. I can at least confirm from experience that the above is true for the Sun JVM.
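
A minimal sketch of the "catch it and take an alternative path" idea follows (an illustration, not code from the original answer). The class name, method name, and sizes are hypothetical. The demo assumes that a request for Integer.MAX_VALUE long elements cannot be satisfied, which is the case on typical HotSpot configurations.

```java
final class FallbackAllocation {
    /** Tries the preferred buffer size first and falls back to a smaller one on OOME. */
    static long[] allocate(int preferredLength, int fallbackLength) {
        try {
            return new long[preferredLength];
        } catch (OutOfMemoryError e) {
            // The failed array was never created, so nothing from the try block
            // is still reachable here; the alternative path starts from a clean slate.
            return new long[fallbackLength];
        }
    }

    public static void main(String[] args) {
        // Assumption: this request is far too large for the heap (or for the
        // VM's array size limit), so the fallback path is taken.
        long[] buffer = allocate(Integer.MAX_VALUE, 1024);
        System.out.println("Allocated " + buffer.length + " elements");
    }
}
```

This pattern is only reasonably safe when the failing allocation is a single, well-understood one in the current thread. It does nothing for the cases described in the accepted answer, where the OOME hits an unrelated thread.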

See also:

Answer by Amir Afghani

You can increase your odds of recovering from this scenario, although it's not recommended that you try. What you do is pre-allocate some fixed amount of memory on startup that's dedicated to doing your recovery work. When you catch the OOM, null out that pre-allocated reference, and you're more likely to have some memory to use in your recovery sequence.
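
The reserve technique described above might look roughly like this. It is a sketch under stated assumptions: the 1 MiB reserve size, the class name MemoryReserve, and the method names are invented for the example, and the OOME in main is simulated rather than real.

```java
final class MemoryReserve {
    // Assumption: 1 MiB is an arbitrary size; a real application would size the
    // reserve to whatever its recovery sequence actually needs.
    private static final int RESERVE_BYTES = 1 << 20;

    // Allocated at class initialization and held only so it can be dropped later.
    private static volatile byte[] reserve = new byte[RESERVE_BYTES];

    static boolean isHeld() {
        return reserve != null;
    }

    /** Runs the task; on OOME, drops the reserve before doing any recovery work. */
    static String runWithReserve(Runnable task) {
        try {
            task.run();
            return "completed";
        } catch (OutOfMemoryError e) {
            // Null out the reference first, so the allocations below (string
            // building, logging, cleanup) have a better chance of succeeding.
            reserve = null;
            return "recovered after " + e.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        String outcome = runWithReserve(() -> {
            throw new OutOfMemoryError("simulated");
        });
        System.out.println(outcome + ", reserve still held: " + isHeld());
    }
}
```

The reserve works once. After it has been released, the application is back to having no safety margin, which is one reason the answer advises against relying on it.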

I don't know about different JVM implementations.

Answer by Yishai

Any sane JVM will throw an OutOfMemoryError only if there is nothing the garbage collector can do. However, if you catch the OutOfMemoryError far enough up the call stack, it is quite likely that the cause has itself become unreachable and been garbage collected (unless the problem is not in the current thread).

Generally, for frameworks that run other code, like application servers, attempting to continue in the face of an OOME makes sense (as long as the framework can reasonably release the third-party code). Otherwise, in the general case, recovery should probably consist of bailing out and telling the user why, rather than trying to go on as if nothing happened.

To answer your newly updated question: There is no reason to think you need to shut down the server if all is working well. My experience with JBoss is that as long as the OOME didn't affect a deployment, things work fine. Sometimes JBoss runs out of permgen space if you do a lot of hot deployment. Then indeed the situation is hopeless and an immediate restart (which will have to be forced with a kill) is inevitable.

Of course each app server (and deployment scenario) will vary and it is really something learned from experience in each case.

Answer by JUST MY correct OPINION

Can it recover? Possibly. Any well-written JVM is only going to throw an OOME after it's tried everything it can to reclaim enough memory to do what you tell it to do. There's a very good chance that this means you can't recover. But...

It depends on a lot of things. For example if the garbage collector isn't a copying collector, the "out of memory" condition may actually be "no chunk big enough left to allocate". The very act of unwinding the stack may have objects cleaned up in a later GC round that leave open chunks big enough for your purposes. In that situation you may be able to restart. It's probably worth at least retrying once as a result. But...

You probably don't want to rely on this. If you're getting an OOME with any regularity, you'd better look over your server and find out what's going on and why. Maybe you have to clean up your code (you could be leaking or making too many temporary objects). Maybe you have to raise your memory ceiling when invoking the JVM. Treat the OOME, even if it's recoverable, as a sign that something bad has hit the fan somewhere in your code and act accordingly. Maybe your server doesn't have to come down NOWNOWNOWNOWNOW, but you will have to fix something before you get into deeper trouble.

Answer by Adam Crume

I'd say it depends partly on what caused the OutOfMemoryError. If the JVM truly is running low on memory, it might be a good idea to restart it, and with more memory if possible (or a more efficient app). However, I've seen a fair number of OOMEs that were caused by allocating 2GB arrays and such. In that case, if it's something like a J2EE web app, the effects of the error should be constrained to that particular app, and a JVM-wide restart wouldn't do any good.
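
To illustrate the "one oversized allocation" case, here is a rough sketch in which one request fails with an OOME while the next request on the same JVM succeeds. It is not how any particular application server works: the thread pool stands in for a server's request handling, and all names are hypothetical. It relies on the documented behavior that a Throwable thrown by a submitted task is reported through Future.get() as an ExecutionException.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class IsolatedRequests {
    /** Runs one request on the pool and reports how it ended. */
    static String handle(ExecutorService pool, Callable<Integer> request) {
        Future<Integer> result = pool.submit(request);
        try {
            return "ok: " + result.get();
        } catch (ExecutionException e) {
            // The task's failure (including an Error such as OOME) is the cause.
            return "failed: " + e.getCause().getClass().getSimpleName();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return "interrupted";
        }
    }

    /** Allocates an array of the requested length and returns its length. */
    static Integer allocateAndMeasure(int length) {
        long[] data = new long[length];
        return data.length;
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            // Assumption: this allocation is too large to satisfy, as on
            // typical HotSpot configurations.
            System.out.println(handle(pool, () -> allocateAndMeasure(Integer.MAX_VALUE)));
            // A modest request afterwards is served normally.
            System.out.println(handle(pool, () -> allocateAndMeasure(1024)));
        } finally {
            pool.shutdown();
        }
    }
}
```

This only shows the benign case the answer describes. If the heap is genuinely full, the second request is likely to fail as well.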

Answer by i000174

You cannot fully recover a JVM that has had an OutOfMemoryError. At least with the Oracle JVM, you can add -XX:OnOutOfMemoryError="cmd args;cmd args" and take recovery actions, such as killing the JVM or sending the event somewhere.

Reference: https://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html
