Disclaimer: This page reproduces a popular StackOverflow question and its answers. It is provided under the CC BY-SA 4.0 license; if you use it, you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow original: http://stackoverflow.com/questions/4393197/

Date: 2020-10-30 06:09:55  Source: igfitidea

Erlang's let-it-crash philosophy - applicable elsewhere?

Tags: java, .net, erlang, defensive-programming

Asked by Andrew Matthews

Erlang's (or Joe Armstrong's?) advice NOT to use defensive programming and to let processes crash (rather than pollute your code with needless guards trying to keep track of the wreckage) makes so much sense to me now that I wonder why I wasted so much effort on error handling over the years!

What I wonder is: is this approach only applicable to platforms like Erlang? Erlang has a VM with simple native support for process supervision trees, and restarting processes is really fast. Should I spend my development efforts (when not in the Erlang world) on recreating supervision trees rather than bogging myself down with top-level exception handlers, error codes, null results, etc.?

Do you think this change of approach would work well in (say) the .NET or Java space?

Accepted answer by Craig Stuntz

It's applicable everywhere. Whether or not you write your software in a "let it crash" pattern, it will crash anyway, e.g., when hardware fails. "Let it crash" applies anywhere where you need to withstand reality. Quoth James Hamilton:

If a hardware failure requires any immediate administrative action, the service simply won't scale cost-effectively and reliably. The entire service must be capable of surviving failure without human administrative interaction. Failure recovery must be a very simple path and that path must be tested frequently. Armando Fox of Stanford has argued that the best way to test the failure path is never to shut the service down normally. Just hard-fail it. This sounds counter-intuitive, but if the failure paths aren't frequently used, they won't work when needed.

This doesn't precisely mean "never use guards," though. But don't be afraid to crash!

Answer by rvirding

Yes, it is applicable everywhere, but it is important to note the context in which it is meant to be used. It does not mean that the application as a whole crashes, which, as @PeterM pointed out, can be catastrophic in many cases. The goal is to build a system which as a whole never crashes but can handle errors internally. In our case it was telecoms systems, which are expected to have downtime on the order of minutes per year.

The basic design is to layer the system and isolate central parts of the system to monitor and control the other parts, which do the work. In OTP terminology we have supervisor and worker processes. Supervisors have the job of monitoring the workers, and other supervisors, and of restarting them in the correct way when they crash, while the workers do all the actual work. Structuring the system properly in layers, using this principle of strictly separating the functionality, allows you to move most of the error handling out of the workers and into the supervisors. You try to end up with a small fail-safe error kernel which, if correct, can handle errors anywhere in the rest of the system. It is in this context that the "let-it-crash" philosophy is meant to be used.
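
A minimal Java sketch of the supervisor/worker split described above, assuming a synchronous, single-threaded setup (real OTP processes run concurrently with isolated heaps, which plain Java threads do not give you). `MiniSupervisor` and `supervise` are hypothetical names, not an existing library API. The worker contains no error handling; the supervisor restarts it on any `RuntimeException` and escalates by throwing once its restart budget is spent, so a parent supervisor can in turn restart the child supervisor.

```java
/**
 * Hypothetical minimal supervisor, loosely modelled on OTP supervisors.
 * Workers contain no error handling; the supervisor restarts them when they crash.
 */
public class MiniSupervisor {

    /**
     * Runs the worker and restarts it whenever it throws a RuntimeException.
     * Returns the number of restarts needed; after more than maxRestarts crashes it
     * gives up and escalates by throwing, so an enclosing supervisor can react.
     */
    static int supervise(Runnable worker, int maxRestarts) {
        int restarts = 0;
        while (true) {
            try {
                worker.run();
                return restarts; // the worker finished normally
            } catch (RuntimeException crash) {
                if (restarts >= maxRestarts) {
                    // Restart budget spent: escalate to the next level of the tree
                    throw new IllegalStateException("restart limit reached", crash);
                }
                restarts++;
                System.err.println("worker crashed (" + crash.getMessage() + "), restarting");
            }
        }
    }

    public static void main(String[] args) {
        int[] attempts = {0};
        // Worker: no guards, it simply throws when something goes wrong
        Runnable flakyWorker = () -> {
            attempts[0]++;
            if (attempts[0] < 3) {
                throw new RuntimeException("transient failure #" + attempts[0]);
            }
            System.out.println("worker succeeded on attempt " + attempts[0]);
        };
        // A supervisor is itself a Runnable, so a parent can supervise it: a two-level tree
        Runnable childSupervisor = () -> supervise(flakyWorker, 5);
        int parentRestarts = supervise(childSupervisor, 1);
        System.out.println("parent restarts: " + parentRestarts);
    }
}
```

Escalating with an exception once the budget is used up is loosely analogous to OTP's restart intensity: a failure that keeps recurring is pushed one level up the tree instead of being retried forever.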

You get the paradox that you are thinking about errors and failures everywhere, with the goal of actually handling them in as few places as possible.

The best way to handle an error depends, of course, on the error and the system. Sometimes it is best to try to catch errors locally within a process and handle them there, with the option of failing again if that doesn't work. If you have a number of worker processes cooperating, then it is often best to crash them all and restart them. It is a supervisor that does this.
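
Restarting a whole group of cooperating workers when one of them crashes corresponds to what OTP calls the `one_for_all` restart strategy. A minimal Java sketch, assuming a hypothetical `Worker` interface whose `init()` resets all of a worker's state: if any worker in the group crashes, the supervisor discards the whole group's progress and re-initialises every worker, since the survivors' state may depend on the one that failed. `Counter` is an invented demo worker.

```java
import java.util.List;

/**
 * Hypothetical sketch of OTP's one_for_all restart strategy: when any worker in a
 * cooperating group crashes, every worker in the group is restarted from a clean state.
 */
public class OneForAllSupervisor {

    /** Hypothetical worker contract: init() resets all state, step() does one unit of work and may throw. */
    interface Worker {
        void init();
        void step();
    }

    /** Re-initialises every worker in the group. */
    private static void initAll(List<? extends Worker> group) {
        for (Worker w : group) {
            w.init();
        }
    }

    /** Runs the group for the given number of rounds; returns how many group restarts were needed. */
    static int run(List<? extends Worker> group, int rounds, int maxRestarts) {
        int restarts = 0;
        initAll(group);
        int round = 0;
        while (round < rounds) {
            try {
                for (Worker w : group) {
                    w.step();
                }
                round++;
            } catch (RuntimeException crash) {
                if (restarts >= maxRestarts) {
                    throw new IllegalStateException("restart limit reached", crash);
                }
                restarts++;
                // The survivors may hold state that depended on the crashed worker: reset them all
                initAll(group);
                round = 0;
            }
        }
        return restarts;
    }

    /** Demo worker that counts its steps and fails once, the first time it reaches failAt. */
    static final class Counter implements Worker {
        final int failAt;
        int count;
        boolean failedOnce; // survives init() to simulate a transient, external fault

        Counter(int failAt) {
            this.failAt = failAt;
        }

        public void init() {
            count = 0;
        }

        public void step() {
            if (!failedOnce && count == failAt) {
                failedOnce = true;
                throw new RuntimeException("crash at count " + count);
            }
            count++;
        }
    }

    public static void main(String[] args) {
        Counter steady = new Counter(-1); // never fails
        Counter flaky = new Counter(2);   // fails once, on its third step
        int restarts = run(List.of(steady, flaky), 5, 3);
        System.out.println("restarts=" + restarts + " steady=" + steady.count + " flaky=" + flaky.count);
    }
}
```

The steady worker's partial progress is thrown away along with the flaky one's; that is the deliberate trade-off of this strategy compared with restarting only the crashed worker (OTP's `one_for_one`).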

You do need a language which generates errors/exceptions when something goes wrong so you can trap them or have them crash the process. Just ignoring error return values is not the same thing.
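
To illustrate the difference, here is a small Java sketch that puts a hypothetical return-code parser next to the exception-throwing `Integer.parseInt`. The ignored `-1` silently corrupts the result, while the exception is either handled deliberately or stops the current unit of work, which a supervisor can then restart. The helper names are invented for the example.

```java
/** Contrast between an ignored error code and an exception (hypothetical helper names). */
public class ErrorCodesVsExceptions {

    // Return-code style: -1 means "not a number", but nothing forces the caller to check it
    static int parseOrMinusOne(String s) {
        try {
            return Integer.parseInt(s);
        } catch (NumberFormatException e) {
            return -1;
        }
    }

    // The caller forgets to check the code, so a bad input silently skews the total
    static int totalIgnoringCodes(String... inputs) {
        int sum = 0;
        for (String s : inputs) {
            sum += parseOrMinusOne(s);
        }
        return sum;
    }

    // Exception style: Integer.parseInt throws NumberFormatException on bad input,
    // so the failure cannot be silently ignored; it propagates until someone handles it
    static int totalStrict(String... inputs) {
        int sum = 0;
        for (String s : inputs) {
            sum += Integer.parseInt(s);
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(totalIgnoringCodes("10", "oops", "5")); // quietly wrong: 14
        System.out.println(totalStrict("10", "oops", "5"));        // crashes with NumberFormatException
    }
}
```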

Answer by Christoph Woskowski

My colleagues and I thought about the topic not so much from a technology perspective as from a domain perspective, with a focus on safety.

The question is "Is it safe to let it crash?" or, better, "Is it even possible to apply a robustness paradigm like Erlang's 'let it crash' to safety-related software projects?"

In order to find an answer we did a small research project using a close-to-reality scenario with industrial and especially medical background. Take a look here (http://bit.ly/Z-Blog_let-it-crash). There is even a paper for download. Tell me what you think!

Personally, I think it is applicable in many cases and even desirable, especially when there is a lot of error handling to do (safety-related systems). You cannot always use Erlang (missing real-time features, no real embedded support, customer wishes ...), but I'm pretty sure you can implement it otherwise (e.g. using threads, exceptions, message passing). I haven't tried it yet, but I'd like to.

Answer by Peter M

I write programs that rely on data from real world situations and if they crash they can cause big $$ in physical damage (not to mention big $$ in lost revenue). I would be out of a job in a flash if I did not program defensively.

With that said, I think Erlang must be a special case in that not only can you restart things instantly, but a restarted program can pop up, look around and say "ahhh .. that was what I was doing!"
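
An OTP supervisor restarts a child with the same start arguments and fresh state, so "looking around" only works if progress is kept somewhere outside the process that crashed. A hypothetical Java sketch of that idea, assuming the `Checkpoint` object stands in for durable storage (a file, a database row, another process); in a real system it would have to be persisted and updated atomically, which this in-memory version does not attempt.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/**
 * Hypothetical sketch: progress is committed outside the worker, so a restarted worker
 * can "look around" and resume where it left off instead of starting over.
 */
public class ResumeAfterRestart {

    /** Stands in for durable storage; a real system would persist this atomically. */
    static final class Checkpoint {
        int next;   // index of the next item to process
        long total; // result committed so far
    }

    /**
     * Worker with no error handling of its own. It commits progress after every item;
     * a simulated transient fault (first visit to an index in flakyOnce) makes it throw.
     */
    static void work(Checkpoint cp, List<Long> items, Set<Integer> flakyOnce) {
        while (cp.next < items.size()) {
            if (flakyOnce.remove(cp.next)) {
                throw new RuntimeException("transient fault at item " + cp.next);
            }
            cp.total += items.get(cp.next);
            cp.next++; // commit: a restarted worker will not redo this item
        }
    }

    /** Restarts the worker until it finishes; the checkpoint survives every crash. */
    static int superviseWork(Checkpoint cp, List<Long> items, Set<Integer> flakyOnce, int maxRestarts) {
        int restarts = 0;
        while (true) {
            try {
                work(cp, items, flakyOnce);
                return restarts;
            } catch (RuntimeException crash) {
                if (restarts >= maxRestarts) {
                    throw new IllegalStateException("restart limit reached", crash);
                }
                restarts++;
            }
        }
    }

    public static void main(String[] args) {
        Checkpoint cp = new Checkpoint();
        Set<Integer> flaky = new HashSet<>(Set.of(1, 3));
        int restarts = superviseWork(cp, List.of(1L, 2L, 3L, 4L), flaky, 5);
        System.out.println("restarts=" + restarts + " total=" + cp.total);
    }
}
```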

Answer by Edwin Buck

It is called fail-fast. It's a good paradigm provided you have a team of people who can respond to the failure (and do so quickly).
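
At the code level, fail-fast usually means checking preconditions where bad data first appears and throwing immediately, instead of storing it and failing (or silently misbehaving) somewhere far away later. A hypothetical Java sketch using the standard `Objects.requireNonNull` and `IllegalArgumentException`; the `FailFastAccount` class and its rules are invented for illustration.

```java
import java.util.Objects;

/** Hypothetical account class illustrating fail-fast precondition checks. */
public class FailFastAccount {
    private final String owner;
    private long balanceCents;

    FailFastAccount(String owner, long openingCents) {
        // Reject bad input at construction, before it can spread through the system
        this.owner = Objects.requireNonNull(owner, "owner must not be null");
        if (openingCents < 0) {
            throw new IllegalArgumentException("negative opening balance: " + openingCents);
        }
        this.balanceCents = openingCents;
    }

    void withdraw(long cents) {
        // Fail immediately and loudly rather than clamping or ignoring the request
        if (cents <= 0 || cents > balanceCents) {
            throw new IllegalArgumentException("invalid withdrawal of " + cents + " from " + balanceCents);
        }
        balanceCents -= cents;
    }

    long balanceCents() {
        return balanceCents;
    }

    String owner() {
        return owner;
    }

    public static void main(String[] args) {
        FailFastAccount acct = new FailFastAccount("alice", 1_000);
        acct.withdraw(250);
        System.out.println(acct.owner() + " has " + acct.balanceCents() + " cents"); // alice has 750 cents
        acct.withdraw(5_000); // throws IllegalArgumentException right here, at the faulty call
    }
}
```

The failed withdrawal leaves the balance untouched, so whoever responds to the failure sees a consistent object and a stack trace pointing at the faulty call.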

In the Navy, all piping and electrical wiring is mounted on the exterior of a wall (preferably on the more public side of the wall). That way, if there is a leak or other issue, it is more likely to be detected quickly. In the Navy, people are punished for not responding to a failure, so it works very well: failures are detected quickly and acted upon quickly.

In a scenario where someone cannot act on a failure quickly, it becomes a matter of opinion whether it is more beneficial to allow the failure to stop the system or to swallow the failure and attempt to continue onward.

Answer by Peter Lawrey

IMHO, some developers handle/wrap checked exceptions with code that adds little value. It is often simpler to let a method throw the original exception unless you are going to handle it and add some value.
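
A small Java sketch of this point, using the standard `Files.readString` (Java 11+) and `UncheckedIOException`; the class and method names are hypothetical. The first variant catches the checked `IOException` and returns `null`, which adds little and forces every caller to check for `null`; the second simply declares `throws IOException`; the third wraps only because it adds context about what failed.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Three ways to deal with a checked IOException (hypothetical method names). */
public class CheckedExceptionHandling {

    // Low value: the failure is reduced to a log line and a null that callers must remember to check
    static String readOrNull(Path path) {
        try {
            return Files.readString(path);
        } catch (IOException e) {
            System.err.println("could not read " + path);
            return null;
        }
    }

    // Simpler: let the original exception propagate to code that can actually act on it
    static String read(Path path) throws IOException {
        return Files.readString(path);
    }

    // Worth wrapping only when it adds something, here context about what the file was for
    static String readConfig(Path path) {
        try {
            return Files.readString(path);
        } catch (IOException e) {
            throw new UncheckedIOException("configuration file unreadable: " + path, e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path missing = Path.of("no-such-file.txt"); // assumed not to exist
        System.out.println(readOrNull(missing)); // prints null, and the caller carries on
        System.out.println(read(missing));       // throws NoSuchFileException (an IOException)
    }
}
```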

Answer by Henry H.

Yes, even in the economy; see this article: https://www.nytimes.com/2020/04/16/upshot/world-economy-restructuring-.html. The world has become "spaghetti code" and is suffering from a "global state" problem.
