C++: Compiling an application for use in highly radioactive environments
Original question: http://stackoverflow.com/questions/36827659/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverflow
Compiling an application for use in highly radioactive environments
Asked by rook
We are compiling an embedded C/C++ application that is deployed in a shielded device in an environment bombarded with ionizing radiation. We are using GCC and cross-compiling for ARM. When deployed, our application generates some erroneous data and crashes more often than we would like. The hardware is designed for this environment, and our application has run on this platform for several years.
Are there changes we can make to our code, or compile-time improvements that can be made, to identify/correct soft errors and memory corruption caused by single event upsets? Have any other developers had success in reducing the harmful effects of soft errors on a long-running application?
Accepted answer by Ian
Having worked for about 4-5 years on software/firmware development and environment testing of miniaturized satellites*, I would like to share my experience here.
*(Miniaturized satellites are a lot more prone to single event upsets than bigger satellites because of the relatively small size of their electronic components.)
To be very concise and direct: there is no mechanism for the software/firmware to recover by itself from a detectable erroneous situation without at least one copy of a minimum working version of the software/firmware somewhere for recovery purposes, and with the hardware supporting the recovery still functional.
This situation is normally handled at both the hardware and the software level. Here, as you request, I will share what we can do at the software level.
...recovery purpose... Provide the ability to update/recompile/reflash your software/firmware in the real environment. This is an almost must-have feature for any software/firmware in a highly ionized environment. Without it, you could have as much redundant software/hardware as you want, but at some point it is all going to blow up. So, prepare this feature!
...minimum working version... Have multiple responsive copies of a minimum version of the software/firmware in your code. This is like Safe Mode in Windows. Instead of having only one fully functional version of your software, have multiple copies of the minimum version of your software/firmware. The minimum copy will usually be much smaller than the full copy and almost always has only the following two or three features:
- capable of listening to commands from an external system,
- capable of updating the current software/firmware,
- capable of monitoring the basic operation's housekeeping data.
...copy... somewhere... Have redundant software/firmware somewhere.
You could, with or without redundant hardware, try to have redundant software/firmware in your ARM uC. This is normally done by having two or more identical copies of the software/firmware at separate addresses, which send heartbeats to each other, but only one will be active at a time. If one or more copies are known to be unresponsive, switch to another copy. The benefit of this approach is that we can have a functional replacement immediately after an error occurs, without any contact with whatever external system/party is responsible for detecting and repairing the error (in the satellite case, it is usually the Mission Control Centre (MCC)).
Strictly speaking, without redundant hardware, the disadvantage of doing this is that you cannot eliminate all single points of failure. At the very least, you will still have one single point of failure, which is the switch itself (or often the beginning of the code). Nevertheless, for a device limited by size in a highly ionized environment (such as pico/femto satellites), reducing the single points of failure to one without additional hardware is still worth considering. Moreover, the piece of code for the switching would certainly be much smaller than the code for the whole program, significantly reducing the risk of a single event upset in it.
But if you are not doing this, you should have at least one copy in your external system which can come in contact with the device and update the software/firmware (in the satellite case, it is again the mission control centre).
- You could also keep the copy in your device's permanent memory storage, from which a restore of the running system's software/firmware can be triggered.
...detectable erroneous situation... The error must be detectable, usually by a hardware error correction/detection circuit or by a small piece of code for error correction/detection. It is best to keep such code small, replicated, and independent from the main software/firmware. Its main task is only checking/correcting. If the hardware circuit/firmware is reliable (for example, it is more radiation hardened than the rest, or it has multiple circuits/logics), then you might consider using it for error correction. If it is not, it is better to use it for error detection only. The correction can then be done by an external system/device. For error correction, you could consider a basic error correction algorithm like Hamming/Golay23, because they can be implemented more easily in both circuits and software. But it ultimately depends on your team's capability. For error detection, normally CRC is used.
...hardware supporting the recovery... Now comes the most difficult aspect of this issue. Ultimately, the recovery requires the hardware which is responsible for the recovery to be at least functional. If the hardware is permanently broken (which normally happens after its total ionizing dose reaches a certain level), then there is (sadly) no way for the software to help in recovery. Thus, hardware is rightly the concern of utmost importance for a device exposed to high radiation levels (such as a satellite).
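As a rough illustration of the CRC-based error detection mentioned above, here is a minimal sketch of a checker that could decide whether a stored firmware copy is still usable. The names crc32_bitwise and image_is_intact are hypothetical and not from the answer; the polynomial is the common reflected CRC-32 (0xEDB88320), chosen here only as an assumption.

```cpp
#include <cstddef>
#include <cstdint>

// Bitwise CRC-32 (reflected polynomial 0xEDB88320, initial value and final
// XOR of 0xFFFFFFFF). Table-free on purpose: there is no lookup table in RAM
// that a bit flip could silently corrupt.
std::uint32_t crc32_bitwise(const std::uint8_t* data, std::size_t len) {
    std::uint32_t crc = 0xFFFFFFFFu;
    for (std::size_t i = 0; i < len; ++i) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; ++bit) {
            const std::uint32_t mask = (crc & 1u) ? 0xEDB88320u : 0u;
            crc = (crc >> 1) ^ mask;
        }
    }
    return crc ^ 0xFFFFFFFFu;
}

// Hypothetical helper: true if the stored image still matches the checksum
// that was recorded when the image was written.
bool image_is_intact(const std::uint8_t* image, std::size_t len,
                     std::uint32_t expected_crc) {
    return crc32_bitwise(image, len) == expected_crc;
}
```

A boot switch could run a check like this over each copy and jump only to one that passes. This only detects corruption; repairing the image is left to a redundant copy or the external system.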
In addition to the suggestions above for anticipating firmware errors due to single event upsets, I would also suggest that you have:
- Error detection and/or error correction in the inter-subsystem communication protocol. This is another near must-have, in order to avoid incomplete/wrong signals received from other systems.
- Filtering of your ADC readings. Do not use an ADC reading directly. Filter it with a median filter, a mean filter, or any other filter; never trust a single reading value. Sample more, not less, within reason.
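To make the ADC advice concrete, here is a minimal median-filter sketch. The function name median_of and the 16-bit sample type are assumptions for illustration; how the samples are acquired is hardware-specific and not shown.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>

// Returns the median of N raw samples. With an odd N, a single corrupted
// reading (for example one flipped high bit) cannot reach the output as long
// as the majority of the samples are sane.
template <std::size_t N>
std::uint16_t median_of(std::array<std::uint16_t, N> samples) {  // by value: a copy is reordered
    static_assert(N % 2 == 1, "use an odd window so the median is a real sample");
    auto mid = samples.begin() + N / 2;
    std::nth_element(samples.begin(), mid, samples.end());
    return *mid;
}
```

For example, the window {512, 515, 4095, 511, 513} yields 513: the 4095 outlier is discarded rather than averaged in, which is the main reason to prefer a median over a mean when single readings can be wildly wrong.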
Answer by rsjaffe
NASA has a paper on radiation-hardened software. It describes three main tasks:
- Regular monitoring of memory for errors then scrubbing out those errors,
- robust error recovery mechanisms, and
- the ability to reconfigure if something no longer works.
Note that the memory scan rate should be frequent enough that multi-bit errors rarely occur, as most ECC memory can recover from single-bit errors, not multi-bit errors.
Robust error recovery includes control flow transfer (typically restarting a process at a point before the error), resource release, and data restoration.
Their main recommendation for data restoration is to avoid the need for it, through having intermediate data be treated as temporary, so that restarting before the error also rolls back the data to a reliable state. This sounds similar to the concept of "transactions" in databases.
They discuss techniques particularly suitable for object-oriented languages such as C++. For example:
- Software-based ECCs for contiguous memory objects
- Programming by Contract: verifying preconditions and postconditions, then checking the object to verify it is still in a valid state.
And, it just so happens, NASA has used C++ for major projects such as the Mars Rover.
C++ class abstraction and encapsulation enabled rapid development and testing among multiple projects and developers.
They avoided certain C++ features that could create problems:
- Exceptions
- Templates
- Iostream (no console)
- Multiple inheritance
- Operator overloading (other than new and delete)
- Dynamic allocation (used a dedicated memory pool and placement new to avoid the possibility of system heap corruption).
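The "dedicated memory pool and placement new" idea from the list above can be sketched roughly as follows. This is not NASA's code; Sample and SamplePool are hypothetical names, and a real pool would be sized and typed for the application.

```cpp
#include <cstddef>
#include <new>

struct Sample {                 // hypothetical payload type
    int channel;
    int value;
    Sample(int c, int v) : channel(c), value(v) {}
};

class SamplePool {              // hypothetical fixed-capacity pool, no heap use
public:
    static constexpr std::size_t kCapacity = 4;

    // Constructs a Sample inside the pool's own storage with placement new.
    // Returns nullptr when the pool is full instead of touching the heap.
    Sample* create(int channel, int value) {
        for (std::size_t i = 0; i < kCapacity; ++i) {
            if (!used_[i]) {
                used_[i] = true;
                return new (slot(i)) Sample(channel, value);
            }
        }
        return nullptr;
    }

    // Runs the destructor explicitly and marks the slot free again.
    void destroy(Sample* p) {
        for (std::size_t i = 0; i < kCapacity; ++i) {
            if (used_[i] && static_cast<void*>(p) == slot(i)) {
                p->~Sample();
                used_[i] = false;
                return;
            }
        }
    }

private:
    void* slot(std::size_t i) { return storage_ + i * sizeof(Sample); }

    alignas(Sample) unsigned char storage_[kCapacity * sizeof(Sample)];
    bool used_[kCapacity] = {};
};
```

Because the storage is a fixed array, the pool's address range is known at link time, so it can be placed in a chosen memory region and checked or scrubbed separately from everything else.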
Answer by Artelius
Here are some thoughts and ideas:
Use ROM more creatively.
Store anything you can in ROM. Instead of calculating things, store look-up tables in ROM. (Make sure your compiler is outputting your look-up tables to the read-only section! Print out memory addresses at runtime to check!) Store your interrupt vector table in ROM. Of course, run some tests to see how reliable your ROM is compared to your RAM.
Use your best RAM for the stack.
SEUs in the stack are probably the most likely source of crashes, because it is where things like index variables, status variables, return addresses, and pointers of various sorts typically live.
Implement timer-tick and watchdog timer routines.
You can run a "sanity check" routine every timer tick, as well as a watchdog routine to handle the system locking up. Your main code could also periodically increment a counter to indicate progress, and the sanity-check routine could ensure this has occurred.
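A minimal sketch of the progress-counter idea, assuming a hypothetical ProgressMonitor whose tick_is_healthy() would be called from the timer interrupt. What to do when it returns false (reset, safe mode) is left to the application.

```cpp
#include <cstdint>

struct ProgressMonitor {        // hypothetical name, for illustration only
    volatile std::uint32_t progress = 0;  // bumped by the main loop
    std::uint32_t last_seen = 0;          // value observed at the previous tick
    std::uint32_t stalled_ticks = 0;      // consecutive ticks without progress

    // Called by the main code each time it completes a unit of work.
    void main_loop_made_progress() { progress = progress + 1; }

    // Called on every timer tick. Returns false once the main loop has been
    // silent for more than max_stalled_ticks consecutive ticks.
    bool tick_is_healthy(std::uint32_t max_stalled_ticks) {
        const std::uint32_t now = progress;
        if (now != last_seen) {
            last_seen = now;
            stalled_ticks = 0;
            return true;
        }
        ++stalled_ticks;
        return stalled_ticks <= max_stalled_ticks;
    }
};
```

The check compares for "changed" rather than "greater than", so a counter that wraps around still counts as progress.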
Implement error-correcting codes in software.
You can add redundancy to your data to be able to detect and/or correct errors. This will add processing time, potentially leaving the processor exposed to radiation for a longer time, thus increasing the chance of errors, so you must consider the trade-off.
Remember the caches.
Check the sizes of your CPU caches. Data that you have accessed or modified recently will probably be within a cache. I believe you can disable at least some of the caches (at a big performance cost); you should try this to see how susceptible the caches are to SEUs. If the caches are hardier than RAM then you could regularly read and re-write critical data to make sure it stays in cache and bring RAM back into line.
Use page-fault handlers cleverly.
If you mark a memory page as not-present, the CPU will issue a page fault when you try to access it. You can create a page-fault handler that does some checking before servicing the read request. (PC operating systems use this to transparently load pages that have been swapped to disk.)
Use assembly language for critical things (which could be everything).
With assembly language, you know what is in registers and what is in RAM; you know what special RAM tables the CPU is using, and you can design things in a roundabout way to keep your risk down.
Use objdump to actually look at the generated assembly language, and work out how much code each of your routines takes up.
If you are using a big OS like Linux then you are asking for trouble; there is just so much complexity and so many things to go wrong.
Remember it is a game of probabilities.
A commenter said
Every routine you write to catch errors will be subject to failing itself from the same cause.
While this is true, the chances of errors in the (say) 100 bytes of code and data required for a check routine to function correctly is much smaller than the chance of errors elsewhere. If your ROM is pretty reliable and almost all the code/data is actually in ROM then your odds are even better.
Use redundant hardware.
Use 2 or more identical hardware setups with identical code. If the results differ, a reset should be triggered. With 3 or more devices you can use a "voting" system to try to identify which one has been compromised.
Answer by Eric Towers
You may also be interested in the rich literature on the subject of algorithmic fault tolerance. This includes the old assignment: write a sort that correctly sorts its input when a constant number of comparisons will fail (or, the slightly more evil version, when the asymptotic number of failed comparisons scales as log(n) for n comparisons).
A place to start reading is Huang and Abraham's 1984 paper "Algorithm-Based Fault Tolerance for Matrix Operations". Their idea is vaguely similar to homomorphic encrypted computation (but it is not really the same, since they are attempting error detection/correction at the operation level).
A more recent descendant of that paper is Bosilca, Delmas, Dongarra, and Langou's "Algorithm-based fault tolerance applied to high performance computing".
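To give a flavour of the checksum idea behind algorithm-based fault tolerance for matrix multiplication, here is a small sketch, not the papers' actual implementation. A gets an extra row of column sums, B gets an extra column of row sums, and their product then carries checksums that can be verified afterwards. All function names are hypothetical, and integer entries are assumed so that comparisons are exact.

```cpp
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<long>>;

// Plain product of an (n x k) and a (k x m) matrix.
Matrix multiply(const Matrix& a, const Matrix& b) {
    Matrix c(a.size(), std::vector<long>(b[0].size(), 0));
    for (std::size_t i = 0; i < a.size(); ++i)
        for (std::size_t k = 0; k < b.size(); ++k)
            for (std::size_t j = 0; j < b[0].size(); ++j)
                c[i][j] += a[i][k] * b[k][j];
    return c;
}

// Returns a with one extra row holding the sum of each column.
Matrix with_column_checksums(Matrix a) {
    std::vector<long> sums(a[0].size(), 0);
    for (const auto& row : a)
        for (std::size_t j = 0; j < row.size(); ++j) sums[j] += row[j];
    a.push_back(sums);
    return a;
}

// Returns b with one extra column holding the sum of each row.
Matrix with_row_checksums(Matrix b) {
    for (auto& row : b) {
        long sum = 0;
        for (long v : row) sum += v;
        row.push_back(sum);
    }
    return b;
}

// True if the last row and last column of c equal the sums of the other
// rows and columns, i.e. the product still carries consistent checksums.
bool checksums_hold(const Matrix& c) {
    const std::size_t last_row = c.size() - 1;
    const std::size_t last_col = c[0].size() - 1;
    for (std::size_t j = 0; j <= last_col; ++j) {
        long sum = 0;
        for (std::size_t i = 0; i < last_row; ++i) sum += c[i][j];
        if (sum != c[last_row][j]) return false;
    }
    for (std::size_t i = 0; i <= last_row; ++i) {
        long sum = 0;
        for (std::size_t j = 0; j < last_col; ++j) sum += c[i][j];
        if (sum != c[i][last_col]) return false;
    }
    return true;
}
```

Usage would be along the lines of multiply(with_column_checksums(A), with_row_checksums(B)) followed by checksums_hold() on the result. A single corrupted element breaks exactly one row sum and one column sum, which also locates it; with floating-point data the comparisons would need a tolerance.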
Answer by Lundin
Writing code for radioactive environments is not really any different than writing code for any mission-critical application.
In addition to what has already been mentioned, here are some miscellaneous tips:
- Use everyday "bread & butter" safety measures that should be present on any semi-professional embedded system: internal watchdog, internal low-voltage detect, internal clock monitor. These things shouldn't even need to be mentioned in the year 2016 and they are standard on pretty much every modern microcontroller.
- If you have a safety and/or automotive-oriented MCU, it will have certain watchdog features, such as a given time window, inside which you need to refresh the watchdog. This is preferred if you have a mission-critical real-time system.
- In general, use an MCU suitable for this kind of system, and not some generic mainstream fluff you received in a packet of corn flakes. Almost every MCU manufacturer nowadays has specialized MCUs designed for safety applications (TI, Freescale, Renesas, ST, Infineon, etc.). These have lots of built-in safety features, including lock-step cores, meaning that there are 2 CPU cores executing the same code, and they must agree with each other.
IMPORTANT: You must ensure the integrity of internal MCU registers. All control & status registers of hardware peripherals that are writeable may be located in RAM memory, and are therefore vulnerable.
To protect yourself against register corruptions, preferably pick a microcontroller with built-in "write-once" features of registers. In addition, you need to store default values of all hardware registers in NVM and copy-down those values to your registers at regular intervals. You can ensure the integrity of important variables in the same manner.
Note: always use defensive programming. This means that you have to set up all registers in the MCU and not just the ones used by the application. You don't want some random hardware peripheral to suddenly wake up.
There are all kinds of methods to check for errors in RAM or NVM: checksums, "walking patterns", software ECC etc etc. The best solution nowadays is to not use any of these, but to use a MCU with built-in ECC and similar checks. Because doing this in software is complex, and the error check in itself could therefore introduce bugs and unexpected problems.
- Use redundancy. You could store both volatile and non-volatile memory in two identical "mirror" segments, that must always be equivalent. Each segment could have a CRC checksum attached.
- Avoid using external memories outside the MCU.
- Implement a default interrupt service routine / default exception handler for all possible interrupts/exceptions. Even the ones you are not using. The default routine should do nothing except shutting off its own interrupt source.
Understand and embrace the concept of defensive programming. This means that your program needs to handle all possible cases, even those that cannot occur in theory.
High quality mission-critical firmware detects as many errors as possible, and then ignores them in a safe manner.
- Never write programs that rely on poorly-specified behavior. It is likely that such behavior might change drastically with unexpected hardware changes caused by radiation or EMI. The best way to ensure that your program is free from such crap is to use a coding standard like MISRA, together with a static analyser tool. This will also help with defensive programming and with weeding out bugs (why would you not want to detect bugs in any kind of application?).
IMPORTANT: Don't rely on the default values of static storage duration variables. That is, don't trust the default contents of .data or .bss. There could be any amount of time between the point of initialization and the point where the variable is actually used, so there could have been plenty of time for the RAM to get corrupted. Instead, write the program so that all such variables are set from NVM at run-time, just before the variable is used for the first time.
In practice this means that if a variable is declared at file scope or as static, you should never use = to initialize it (or you could, but it is pointless, because you cannot rely on the value anyhow). Always set it at run-time, just before use. If it is possible to repeatedly update such variables from NVM, then do so.
Similarly in C++, don't rely on constructors for static storage duration variables. Have the constructor(s) call a public "set-up" routine, which you can also call later on at run-time, straight from the caller application.
If possible, remove the "copy-down" start-up code that initializes .data and .bss (and calls C++ constructors) entirely, so that you get linker errors if you write code relying on it. Many compilers have an option to skip this, usually called "minimal/fast start-up" or similar.
This means that any external libraries have to be checked so that they don't contain any such reliance.
Implement and define a safe state for the program, to where you will revert in case of critical errors.
- Implementing an error report/error log system is always helpful.
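The advice above about keeping default register values in NVM and copying them down at regular intervals could look roughly like this. RegisterDefault and refresh_registers are hypothetical names; on a real MCU the table would be a const array placed in flash and the pointers would be the vendor's register addresses. Some registers are write-only or have side effects on write, so which ones are safe to refresh is device-specific.

```cpp
#include <cstddef>
#include <cstdint>

struct RegisterDefault {
    volatile std::uint32_t* reg;   // address of a writable control register
    std::uint32_t value;           // value the register is supposed to hold
};

// Rewrites every register that has drifted from its stored default.
// Returns how many registers had to be repaired, which is useful for
// error logging.
std::size_t refresh_registers(const RegisterDefault* table, std::size_t count) {
    std::size_t repaired = 0;
    for (std::size_t i = 0; i < count; ++i) {
        if (*table[i].reg != table[i].value) {
            *table[i].reg = table[i].value;
            ++repaired;
        }
    }
    return repaired;
}
```

Calling this from a periodic task bounds how long a flipped configuration bit can stay in effect. The same table-driven approach can be used for important variables, as the answer suggests.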
Answer by supercat
It may be possible to use C to write programs that behave robustly in such environments, but only if most forms of compiler optimization are disabled. Optimizing compilers are designed to replace many seemingly-redundant coding patterns with "more efficient" ones, and may have no clue that the reason the programmer is testing x==42 when the compiler knows there's no way x could possibly hold anything else is because the programmer wants to prevent the execution of certain code with x holding some other value, even in cases where the only way it could hold that value would be if the system received some kind of electrical glitch.
Declaring variables as volatile is often helpful, but may not be a panacea. Of particular importance, note that safe coding often requires that dangerous operations have hardware interlocks that require multiple steps to activate, and that code be written using the pattern:
    // ... code that checks system state
    if (system_state_favors_activation)
    {
        prepare_for_activation();
        // ... code that checks system state again
        if (system_state_is_valid)
        {
            if (system_state_favors_activation)
                trigger_activation();
        }
        else
            perform_safety_shutdown_and_restart();
    }
    cancel_preparations();
If a compiler translates the code in relatively literal fashion, and if all the checks for system state are repeated after the prepare_for_activation(), the system may be robust against almost any plausible single glitch event, even those which would arbitrarily corrupt the program counter and stack. If a glitch occurs just after a call to prepare_for_activation(), that would imply that activation would have been appropriate (since there's no other reason prepare_for_activation() would have been called before the glitch). If the glitch causes code to reach prepare_for_activation() inappropriately, but there are no subsequent glitch events, there would be no way for code to subsequently reach trigger_activation() without having passed through the validation check or calling cancel_preparations() first [if the stack glitches, execution might proceed to a spot just before trigger_activation() after the context that called prepare_for_activation() returns, but the call to cancel_preparations() would have occurred between the calls to prepare_for_activation() and trigger_activation(), thus rendering the latter call harmless].
Such code may be safe in traditional C, but not with modern C compilers. Such compilers can be very dangerous in that sort of environment because they aggressively strive to include only code which will be relevant in situations that could come about via some well-defined mechanism and whose resulting consequences would also be well defined. Code whose purpose would be to detect and clean up after failures may, in some cases, end up making things worse. If the compiler determines that the attempted recovery would in some cases invoke undefined behavior, it may infer that the conditions that would necessitate such recovery in such cases cannot possibly occur, thus eliminating the code that would have checked for them.
Answer by Dmitry Grigoryev
This is an extremely broad subject. Basically, you can't really recover from memory corruption, but you can at least try to fail promptly. Here are a few techniques you could use:
- Checksum constant data. If you have any configuration data which stays constant for a long time (including hardware registers you have configured), compute its checksum on initialization and verify it periodically. When you see a mismatch, it's time to re-initialize or reset.
- Store variables with redundancy. If you have an important variable x, write its value in x1, x2 and x3 and read it as (x1 == x2) ? x2 : x3.
- Implement program flow monitoring. XOR a global flag with a unique value in important functions/branches called from the main loop. Running the program in a radiation-free environment with near-100% test coverage should give you the list of acceptable values of the flag at the end of the cycle. Reset if you see deviations.
- Monitor the stack pointer. At the beginning of the main loop, compare the stack pointer with its expected value. Reset on deviation.
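The redundant-variable technique from the list above, written out as a small sketch. RedundantU32 is a hypothetical name; the read expression is the one from the answer, and the write-back on read is an added assumption so that a single bad copy gets repaired.

```cpp
#include <cstdint>

struct RedundantU32 {
    // volatile so that the compiler keeps three real copies in memory
    // instead of folding them into one.
    volatile std::uint32_t x1 = 0, x2 = 0, x3 = 0;

    void write(std::uint32_t v) { x1 = v; x2 = v; x3 = v; }

    // Reads the value as (x1 == x2) ? x2 : x3 and rewrites all three copies,
    // so that one corrupted copy does not linger until a second upset.
    std::uint32_t read() {
        const std::uint32_t a = x1, b = x2, c = x3;
        const std::uint32_t v = (a == b) ? b : c;
        write(v);
        return v;
    }
};
```

Like the original expression, this survives one corrupted copy but silently returns a wrong value if two copies are hit between reads, so important values should be read (and thereby refreshed) often.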
Answer by OldFrank
What could help you is a watchdog. Watchdogs were used extensively in industrial computing in the 1980s. Hardware failures were much more common then - another answer also refers to that period.
A watchdog is a combined hardware/software feature. The hardware is a simple counter that counts down from a number (say 1023) to zero. TTL or other logic could be used.
The software is designed so that one routine monitors the correct operation of all essential systems. If this routine completes correctly, that is, it finds the computer running fine, it sets the counter back to 1023.
The overall design is such that, under normal circumstances, the software prevents the hardware counter from reaching zero. If the counter does reach zero, the counter hardware performs its one-and-only task and resets the entire system. From the counter's perspective, zero equals 1024 and the counter continues counting down again.
This watchdog ensures that the attached computer is restarted in many, many cases of failure. I must admit that I'm not familiar with hardware that is able to perform such a function on today's computers. Interfaces to external hardware are now a lot more complex than they used to be.
An inherent disadvantage of the watchdog is that the system is not available from the time it fails until the watchdog counter reaches zero + reboot time. While that time is generally much shorter than any external or human intervention, the supported equipment will need to be able to proceed without computer control for that timeframe.
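The real watchdog is a hardware counter, but its behaviour is easy to model in software for testing the monitoring routine on a desk. The sketch below is such a model only; WatchdogCounter, kick and tick are hypothetical names, and the reload value stands in for the 1023 used in the description.

```cpp
#include <cstdint>

// Software model of a countdown watchdog, for simulation and unit tests.
class WatchdogCounter {
public:
    explicit WatchdogCounter(std::uint32_t reload)
        : reload_(reload), count_(reload) {}

    // Called by the monitoring routine when it finds the system healthy.
    void kick() { count_ = reload_; }

    // One clock tick of the counter. Returns true when the counter reaches
    // zero, which is the moment the hardware would reset the system; the
    // counter then starts over from the reload value.
    bool tick() {
        if (count_ > 0) --count_;
        if (count_ == 0) {
            count_ = reload_;
            return true;
        }
        return false;
    }

private:
    std::uint32_t reload_;
    std::uint32_t count_;
};
```

With a reload value of 3, two ticks followed by a kick never trigger a reset, while three ticks in a row without a kick do. The gap between a failure and the counter expiring is the unavailability window described above.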
Answer by jkflying
Since you specifically ask for software solutions, and you are using C++, why not use operator overloading to make your own, safe datatypes? For example:
Instead of using uint32_t (and double, int64_t etc.), make your own SAFE_uint32_t which contains multiple copies (a minimum of 3) of a uint32_t. Overload all of the operations you want to perform (* + - / << >> = == != etc.), and make the overloaded operations perform independently on each internal value, i.e. don't do it once and copy the result. Both before and after, check that all of the internal values match. If values don't match, you can update the wrong one to the most common value. If there is no most-common value, you can safely notify that there is an error.
This way it doesn't matter if corruption occurs in the ALU, registers, RAM, or on a bus: you will still have multiple attempts and a very good chance of catching errors. Note, however, that this only works for the variables you can replace; your stack pointer, for example, will still be susceptible.
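A minimal sketch of such a type, showing only operator+ and a voted read. SafeUint32, vote, get and flip_bits are hypothetical names; flip_bits exists purely for fault-injection experiments. Note that an optimizing compiler may merge the three additions into one, which is the concern raised in supercat's answer, so the generated code should be inspected.

```cpp
#include <cstdint>

class SafeUint32 {
public:
    SafeUint32(std::uint32_t v = 0) : a_(v), b_(v), c_(v) {}

    // Majority vote over the three copies. Repairs a single bad copy and
    // returns true; returns false if all three copies disagree.
    bool vote() {
        if (a_ == b_) { c_ = a_; return true; }
        if (a_ == c_) { b_ = a_; return true; }
        if (b_ == c_) { a_ = b_; return true; }
        return false;
    }

    // Voted read: writes the value to out and returns true, or returns
    // false (leaving out untouched) when there is no majority.
    bool get(std::uint32_t& out) {
        if (!vote()) return false;
        out = a_;
        return true;
    }

    // The addition is performed once per copy rather than computed once and
    // duplicated, so a fault during one addition affects one copy only.
    SafeUint32 operator+(SafeUint32 rhs) {
        vote();
        rhs.vote();
        SafeUint32 sum;
        sum.a_ = a_ + rhs.a_;
        sum.b_ = b_ + rhs.b_;
        sum.c_ = c_ + rhs.c_;
        return sum;
    }

    // Fault-injection hook for experiments only: flips bits in one copy
    // (0, 1, or anything else for the third copy).
    void flip_bits(int copy, std::uint32_t mask) {
        if (copy == 0) a_ ^= mask;
        else if (copy == 1) b_ ^= mask;
        else c_ ^= mask;
    }

private:
    std::uint32_t a_, b_, c_;
};
```

The remaining operators would follow the same shape as operator+. Reporting the "no majority" case through a return value rather than an exception keeps the type usable in code bases that avoid exceptions.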
A side story: I ran into a similar issue, also on an old ARM chip. It turned out to be a toolchain which used an old version of GCC that, together with the specific chip we used, triggered a bug in certain edge cases that would (sometimes) corrupt values being passed into functions. Make sure your device doesn't have any problems before blaming it on radioactivity, and yes, sometimes it is a compiler bug =)
Answer by abligh
This answer assumes you are concerned with having a system that works correctly, over and above having a system that is minimum cost or fast; most people playing with radioactive things value correctness / safety over speed / cost.
这个答案假设您关心的是拥有一个正常工作的系统,而不是拥有一个成本最低或速度快的系统;大多数玩放射性物品的人重视正确性/安全性而不是速度/成本
Several people have suggested hardware changes you can make (fine - there's lots of good stuff here in answers already and I don't intend repeating all of it), and others have suggested redundancy (great in principle), but I don't think anyone has suggested how that redundancy might work in practice. How do you fail over? How do you know when something has 'gone wrong'? Many technologies work on the basis that everything will work, and failure is thus a tricky thing to deal with. However, some distributed computing technologies designed for scale expect failure (after all, with enough scale, failure of one node of many is inevitable whatever the MTBF of a single node); you can harness this for your environment.
有几个人建议您可以进行硬件更改(很好 - 答案中已经有很多好东西,我不打算重复所有内容),其他人建议冗余(原则上很好),但我不认为任何人都建议这种冗余在实践中如何运作。你如何失败?你怎么知道什么时候“出错”了?许多技术在一切正常的基础上工作,因此失败是一件棘手的事情。然而,一些为规模设计的分布式计算技术会出现故障(毕竟规模足够大,单节点的任何MTBF都不可避免地会出现多个节点中的一个节点的故障);您可以将其用于您的环境。
Here are some ideas:
这里有一些想法:
Ensure that your entire hardware is replicated n times (where n is greater than 2, and preferably odd), and that each hardware element can communicate with each other hardware element. Ethernet is one obvious way to do that, but there are many other far simpler routes that would give better protection (e.g. CAN). Minimise common components (even power supplies). This may mean sampling ADC inputs in multiple places, for instance.
Ensure your application state is in a single place, e.g. in a finite state machine. This can be entirely RAM based, though that does not preclude stable storage. With replicated hardware, it will thus be stored in several places.
Adopt a quorum protocol for changes of state. See RAFT for example. As you are working in C++, there are well known libraries for this. Changes to the FSM would only get made when a majority of nodes agree. Use a known good library for the protocol stack and the quorum protocol rather than rolling one yourself, or all your good work on redundancy will be wasted when the quorum protocol hangs up.
Ensure you checksum (e.g. CRC/SHA) your FSM, and store the CRC/SHA in the FSM itself (as well as transmitting it in the message, and checksumming the messages themselves). Get the nodes to check their FSM regularly against these checksums, to checksum incoming messages, and to check that their checksum matches that of the quorum.
Build as many other internal checks into your system as possible, making nodes that detect their own failure reboot (this is better than carrying on half working provided you have enough nodes). Attempt to let them cleanly remove themselves from the quorum during rebooting in case they don't come up again. On reboot have them checksum the software image (and anything else they load) and do a full RAM test before reintroducing themselves to the quorum.
Use hardware to support you, but do so carefully. You can get ECC RAM, for instance, and regularly read/write through it to correct ECC errors (and panic if the error is uncorrectable). However (from memory) static RAM is far more tolerant of ionizing radiation than DRAM is in the first place, so it may be better to use static RAM instead. See the first point under 'things I would not do' as well.
确保您的整个硬件被复制
n
次数(其中n
大于 2,最好是奇数),并且每个硬件元素可以与其他硬件元素通信。以太网是一种显而易见的方法,但还有许多其他更简单的路由可以提供更好的保护(例如 CAN)。尽量减少常用组件(甚至电源)。例如,这可能意味着在多个位置对 ADC 输入进行采样。确保您的应用程序状态在一个地方,例如在有限状态机中。这可以完全基于 RAM,但不排除稳定存储。因此,它将存储在多个地方。
采用法定人数协议来更改状态。例如,参见RAFT。当您在 C++ 中工作时,有一些众所周知的库。只有在大多数节点同意时才会对 FSM 进行更改。为协议栈和仲裁协议使用一个已知好的库,而不是自己滚动一个,否则当仲裁协议挂掉时,你在冗余方面的所有工作都将被浪费掉。
确保对 FSM 进行校验和(例如 CRC/SHA),并将 CRC/SHA 存储在 FSM 本身中(以及在消息中传输,并对消息本身进行校验和)。让节点根据这些校验和定期检查它们的 FSM,校验传入消息,并检查它们的校验和是否与仲裁的校验和匹配。
将尽可能多的其他内部检查构建到您的系统中,使检测到自身故障的节点重新启动(如果您有足够的节点,这比进行一半的工作要好)。尝试让他们在重新启动期间从法定人数中干净地删除自己,以防他们再次出现。在重新启动时,让他们对软件映像(以及他们加载的任何其他内容)进行校验和,并在将自己重新引入法定人数之前进行完整的 RAM 测试。
使用硬件来支持你,但要小心。例如,您可以获得 ECC RAM,并通过它定期读/写以纠正 ECC 错误(如果错误无法纠正,则恐慌)。但是(从内存中)的静态RAM是电离辐射比DRAM是摆在首位,因此它的更宽容可能是更好的使用静态DRAM代替。请参阅“我不会做的事情”下的第一点。
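As a rough illustration of the "store the checksum in the FSM itself" point in the list above, here is a minimal sketch. The FsmState layout and the seal()/is_intact() helpers are hypothetical names invented for this example; the CRC is the common reflected CRC-32 (polynomial 0xEDB88320), computed bit by bit for clarity rather than speed. The quorum protocol itself is deliberately not sketched, in line with the advice to use an existing library for that.

```cpp
#include <cstddef>
#include <cstdint>

// Bitwise CRC-32 (reflected polynomial 0xEDB88320, initial value and final
// XOR of 0xFFFFFFFF).
inline std::uint32_t crc32(const std::uint8_t* data, std::size_t len) {
    std::uint32_t crc = 0xFFFFFFFFu;
    for (std::size_t i = 0; i < len; ++i) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; ++bit)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

// Hypothetical FSM state record; the fields are illustrative only.
struct FsmState {
    std::uint32_t state_id;
    std::uint32_t sequence;
    std::uint32_t setpoint;
    std::uint32_t crc;  // CRC-32 over the three fields above
};

// Serialise field by field (little-endian) so that struct padding and host
// byte order do not influence the checksum.
inline std::uint32_t compute_crc(const FsmState& s) {
    const std::uint32_t fields[3] = {s.state_id, s.sequence, s.setpoint};
    std::uint8_t buf[12];
    for (int i = 0; i < 3; ++i)
        for (int b = 0; b < 4; ++b)
            buf[4 * i + b] = static_cast<std::uint8_t>(fields[i] >> (8 * b));
    return crc32(buf, sizeof buf);
}

// Call after every legitimate state change.
inline void seal(FsmState& s) { s.crc = compute_crc(s); }

// Call periodically, and before acting on the state.
inline bool is_intact(const FsmState& s) { return s.crc == compute_crc(s); }
```

A node would call seal() after each agreed state change and is_intact() on a timer; a failed check is the kind of self-detected fault that, per the list above, should make the node leave the quorum and reboot.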
Let's say you have a 1% chance of failure of any given node within one day, and let's pretend you can make failures entirely independent. With 5 nodes, you'll need three to fail within one day to lose the quorum, which is roughly a 0.001% chance (about 1 in 100,000). With more, well, you get the idea.
假设您在一天内有 1% 的任何给定节点发生故障的可能性,并且假设您可以使故障完全独立。使用 5 个节点,您需要在一天内三个节点失败,这是 0.00001% 的机会。随着更多,你明白了。
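The arithmetic behind that estimate is a binomial tail, and can be checked with a few lines of code. This is only a sketch under the same assumption as the paragraph above (fully independent failures with a fixed per-day probability); prob_at_least is a hypothetical helper name.

```cpp
#include <cmath>

// Probability that at least k of n nodes fail, when each node fails
// independently with probability p.
inline double prob_at_least(int n, int k, double p) {
    double total = 0.0;
    for (int i = k; i <= n; ++i) {
        double comb = 1.0;  // binomial coefficient C(n, i)
        for (int j = 1; j <= i; ++j)
            comb = comb * (n - i + j) / j;
        total += comb * std::pow(p, i) * std::pow(1.0 - p, n - i);
    }
    return total;
}
```

For n = 5, k = 3 and p = 0.01 this gives about 9.85e-6, i.e. roughly 1 in 100,000 per day. For a 3-node cluster that needs 2 failures to lose its majority, the same function gives about 2.98e-4.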
Things I would not do:
我不会做的事情:
Underestimate the value of not having the problem to start off with. Unless weight is a concern, a large block of metal around your device is going to be a far cheaper and more reliable solution than a team of programmers can come up with. Ditto optical coupling of inputs if EMI is an issue, etc. Whatever you do, when sourcing your components, attempt to source those rated best against ionizing radiation.
Roll your own algorithms. People have done this stuff before. Use their work. Fault tolerance and distributed algorithms are hard. Use other people's work where possible.
Use complicated compiler settings in the naive hope you detect more failures. If you are lucky, you may detect more failures. More likely, you will use a code-path within the compiler which has been less tested, particularly if you rolled it yourself.
Use techniques which are untested in your environment. Most people writing high availability software have to simulate failure modes to check their HA works correctly, and miss many failure modes as a result. You are in the 'fortunate' position of having frequent failures on demand. So test each technique, and ensure its application actually improves MTBF by an amount that exceeds the complexity to introduce it (with complexity comes bugs). Especially apply this to my advice re quorum algorithms etc.
低估开始时没有问题的价值。除非考虑重量,否则设备周围的大块金属将是比程序员团队能想出的更便宜、更可靠的解决方案。同样,EMI 输入的光耦合也是一个问题,等等。无论如何,在采购您的组件时尝试采购那些抗电离辐射额定值最好的组件。
滚动你自己的算法。人们以前做过这种事情。使用他们的工作。容错和分布式算法很难。尽可能使用其他人的作品。
天真地使用复杂的编译器设置希望你能检测到更多的失败。如果幸运的话,您可能会检测到更多故障。更有可能的是,您将在编译器中使用经过较少测试的代码路径,特别是如果您自己滚动它。
使用在您的环境中未经测试的技术。大多数编写高可用性软件的人都必须模拟故障模式来检查他们的 HA 是否正常工作,结果会错过许多故障模式。您处于“幸运”的位置,需要时常发生故障。因此,测试每种技术,并确保其应用实际提高 MTBF 的量超过引入它的复杂性(复杂性会带来错误)。特别是将其应用于我的建议仲裁算法等。