Disclaimer: this page reproduces a popular StackOverflow question and its answers under the CC BY-SA 4.0 license. If you use or share it, you must follow the same license and attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/161053/


Which is faster: Stack allocation or Heap allocation

c++, performance, memory, stack, heap

Asked by Adam

This question may sound fairly elementary, but this is a debate I had with another developer I work with.

I was taking care to stack allocate things where I could, instead of heap allocating them. He was talking to me, looking over my shoulder, and commented that it wasn't necessary because they are the same performance-wise.

I was always under the impression that growing the stack was constant time, and that heap allocation's performance depended on the current complexity of the heap for both allocation (finding a hole of the proper size) and deallocation (collapsing holes to reduce fragmentation, which, if I am not mistaken, many standard library implementations take time to do during deletes).

This strikes me as something that would probably be very compiler dependent. For this project in particular I am using a Metrowerks compiler for the PPC architecture. Insight on this combination would be most helpful, but in general, for GCC and MSVC++, what is the case? Is heap allocation not as high performing as stack allocation? Is there no difference? Or are the differences so minute that it becomes pointless micro-optimization?

Accepted answer by Torbjörn Gyllebring

Stack allocation is much faster since all it really does is move the stack pointer. Using memory pools, you can get comparable performance out of heap allocation, but that comes with a slight added complexity and its own headaches.

Also, stack vs. heap is not only a performance consideration; it also tells you a lot about the expected lifetime of objects.

Answered by Dan Lenski

Stack is much faster. It literally only uses a single instruction on most architectures, in most cases, e.g. on x86:

sub esp, 0x10

(That moves the stack pointer down by 0x10 bytes and thereby "allocates" those bytes for use by a variable.)

Of course, the stack's size is very, very finite, as you will quickly find out if you overuse stack allocation or try to do recursion :-)

Also, there's little reason to optimize the performance of code that doesn't verifiably need it, as demonstrated by profiling. "Premature optimization" often causes more problems than it's worth.

My rule of thumb: if I know I'm going to need some data at compile-time, and it's under a few hundred bytes in size, I stack-allocate it. Otherwise I heap-allocate it.

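That rule of thumb can be sketched as follows (the 256-byte size is an arbitrary stand-in for "a few hundred bytes"):

```cpp
#include <array>
#include <cstddef>
#include <vector>

std::size_t process()
{
    // Size known at compile time and small: automatic ("stack") storage.
    std::array<char, 256> small{};

    // Size only known at run time, or large: the elements live on the
    // free store ("heap"), managed by the vector.
    std::size_t n = std::size_t(1) << 20;
    std::vector<char> large(n);

    return small.size() + large.size();
}
```
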
Answered by Max Lybbert

Honestly, it's trivial to write a program to compare the performance:

#include <ctime>
#include <iostream>

namespace {
    class empty { }; // even empty classes take up 1 byte of space, minimum
}

int main()
{
    std::clock_t start = std::clock();
    for (int i = 0; i < 100000; ++i)
        empty e;
    std::clock_t duration = std::clock() - start;
    std::cout << "stack allocation took " << duration << " clock ticks\n";
    start = std::clock();
    for (int i = 0; i < 100000; ++i) {
        empty* e = new empty;
        delete e;
    };
    duration = std::clock() - start;
    std::cout << "heap allocation took " << duration << " clock ticks\n";
}

It's said that a foolish consistency is the hobgoblin of little minds. Apparently optimizing compilers are the hobgoblins of many programmers' minds. This discussion used to be at the bottom of the answer, but people apparently can't be bothered to read that far, so I'm moving it up here to avoid getting questions that I've already answered.

An optimizing compiler may notice that this code does nothing, and may optimize it all away. It is the optimizer's job to do stuff like that, and fighting the optimizer is a fool's errand.

I would recommend compiling this code with optimization turned off because there is no good way to fool every optimizer currently in use or that will be in use in the future.

Anybody who turns the optimizer on and then complains about fighting it should be subject to public ridicule.

If I cared about nanosecond precision I wouldn't use std::clock(). If I wanted to publish the results as a doctoral thesis I would make a bigger deal about this, and I would probably compare GCC, Tendra/Ten15, LLVM, Watcom, Borland, Visual C++, Digital Mars, ICC and other compilers. As it is, heap allocation takes hundreds of times longer than stack allocation, and I don't see anything useful about investigating the question any further.

The optimizer has a mission to get rid of the code I'm testing. I don't see any reason to tell the optimizer to run and then try to fool the optimizer into not actually optimizing. But if I saw value in doing that, I would do one or more of the following:

  1. Add a data member to empty, and access that data member in the loop; but if I only ever read from the data member the optimizer can do constant folding and remove the loop; if I only ever write to the data member, the optimizer may skip all but the very last iteration of the loop. Additionally, the question wasn't "stack allocation and data access vs. heap allocation and data access."

  2. Declare e volatile, but volatile is often compiled incorrectly (PDF).

  3. Take the address of e inside the loop (and maybe assign it to a variable that is declared extern and defined in another file). But even in this case, the compiler may notice that -- on the stack at least -- e will always be allocated at the same memory address, and then do constant folding like in (1) above. I get all iterations of the loop, but the object is never actually allocated.

Beyond the obvious, this test is flawed in that it measures both allocation and deallocation, and the original question didn't ask about deallocation. Of course variables allocated on the stack are automatically deallocated at the end of their scope, so not calling delete would (1) skew the numbers (stack deallocation is included in the numbers about stack allocation, so it's only fair to measure heap deallocation) and (2) cause a pretty bad memory leak, unless we keep a reference to the new pointer and call delete after we've got our time measurement.

On my machine, using g++ 3.4.4 on Windows, I get "0 clock ticks" for both stack and heap allocation for anything less than 100000 allocations, and even then I get "0 clock ticks" for stack allocation and "15 clock ticks" for heap allocation. When I measure 10,000,000 allocations, stack allocation takes 31 clock ticks and heap allocation takes 1562 clock ticks.

Yes, an optimizing compiler may elide creating the empty objects. If I understand correctly, it may even elide the whole first loop. When I bumped up the iterations to 10,000,000 stack allocation took 31 clock ticks and heap allocation took 1562 clock ticks. I think it's safe to say that without telling g++ to optimize the executable, g++ did not elide the constructors.

In the years since I wrote this, the preference on Stack Overflow has been to post performance from optimized builds. In general, I think this is correct. However, I still think it's silly to ask the compiler to optimize code when you in fact do not want that code optimized. It strikes me as being very similar to paying extra for valet parking, but refusing to hand over the keys. In this particular case, I don't want the optimizer running.

Using a slightly modified version of the benchmark (to address the valid point that the original program didn't allocate something on the stack each time through the loop) and compiling without optimizations but linking to release libraries (to address the valid point that we don't want to include any slowdown caused by linking to debug libraries):

#include <cstdio>
#include <chrono>

namespace {
    void on_stack()
    {
        int i;
    }

    void on_heap()
    {
        int* i = new int;
        delete i;
    }
}

int main()
{
    auto begin = std::chrono::system_clock::now();
    for (int i = 0; i < 1000000000; ++i)
        on_stack();
    auto end = std::chrono::system_clock::now();

    std::printf("on_stack took %f seconds\n", std::chrono::duration<double>(end - begin).count());

    begin = std::chrono::system_clock::now();
    for (int i = 0; i < 1000000000; ++i)
        on_heap();
    end = std::chrono::system_clock::now();

    std::printf("on_heap took %f seconds\n", std::chrono::duration<double>(end - begin).count());
    return 0;
}

displays:

on_stack took 2.070003 seconds
on_heap took 57.980081 seconds

on my system when compiled with the command line cl foo.cc /Od /MT /EHsc.

You may not agree with my approach to getting a non-optimized build. That's fine: feel free to modify the benchmark as much as you want. When I turn on optimization, I get:

on_stack took 0.000000 seconds
on_heap took 51.608723 seconds

Not because stack allocation is actually instantaneous but because any half-decent compiler can notice that on_stack doesn't do anything useful and can be optimized away. GCC on my Linux laptop also notices that on_heap doesn't do anything useful, and optimizes it away as well:

on_stack took 0.000003 seconds
on_heap took 0.000002 seconds

Answered by Furious Coder

An interesting thing I learned about Stack vs. Heap Allocation on the Xbox 360 Xenon processor, which may also apply to other multicore systems, is that allocating on the Heap causes a Critical Section to be entered to halt all other cores so that the alloc doesn't conflict. Thus, in a tight loop, Stack Allocation was the way to go for fixed sized arrays as it prevented stalls.

This may be another speedup to consider if you're coding for multicore/multiproc, in that your stack allocation will only be viewable by the core running your scoped function, and that will not affect any other cores/CPUs.

Answered by Chris Jester-Young

You can write a special heap allocator for specific sizes of objects that is very performant. However, the general heap allocator is not particularly performant.

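A minimal sketch of such a size-specific allocator: carve one big block into fixed-size chunks and keep a free list, so both allocate and deallocate are O(1) pointer pushes/pops (illustrative only; not thread-safe, and alignment beyond the chunk boundary is not handled):

```cpp
#include <cstddef>
#include <vector>

// Hands out fixed-size chunks carved from one preallocated block.
class FixedPool {
public:
    FixedPool(std::size_t chunk_size, std::size_t chunk_count)
        : storage_(chunk_size * chunk_count)
    {
        free_.reserve(chunk_count);
        for (std::size_t i = 0; i < chunk_count; ++i)
            free_.push_back(storage_.data() + i * chunk_size);
    }

    void* allocate()
    {
        if (free_.empty()) return nullptr; // pool exhausted
        void* p = free_.back();
        free_.pop_back();
        return p;
    }

    void deallocate(void* p)
    {
        free_.push_back(static_cast<char*>(p)); // O(1), no search, no merge
    }

private:
    std::vector<char> storage_; // one big backing block
    std::vector<char*> free_;   // stack of free chunks
};
```
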
Also I agree with Torbjörn Gyllebring about the expected lifetime of objects. Good point!

Answered by MarkR

I don't think stack allocation and heap allocation are generally interchangeable. I also hope that the performance of both of them is sufficient for general use.

For small items, I'd strongly recommend whichever is more suitable to the scope of the allocation. For large items, the heap is probably necessary.

On 32-bit operating systems that have multiple threads, the stack is often rather limited (albeit typically to at least a few MB), because the address space needs to be carved up and sooner or later one thread stack will run into another. On single-threaded systems (Linux glibc single-threaded, anyway) the limitation matters much less because the stack can just grow and grow.

On 64-bit operating systems there is enough address space to make thread stacks quite large.

Answered by Windows programmer

Usually stack allocation just consists of subtracting from the stack pointer register. This is tons faster than searching a heap.

Sometimes stack allocation requires adding a page(s) of virtual memory. Adding a new page of zeroed memory doesn't require reading a page from disk, so usually this is still going to be tons faster than searching a heap (especially if part of the heap was paged out too). In a rare situation, and you could construct such an example, enough space just happens to be available in part of the heap which is already in RAM, but allocating a new page for the stack has to wait for some other page to get written out to disk. In that rare situation, the heap is faster.

Answered by Jay

Aside from the orders-of-magnitude performance advantage over heap allocation, stack allocation is preferable for long running server applications. Even the best managed heaps eventually get so fragmented that application performance degrades.

Answered by yogman

A stack has a limited capacity, while a heap does not. The typical stack for a process or thread is around 8K. You cannot change the size once it's allocated.

A stack variable follows the scoping rules, while a heap one doesn't. If your instruction pointer goes beyond a function, all the new variables associated with the function go away.

Most important of all, you can't predict the overall function call chain in advance. So a mere 200 bytes allocation on your part may raise a stack overflow. This is especially important if you're writing a library, not an application.

Answered by FrankHB

Concerns Specific to the C++ Language

First of all, there is no so-called "stack" or "heap" allocation mandated by C++. If you are talking about automatic objects in block scopes, they are not even "allocated". (BTW, automatic storage duration in C is definitely NOT the same as "allocated"; the latter is "dynamic" in C++ parlance.) Dynamically allocated memory is on the free store, not necessarily on "the heap", though the latter is often the (default) implementation.

Although as per the abstract machine semantic rules, automatic objects still occupy memory, a conforming C++ implementation is allowed to ignore this fact when it can prove this does not matter (when it does not change the observable behavior of the program). This permission is granted by the as-if rule in ISO C++, which is also the general clause enabling the usual optimizations (and there is an almost identical rule in ISO C). Besides the as-if rule, ISO C++ also has copy elision rules to allow omission of specific creations of objects. The constructor and destructor calls involved are thereby omitted. As a result, the automatic objects (if any) in these constructors and destructors are also eliminated, compared to naive abstract semantics implied by the source code.

On the other hand, free store allocation is definitely "allocation" by design. Under ISO C++ rules, such an allocation can be achieved by a call of an allocation function. However, since ISO C++14, there is a new (non-as-if) rule to allow merging global allocation function (i.e. ::operator new) calls in specific cases. So parts of dynamic allocation operations can also be no-ops, like the case of automatic objects.

Allocation functions allocate resources of memory. Objects can be further allocated based on allocation, using allocators. For automatic objects, they are directly presented - although the underlying memory can be accessed and used to provide memory to other objects (by placement new), this does not make great sense as a free store, because there is no way to move the resources elsewhere.

All other concerns are out of the scope of C++. Nevertheless, they can be still significant.

About Implementations of C++

C++ does not expose reified activation records or any sort of first-class continuation (e.g. the famous call/cc), so there is no way to directly manipulate the activation record frames where the implementation needs to place automatic objects. Once there are no (non-portable) interoperations with the underlying implementation ("native" non-portable code, such as inline assembly code), omitting the underlying allocation of the frames can be quite trivial. For example, when the called function is inlined, the frames can be effectively merged into others, so there is no way to show what the "allocation" is.

However, once interops are respected, things get complex. A typical implementation of C++ will expose the ability of interop on the ISA (instruction-set architecture) with some calling conventions as the binary boundary shared with the native (ISA-level machine) code. This can be explicitly costly, notably when maintaining the stack pointer, which is often directly held by an ISA-level register (probably with specific machine instructions to access it). The stack pointer indicates the boundary of the top frame of the (currently active) function call. When a function call is entered, a new frame is needed, and the stack pointer is added to or subtracted from (depending on the convention of the ISA) by a value not less than the required frame size. The frame is then said to be allocated once the stack pointer has been adjusted. Parameters of functions may be passed on the stack frame as well, depending on the calling convention used for the call. The frame can hold the memory of automatic objects (probably including the parameters) specified by the C++ source code. In the sense of such implementations, these objects are "allocated". When control exits the function call, the frame is no longer needed; it is usually released by restoring the stack pointer back to the state before the call (saved previously according to the calling convention). This can be viewed as "deallocation". These operations make the activation records effectively a LIFO data structure, so it is often called "the (call) stack". The stack pointer effectively indicates the top position of the stack.

Because most C++ implementations (particularly the ones targeting ISA-level native code and using the assembly language as their immediate output) use similar strategies like this, such a confusing "allocation" scheme is popular. Such allocations (as well as deallocations) do spend machine cycles, and they can be expensive when the (non-optimized) calls occur frequently, even though modern CPU microarchitectures can have complex optimizations implemented by hardware for the common code pattern (like using a stack engine in implementing PUSH/POP instructions).

But anyway, in general, it is true that the cost of stack frame allocation is significantly less than that of a call to an allocation function operating on the free store (unless it is totally optimized away), which itself can involve hundreds of (if not millions of :-) operations to maintain the stack pointer and other states. Allocation functions are typically based on an API provided by the hosted environment (e.g. the runtime provided by the OS). Different from the purpose of holding automatic objects for function calls, such allocations are general-purpose, so they will not have a frame structure like a stack. Traditionally, they allocate space from the pool storage called the heap (or several heaps). Different from the "stack", the concept "heap" here does not indicate the data structure being used; it is derived from early language implementations decades ago. (BTW, the call stack is usually allocated with a fixed or user-specified size from the heap by the environment at program or thread startup.) The nature of the use cases makes allocations and deallocations from a heap far more complicated (than pushing or popping stack frames), and hardly possible to be directly optimized by hardware.

Effects on Memory Access

对内存访问的影响

The usual stack allocation always puts the new frame on the top, so it has quite good locality. This is friendly to the cache. OTOH, memory allocated randomly in the free store has no such property. Since ISO C++17, there are pool resource templates provided by <memory_resource>. The direct purpose of such an interface is to allow the results of consecutive allocations to be close together in memory. This acknowledges the fact that this strategy is generally good for performance with contemporary implementations, e.g. being friendly to the cache in modern architectures. This is about the performance of access rather than allocation, though.

Concurrency

Expectation of concurrent access to memory can have different effects between the stack and heaps. A call stack is usually exclusively owned by one thread of execution in a C++ implementation. OTOH, heaps are often shared among the threads in a process. For such heaps, the allocation and deallocation functions have to protect the shared internal administrative data structure from data races. As a result, heap allocations and deallocations may have additional overhead due to internal synchronization operations.

Space Efficiency

Due to the nature of the use cases and internal data structures, heaps may suffer from internal memory fragmentation, while the stack does not. This does not have direct impacts on the performance of memory allocation, but in a system with virtual memory, low space efficiency may degenerate overall performance of memory access. This is particularly awful when HDD is used as a swap of physical memory. It can cause quite long latency - sometimes billions of cycles.

Limitations of Stack Allocations

Although stack allocations are often superior in performance than heap allocations in reality, it certainly does not mean stack allocations can always replace heap allocations.

First, there is no way to allocate space on the stack with a size specified at runtime in a portable way with ISO C++. There are extensions provided by implementations like alloca and G++'s VLA (variable-length array), but there are reasons to avoid them. (IIRC, the Linux source removed uses of VLA recently.) (Also note that ISO C99 does mandate VLA, but ISO C11 makes the support optional.)

Second, there is no reliable and portable way to detect stack space exhaustion. This is often called stack overflow (hmm, the etymology of this site), but probably more accurately, stack overrun. In reality, this often causes invalid memory access, and the state of the program is then corrupted (... or maybe worse, a security hole). In fact, ISO C++ has no concept of "the stack" and makes it undefined behavior when the resource is exhausted. Be cautious about how much room should be left for automatic objects.

If the stack space runs out, there are too many objects allocated in the stack, which can be caused by too many active calls of functions or improper use of automatic objects. Such cases may suggest the existence of bugs, e.g. a recursive function call without correct exit conditions.

Nevertheless, deep recursive calls are sometimes desired. In implementations of languages requiring support of unbound active calls (where the call depth is limited only by total memory), it is impossible to use the (contemporary) native call stack directly as the target-language activation record, as typical C++ implementations do. To work around the problem, alternative ways of constructing activation records are needed. For example, SML/NJ explicitly allocates frames on the heap and uses cactus stacks. The complicated allocation of such activation record frames is usually not as fast as call stack frames. However, if such languages are implemented further with the guarantee of proper tail recursion, direct stack allocation in the object language (that is, an "object" in the language is not stored as a reference, but as a native primitive value that can be mapped one-to-one to an unshared C++ object) is even more complicated, with more performance penalty in general. When using C++ to implement such languages, it is difficult to estimate the performance impacts.
