C++: Is accessing data in the heap faster than from the stack?

Disclaimer: this page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same CC BY-SA terms and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/24057331/

Is accessing data in the heap faster than from the stack?

Tags: c++, c, performance, stack, heap

Asked by conectionist

I know this sounds like a general question and I've seen many similar questions (both here and on the web) but none of them are really like my dilemma.

Say I have this code:

void GetSomeData(char* buffer)
{
    // put some data in buffer
}

int main()
{
     char buffer[1024];
     while(1)
     {
          GetSomeData(buffer);
          // do something with the data
     }
     return 0;
}

Would I gain any performance if I declared buffer[1024] globally?

I ran some tests on unix via the time command and there are virtually no differences between the execution times.

But I'm not really convinced...

In theory should this change make a difference?

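A minimal sketch of a more targeted comparison than the external time command (std::chrono, the iteration count, and the dummy work inside GetSomeData are assumptions for illustration, not part of the original test):

#include <chrono>
#include <cstdio>

static char globalBuffer[1024];            // "global" variant

void GetSomeData(char* buffer)
{
    buffer[0] = 1;                         // stand-in for putting some data in the buffer
}

int main()
{
    char localBuffer[1024];                // "stack" variant
    auto start = std::chrono::steady_clock::now();
    for (long i = 0; i < 100000000; ++i)
    {
        GetSomeData(localBuffer);          // swap in globalBuffer to compare the two variants
    }
    auto stop = std::chrono::steady_clock::now();
    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count();
    std::printf("%lld ms\n", static_cast<long long>(ms));
    // Note: an optimizer may simplify this loop away; inspect the generated code
    // or access the buffer through a volatile pointer when measuring.
    return 0;
}
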
Accepted answer by Tony Delroy

Is accessing data in the heap faster than from the stack?

Not inherently... on every architecture I've ever worked on, all the process "memory" can be expected to operate at the same set of speeds, based on which level of CPU cache / RAM / swap file is holding the current data, and any hardware-level synchronisation delays that operations on that memory may trigger to make it visible to other processes, incorporate other processes'/CPU (core)'s changes etc..

The OS (which is responsible for page faulting / swapping), and the hardware (CPU) trapping on accesses to swapped-out or not-yet-accessed pages, would not even be tracking which pages are "stack" vs "heap"... a memory page is a memory page. That said, the virtual address of global data may be able to be calculated and hardcoded at compile time, the addresses of stack-based data are typically stack-pointer relative, while memory on the heap must almost always be accessed using pointers, which might be slightly slower on some systems - it depends on the CPU addressing modes and cycles, but it's almost always insignificant - not even worth a look or second thought unless you're writing something where millionths of a second are enormously important.

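A rough illustration of those addressing differences (the comments describe typical code generation only; the actual instructions depend on the compiler, optimization level, and architecture):

int globalValue;                       // address can be fixed at link/load time

int f(int* heapValue)                  // e.g. a pointer obtained from new or malloc
{
    int localValue = 2;                // typically a store at [stack pointer + constant offset]
    globalValue = 1;                   // typically a store to an absolute or PC-relative address
    *heapValue = 3;                    // typically: load the pointer first, then store through it
    return localValue;
}
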
Anyway, in your example you're contrasting a global variable with a function-local (stack/automatic) variable... there's no heap involved. Heap memory comes from new or malloc/realloc. For heap memory, the performance issue worth noting is that the application itself is keeping track of how much memory is in use at which addresses - the records of all that take some time to update as pointers to memory are handed out by new/malloc/realloc, and some more time to update as the pointers are deleted or freed.

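That bookkeeping cost shows up when allocation is repeated; a common mitigation, sketched here with a hypothetical processOnce function, is to reuse one heap allocation instead of allocating inside a loop:

#include <vector>

void processOnce(std::vector<char>& buffer);   // hypothetical worker, not defined here

void allocEveryIteration(int iterations)
{
    for (int i = 0; i < iterations; ++i)
    {
        std::vector<char> buffer(1024);        // heap allocation + bookkeeping every iteration...
        processOnce(buffer);
    }                                          // ...and a deallocation every iteration
}

void allocOnce(int iterations)
{
    std::vector<char> buffer(1024);            // one allocation, reused across iterations
    for (int i = 0; i < iterations; ++i)
    {
        processOnce(buffer);
    }
}
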
For global variables, the allocation of memory may effectively be done at compile time, while for stack based variables there's normally a stack pointer that's incremented by the compile-time-calculated sum of the sizes of local variables (and some housekeeping data) each time a function is called. So, when main() is called there may be some time to modify the stack pointer, but it's probably just being modified by a different amount rather than not modified if there's no buffer and modified if there is, so there's no difference in runtime performance at all.

Answered by haccks

Quoting from Jeff Hill's answer:

The stack is faster because the access pattern makes it trivial to allocate and deallocate memory from it (a pointer/integer is simply incremented or decremented), while the heap has much more complex bookkeeping involved in an allocation or free. Also, each byte in the stack tends to be reused very frequently which means it tends to be mapped to the processor's cache, making it very fast. Another performance hit for the heap is that the heap, being mostly a global resource, typically has to be multi-threading safe, i.e. each allocation and deallocation needs to be - typically - synchronized with "all" other heap accesses in the program.

(The quoted answer includes a chart at this point; it is not reproduced here.)

Answered by Madars Vi

There is a blog post available on this topic, stack-allocation-vs-heap-allocation-performance-benchmark, which shows a benchmark of the allocation strategies. The test is written in C and compares pure allocation attempts against allocation with memory initialization. For different total data sizes, a number of loops is performed and the time is measured. Each allocation consists of 10 different alloc/init/free blocks with different sizes (total size shown in the charts).

The tests were run on an Intel(R) Core(TM) i7-6600U CPU, Linux 64-bit, kernel 4.15.0-50-generic, with the Spectre and Meltdown patches disabled.

Without init: memory allocation without data init

With init: memory allocations with data init

In the results we see that there is a significant difference in pure allocations without data init. The stack is faster than the heap, but note that the loop count is extremely high.

When the allocated data is being processed, the gap between stack and heap performance seems to shrink. At 1M malloc/init/free (or stack alloc) loops with 10 allocation attempts per loop, the stack is only 8% ahead of the heap in terms of total time.

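A minimal sketch of the kind of loop body such a benchmark repeats (the 1024-byte size is a placeholder, not one of the blog's actual block sizes):

#include <cstdlib>
#include <cstring>

void heapIteration()
{
    char* p = static_cast<char*>(std::malloc(1024));   // heap variant: alloc...
    if (p != nullptr)
    {
        std::memset(p, 0, 1024);                       // ...init (the "with init" case)...
        std::free(p);                                  // ...free
    }
}

void stackIteration()
{
    char buffer[1024];                                 // stack variant: space is just part of the frame
    std::memset(buffer, 0, sizeof buffer);             // init (the "with init" case)
}
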
Answered by James Kanze

Your question doesn't really have an answer; it depends on what else you are doing. Generally speaking, most machines use the same "memory" structure over the entire process, so regardless of where (heap, stack or global memory) the variable resides, access time will be identical. On the other hand, most modern machines have a hierarchical memory structure, with a memory pipeline, several levels of cache, main memory, and virtual memory. Depending on what has gone on previously on the processor, the actual access may be to any one of these (regardless of whether it is heap, stack or global), and the access times here vary enormously, from a single clock if the memory is in the right place in the pipeline, to something around 10 milliseconds if the system has to go to virtual memory on disk.

In all cases, the key is locality. If an access is "near" a previous access, you greatly improve the chance of finding it in one of the faster locations: cache, for example. In this regard, putting smaller objects on the stack may be faster, because when you access the arguments of a function, you're accessing stack memory (with an Intel 32-bit processor, at least---with better designed processors, arguments are more likely to be in registers). But this will probably not be an issue when an array is involved.

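A sketch of how locality, rather than stack versus heap, tends to dominate: both functions below sum the same heap array, but the strided walk touches a different cache line on almost every access (the stride parameter is arbitrary):

#include <cstddef>

// Sequential walk: each access is adjacent to the previous one, so most reads
// hit cache lines that are already loaded.
long sumSequential(const int* data, std::size_t n)
{
    long sum = 0;
    for (std::size_t i = 0; i < n; ++i)
        sum += data[i];
    return sum;
}

// Strided walk: eventually touches the same elements, but consecutive accesses
// are far apart, so far more of them miss the cache.
long sumStrided(const int* data, std::size_t n, std::size_t stride)
{
    long sum = 0;
    for (std::size_t start = 0; start < stride; ++start)
        for (std::size_t i = start; i < n; i += stride)
            sum += data[i];
    return sum;
}
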
Answered by bobah

When allocating buffers on the stack, the optimization opportunity is not the cost of accessing the memory but rather the elimination of the often very expensive dynamic memory allocation on the heap (stack buffer allocation can be considered instantaneous, as the stack as a whole is allocated at thread startup).

Answered by Gumby The Green

For what it's worth, the loop in the code below - which just reads from and writes to each element in a big array - consistently runs 5x faster on my machine when the array is on the stack vs when it's on the heap (GCC, Windows 10, -O3 flag), even right after a reboot (when heap fragmentation is minimized):

#include <iostream>

const int size = 100100100;
int vals[size];                    // STACK (requires a large stack size; see note below)
// int *vals = new int[size];      // HEAP
startTimer();                      // timing helpers not shown in the original answer
for (int i = 1; i < size; ++i) {
    vals[i] = vals[i - 1];
}
stopTimer();
std::cout << vals[size - 1];       // keeps the compiler from optimizing everything away
// delete[] vals; // HEAP

Of course, I first had to increase the stack size to 400 MB. Note that the printing of the last element at the end is needed to keep the compiler from optimizing everything away.

Answered by SuperAgenten Johannes Schaeder

That variables and variable arrays allocated on the heap are slower is simply a fact. Think about it this way:

Globally created variables are allocated once and deallocated when the program closes. For a heap object, your variable has to be allocated on the spot each time the function is run, and deallocated at the end of the function.

Ever tried allocating an object pointer within a function? You had better free/delete it before the function exits, or else you will give yourself a memory leak (assuming you are not doing this in a class object where it is freed/deleted inside the destructor).

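A sketch of the class-based pattern the answer alludes to, where the destructor releases the allocation so the caller cannot forget to (a hand-rolled stand-in for what std::vector or std::unique_ptr already provide):

#include <cstddef>

class Buffer
{
public:
    explicit Buffer(std::size_t n) : data_(new char[n]) {}   // allocate in the constructor
    ~Buffer() { delete[] data_; }                            // released automatically on scope exit

    Buffer(const Buffer&) = delete;                          // forbid copies to avoid double-delete
    Buffer& operator=(const Buffer&) = delete;

    char* data() { return data_; }

private:
    char* data_;
};

void useBuffer()
{
    Buffer buf(1024);         // no manual delete needed; ~Buffer runs when buf goes out of scope
    buf.data()[0] = 'x';
}
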
When it comes to accessing an array, they all work the same: a memory block is first allocated with a size of sizeof(DataType) * elements. Later it can be accessed like this ->

1 2 3 4 5 6 
^ entry point [0]
      ^ entry point [0]+3
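
For example (a minimal sketch matching the diagram above):

#include <iostream>

int main()
{
    const int elements = 6;
    int* block = new int[elements];          // allocates sizeof(int) * elements bytes on the heap
    for (int i = 0; i < elements; ++i)
        block[i] = i + 1;                    // 1 2 3 4 5 6

    std::cout << *block << '\n';             // entry point [0]   -> 1
    std::cout << *(block + 3) << '\n';       // entry point [0]+3 -> 4

    delete[] block;
    return 0;
}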