C: Why is malloc+memset slower than calloc?

Question

Asked by kingkai

It's known that calloc is different from malloc in that it initializes the allocated memory. With calloc, the memory is set to zero. With malloc, the memory is not cleared.

So in everyday work, I regard calloc as malloc + memset. Incidentally, for fun, I wrote the following code as a benchmark.

The result is confusing.

Code 1:

#include<stdio.h>
#include<stdlib.h>
#define BLOCK_SIZE 1024*1024*256
int main()
{
        int i=0;
        char *buf[10];
        while(i<10)
        {
                buf[i] = (char*)calloc(1,BLOCK_SIZE);
                i++;
        }
}

Output of Code 1:

time ./a.out  
**real 0m0.287s**  
user 0m0.095s  
sys 0m0.192s

Code 2:

#include<stdio.h>
#include<stdlib.h>
#include<string.h>
#define BLOCK_SIZE 1024*1024*256
int main()
{
        int i=0;
        char *buf[10];
        while(i<10)
        {
                buf[i] = (char*)malloc(BLOCK_SIZE);
                memset(buf[i],'\0',BLOCK_SIZE);
                i++;
        }
}

Output of Code 2:

time ./a.out  
**real 0m2.693s**  
user 0m0.973s  
sys 0m1.721s

Replacing memset with bzero(buf[i],BLOCK_SIZE) in Code 2 produces the same result.

My question is: Why is malloc + memset so much slower than calloc? How can calloc do that?

Answer 1

Answered by Dietrich Epp

The short version: Always use calloc() instead of malloc()+memset(). In most cases, they will be the same. In some cases, calloc() will do less work because it can skip memset() entirely. In other cases, calloc() can even cheat and not allocate any memory! However, malloc()+memset() will always do the full amount of work.

Understanding this requires a short tour of the memory system.

Quick tour of memory

There are four main parts here: your program, the standard library, the kernel, and the page tables. You already know your program, so...

Memory allocators like malloc() and calloc() are mostly there to take small allocations (anything from 1 byte to 100s of KB) and group them into larger pools of memory. For example, if you allocate 16 bytes, malloc() will first try to get 16 bytes out of one of its pools, and then ask for more memory from the kernel when the pool runs dry. However, since the program you're asking about is allocating a large amount of memory at once, malloc() and calloc() will just ask for that memory directly from the kernel. The threshold for this behavior depends on your system, but I've seen 1 MiB used as the threshold.

The kernel is responsible for allocating actual RAM to each process and making sure that processes don't interfere with the memory of other processes. This is called memory protection; it has been dirt common since the 1990s, and it's the reason why one program can crash without bringing down the whole system. So when a program needs more memory, it can't just take the memory, but instead it asks for the memory from the kernel using a system call like mmap() or sbrk(). The kernel will give RAM to each process by modifying the page table.
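As a rough illustration of asking the kernel for memory directly, here is a minimal sketch that assumes a POSIX system with anonymous mappings (Linux, for example). The helper names request_pages() and release_pages() are made up for this example; a real allocator does considerably more bookkeeping.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical helper: asks the kernel for `len` bytes of fresh,
 * private memory that is not backed by any file.
 * Returns NULL if the kernel refuses the request. */
void *request_pages(size_t len)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* Hypothetical helper: hands the address range back to the kernel. */
void release_pages(void *p, size_t len)
{
    int rc = munmap(p, len);
    assert(rc == 0);
    (void)rc; /* silences the unused-variable warning when NDEBUG is set */
}
```

A caller would pair the two, for example `void *p = request_pages(1 << 20);` followed later by `release_pages(p, 1 << 20);`.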

The page table maps memory addresses to actual physical RAM. Your process's addresses, 0x00000000 to 0xFFFFFFFF on a 32-bit system, aren't real memory but instead are addresses in virtual memory. The processor divides these addresses into 4 KiB pages, and each page can be assigned to a different piece of physical RAM by modifying the page table. Only the kernel is permitted to modify the page table.
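To make the split into pages concrete, the sketch below breaks a virtual address into a page number and an offset within that page. It queries the page size at run time rather than assuming 4 KiB; the page_split type and split_address() are hypothetical names for this example only.

```c
#include <assert.h>
#include <stdint.h>
#include <unistd.h>

/* Hypothetical type: a virtual address seen as (page number, offset). */
typedef struct {
    uintptr_t page;   /* which page the address falls in */
    uintptr_t offset; /* byte position inside that page */
} page_split;

/* Splits a virtual address using the system page size
 * (commonly 4096 bytes, but queried here instead of assumed). */
page_split split_address(const void *addr)
{
    long page_size = sysconf(_SC_PAGESIZE);
    assert(page_size > 0);

    uintptr_t a = (uintptr_t)addr;
    page_split s = { a / (uintptr_t)page_size, a % (uintptr_t)page_size };
    return s;
}
```

The page table works at the granularity of the page field: two addresses with the same page number always land in the same piece of physical RAM.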

How it doesn't work

Here's how allocating 256 MiB does not work:

1. Your process calls calloc() and asks for 256 MiB.
2. The standard library calls mmap() and asks for 256 MiB.
3. The kernel finds 256 MiB of unused RAM and gives it to your process by modifying the page table.
4. The standard library zeroes the RAM with memset() and returns from calloc().
5. Your process eventually exits, and the kernel reclaims the RAM so it can be used by another process.

How it actually works

The above process would work, but it just doesn't happen this way. There are three major differences.

1. When your process gets new memory from the kernel, that memory was probably used by some other process previously. This is a security risk. What if that memory has passwords, encryption keys, or secret salsa recipes? To keep sensitive data from leaking, the kernel always scrubs memory before giving it to a process. We might as well scrub the memory by zeroing it, and if new memory is zeroed we might as well make it a guarantee, so mmap() guarantees that the new memory it returns is always zeroed.
2. There are a lot of programs out there that allocate memory but don't use the memory right away. Sometimes memory is allocated but never used. The kernel knows this and is lazy. When you allocate new memory, the kernel doesn't touch the page table at all and doesn't give any RAM to your process. Instead, it finds some address space in your process, makes a note of what is supposed to go there, and makes a promise that it will put RAM there if your program ever actually uses it. When your program tries to read or write from those addresses, the processor triggers a page fault and the kernel steps in, assigns RAM to those addresses, and resumes your program. If you never use the memory, the page fault never happens and your program never actually gets the RAM.
3. Some processes allocate memory and then read from it without modifying it. This means that a lot of pages in memory across different processes may be filled with pristine zeroes returned from mmap(). Since these pages are all the same, the kernel makes all these virtual addresses point to a single shared 4 KiB page of memory filled with zeroes. If you try to write to that memory, the processor triggers another page fault and the kernel steps in to give you a fresh page of zeroes that isn't shared with any other programs.

The final process looks more like this:

1. Your process calls calloc() and asks for 256 MiB.
2. The standard library calls mmap() and asks for 256 MiB.
3. The kernel finds 256 MiB of unused address space, makes a note about what that address space is now used for, and returns.
4. The standard library knows that the result of mmap() is always filled with zeroes (or will be once it actually gets some RAM), so it doesn't touch the memory, so there is no page fault, and the RAM is never given to your process.
5. Your process eventually exits, and the kernel doesn't need to reclaim the RAM because it was never allocated in the first place.
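The lazy behavior described above can be observed from user space. The following is a sketch only, assuming Linux and its mincore() call, which reports which pages of a mapping are currently resident in RAM. The function names are invented for this example, and the exact counts depend on the kernel configuration (transparent huge pages, for instance, can make more pages resident than were touched).

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: counts how many pages of a page-aligned region
 * are resident in RAM according to mincore(). Returns -1 on error. */
static long resident_pages(void *addr, size_t npages, size_t page_size)
{
    unsigned char *vec = malloc(npages);
    long count = -1;
    if (vec != NULL && mincore(addr, npages * page_size, vec) == 0) {
        count = 0;
        for (size_t i = 0; i < npages; i++)
            count += vec[i] & 1; /* low bit set: page is resident */
    }
    free(vec);
    return count;
}

/* Maps `npages` of anonymous memory, writes one byte to each of the
 * first `touched` pages, and reports the resident-page count before
 * and after the writes. Returns 0 on success, -1 on failure. */
int lazy_allocation_demo(size_t npages, size_t touched,
                         long *before, long *after)
{
    assert(touched <= npages);
    size_t page_size = (size_t)sysconf(_SC_PAGESIZE);
    unsigned char *p = mmap(NULL, npages * page_size,
                            PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    *before = resident_pages(p, npages, page_size);
    for (size_t i = 0; i < touched; i++)
        p[i * page_size] = 1; /* writing to an untouched page faults it in */
    *after = resident_pages(p, npages, page_size);

    munmap(p, npages * page_size);
    return 0;
}
```

On a typical Linux system one would expect `before` to be zero or close to it, and `after` to be at least `touched`: the RAM only shows up once the pages are written.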

If you use memset() to zero the page, memset() will trigger the page fault, cause the RAM to get allocated, and then zero it even though it is already filled with zeroes. This is an enormous amount of extra work, and explains why calloc() is faster than malloc() and memset(). If you end up using the memory anyway, calloc() is still faster than malloc() and memset(), but the difference is not quite so ridiculous.
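For readers who want to measure the gap themselves, here is a small timing sketch. It assumes a POSIX system with clock_gettime() and a build without optimization, since an optimizer may remove an unused allocation or merge malloc()+memset() into a single calloc() call. The function names are made up for this example, and memset() is called through a volatile function pointer to make that merging less likely.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Monotonic wall-clock time in seconds. */
static double now_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

/* Times one calloc() of `size` bytes; returns -1.0 if it fails. */
double time_calloc(size_t size)
{
    assert(size > 0);
    double start = now_seconds();
    void *p = calloc(1, size);
    double elapsed = now_seconds() - start;
    if (p == NULL)
        return -1.0;
    free(p);
    return elapsed;
}

/* Times malloc() followed by an explicit zeroing memset();
 * returns -1.0 if the allocation fails. */
double time_malloc_memset(size_t size)
{
    assert(size > 0);
    /* Volatile pointer: keeps the compiler from recognizing the
     * malloc+memset pair and rewriting it as calloc. */
    void *(*volatile fill)(void *, int, size_t) = memset;

    double start = now_seconds();
    void *p = malloc(size);
    if (p == NULL)
        return -1.0;
    fill(p, 0, size); /* touches every page, forcing real RAM */
    double elapsed = now_seconds() - start;
    free(p);
    return elapsed;
}
```

Calling both with a large size such as 256 MiB (the block size from the question) and printing the two results should show the difference on systems that allocate lazily; with small sizes the two numbers tend to be much closer.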

This doesn't always work

Not all systems have paged virtual memory, so not all systems can use these optimizations. This applies to very old processors like the 80286 as well as embedded processors which are just too small for a sophisticated memory management unit.

This also won't always work with smaller allocations. With smaller allocations, calloc() gets memory from a shared pool instead of going directly to the kernel. In general, the shared pool might have junk data stored in it from old memory that was used and freed with free(), so calloc() could take that memory and call memset() to clear it out. Common implementations will track which parts of the shared pool are pristine and still filled with zeroes, but not all implementations do this.
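The fallback path for small allocations looks roughly like the sketch below. This is not how any particular C library implements calloc(); it is a minimal stand-in (named sketch_calloc() to avoid clashing with the real function) that always clears the memory, which is what an allocator has to do when it cannot prove the block is still pristine.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Simplified stand-in for calloc(): allocate from the general pool via
 * malloc() and clear the block unconditionally. */
void *sketch_calloc(size_t count, size_t size)
{
    /* Reject requests whose total size would overflow size_t. */
    if (size != 0 && count > SIZE_MAX / size)
        return NULL;

    size_t total = count * size;
    unsigned char *p = malloc(total);
    if (p != NULL) {
        /* Recycled pool memory may hold junk from an earlier free(). */
        memset(p, 0, total);
        assert(total == 0 || p[total - 1] == 0); /* debug-build sanity check */
    }
    return p;
}
```

In this path the cost matches malloc()+memset(), which is why the large speedup only appears when the memory comes straight from the kernel.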

Dispelling some wrong answers

Depending on the operating system, the kernel may or may not zero memory in its free time, in case you need to get some zeroed memory later. Linux does not zero memory ahead of time, and Dragonfly BSD recently also removed this feature from their kernel. Some other kernels do zero memory ahead of time, however. Zeroing pages during idle isn't enough to explain the large performance differences anyway.

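The calloc() function is not using some special memory-aligned version of memset(), and that wouldn't make it much faster anyway. Most memset() implementations for modern processors look kind of like this:

function memset(dest, c, len)
    // one byte at a time, until the dest is aligned...
    while (len > 0 && ((unsigned int)dest & 15))
        *dest++ = c
        len -= 1
    // now write big chunks at a time (processor-specific)...
    // block size might not be 16, it's just pseudocode
    while (len >= 16)
        // some optimized vector code goes here
        // glibc uses SSE2 when available
        dest += 16
        len -= 16
    // the end is not aligned, so one byte at a time
    while (len > 0)
        *dest++ = c
        len -= 1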

So you can see, memset() is very fast and you're not really going to get anything better for large blocks of memory.
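The align, fill in big chunks, then finish the tail structure can be written as portable C. The version below is only a sketch: it uses 8-byte words instead of vector instructions, the name sketch_memset() is invented for this example, and real library implementations are tuned per processor.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Portable sketch of a chunked memset: bytes until aligned,
 * then 8-byte words, then the remaining tail bytes. */
void *sketch_memset(void *dest, int c, size_t len)
{
    assert(dest != NULL || len == 0);

    unsigned char *p = dest;
    unsigned char byte = (unsigned char)c;
    /* Replicate the byte into all eight lanes of a 64-bit word. */
    uint64_t word = 0x0101010101010101ULL * byte;

    /* One byte at a time until p is 8-byte aligned. */
    while (len > 0 && ((uintptr_t)p & 7)) {
        *p++ = byte;
        len--;
    }
    /* Whole words; memcpy of a fixed size sidesteps aliasing issues
     * and typically compiles to a single store. */
    while (len >= 8) {
        memcpy(p, &word, 8);
        p += 8;
        len -= 8;
    }
    /* The unaligned tail, one byte at a time. */
    while (len > 0) {
        *p++ = byte;
        len--;
    }
    return dest;
}
```

It is used exactly like memset(), for example `sketch_memset(buf, 0, sizeof buf);`.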

The fact that memset() is zeroing memory that is already zeroed does mean that the memory gets zeroed twice, but that only explains a 2x performance difference. The performance difference here is much larger (I measured more than three orders of magnitude on my system between malloc()+memset() and calloc()).

Party trick

Instead of looping 10 times, write a program that allocates memory until malloc() or calloc() returns NULL.

What happens if you add memset()?

Answer 2

Answered by Chris Lutz

Because on many systems, in spare processing time, the OS goes around setting free memory to zero on its own and marking it safe for calloc(), so when you call calloc(), it may already have free, zeroed memory to give you.

Answer 3

Answered by Stewart

On some platforms, in some modes, malloc initialises the memory to some typically non-zero value before returning it, so the second version could well initialize the memory twice.
