C: Faster way to zero memory than with memset?

Disclaimer: this page is a mirror of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/3654905/

Date: 2020-09-02 06:23:21  Source: igfitidea

Faster way to zero memory than with memset?

cstd

Asked by maep

I learned that memset(ptr, 0, nbytes) is really fast, but is there a faster way (at least on x86)?

I assume that memset uses mov; however, when zeroing memory most compilers use xor as it's faster, correct? edit1: Wrong, as GregS pointed out, that only works with registers. What was I thinking?

I also asked a person who knows assembler better than I do to look at the stdlib, and he told me that on x86, memset does not take full advantage of the 32-bit wide registers. However, I was very tired at the time, so I'm not quite sure I understood it correctly.

edit2: I revisited this issue and did a little testing. Here is what I tested:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/time.h>

    #define TIME(body) do {                                                     \
        struct timeval t1, t2; double elapsed;                                  \
        gettimeofday(&t1, NULL);                                                \
        body                                                                    \
        gettimeofday(&t2, NULL);                                                \
        elapsed = (t2.tv_sec - t1.tv_sec) * 1000.0 + (t2.tv_usec - t1.tv_usec) / 1000.0; \
        printf("%s\n --- %f ---\n", #body, elapsed); } while(0)


    #define SIZE 0x1000000

    void zero_1(void* buff, size_t size)
    {
        size_t i;
        char* foo = buff;
        for (i = 0; i < size; i++)
            foo[i] = 0;

    }

    /* I foolishly assume size_t has register width */
    void zero_sizet(void* buff, size_t size)
    {
        size_t i;
        char* bar;
        size_t* foo = buff;
        for (i = 0; i < size / sizeof(size_t); i++)
            foo[i] = 0;

        // fixes bug pointed out by tristopia
        bar = (char*)buff + size - size % sizeof(size_t);
        for (i = 0; i < size % sizeof(size_t); i++)
            bar[i] = 0;
    }

    int main()
    {
        char* buffer = malloc(SIZE);
        TIME(
            memset(buffer, 0, SIZE);
        );
        TIME(
            zero_1(buffer, SIZE);
        );
        TIME(
            zero_sizet(buffer, SIZE);
        );
        free(buffer);
        return 0;
    }

results:

zero_1 is the slowest, except at -O3. zero_sizet is the fastest, with roughly equal performance across -O1, -O2 and -O3. memset was always slower than zero_sizet (twice as slow at -O3). One interesting thing: at -O3, zero_1 was just as fast as zero_sizet, yet the disassembled function had roughly four times as many instructions (I think caused by loop unrolling). Also, I tried optimizing zero_sizet further, but the compiler always outdid me, which is no surprise.

For now memset wins; the previous results were distorted by the CPU cache. (All tests were run on Linux.) Further testing is needed. I'll try assembler next :)


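One way to factor out the cache distortion mentioned above (not part of the original test) is to run an untimed warm-up pass and then average several timed repetitions, so that first-touch page faults and cold caches don't dominate the first measurement. A minimal sketch along the lines of the TIME macro, with a hypothetical helper name:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>

/* Hypothetical helper, not from the original post: time a zeroing
 * function after one untimed warm-up pass, averaging over several
 * repetitions so page-fault/cache warm-up doesn't skew the result. */
static double time_zeroing_ms(void (*fn)(void *, size_t),
                              void *buf, size_t size, int reps)
{
    struct timeval t1, t2;
    int i;

    fn(buf, size);                      /* warm-up: fault pages in, warm caches */
    gettimeofday(&t1, NULL);
    for (i = 0; i < reps; i++)
        fn(buf, size);
    gettimeofday(&t2, NULL);
    return ((t2.tv_sec - t1.tv_sec) * 1000.0 +
            (t2.tv_usec - t1.tv_usec) / 1000.0) / reps;   /* ms per call */
}

static void zero_memset(void *buf, size_t size)
{
    memset(buf, 0, size);
}
```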
edit3: fixed a bug in the test code; the test results are not affected

edit4: While poking around the disassembled VS2010 C runtime, I noticed that memset has an SSE-optimized routine for zeroing. This will be hard to beat.

Accepted answer by Tim

x86 covers a rather broad range of devices.

For a totally generic x86 target, an assembly block with "rep stosd" could blast out zeros to memory 32 bits at a time. Try to make sure the bulk of this work is DWORD-aligned.

For chips with MMX, an assembly loop with movq could hit 64 bits at a time.

You might be able to get a C/C++ compiler to use a 64-bit write via a pointer to a long long or __m64. The target must be 8-byte aligned for best performance.

For chips with SSE, movaps is fast, but only if the address is 16-byte aligned, so use movsb until aligned, and then complete your clear with a loop of movaps.


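In C, that "byte stores until aligned, then aligned wide stores" pattern can be sketched with SSE2 intrinsics instead of hand-written movsb/movaps. This is an illustration, not the answerer's code; _mm_store_si128 compiles to an aligned 16-byte store, so the prologue must reach 16-byte alignment first:

```c
#include <emmintrin.h>   /* SSE2 intrinsics, baseline on x86-64 */
#include <stdint.h>
#include <string.h>

/* Sketch: single-byte stores until the pointer is 16-byte aligned,
 * aligned 16-byte stores for the bulk, byte stores for the tail. */
static void zero_sse2(void *buff, size_t size)
{
    unsigned char *p = buff;
    __m128i zero = _mm_setzero_si128();

    while (size && ((uintptr_t)p & 15)) {   /* head: reach alignment */
        *p++ = 0;
        size--;
    }
    while (size >= 16) {                    /* bulk: aligned 16-byte stores */
        _mm_store_si128((__m128i *)p, zero);
        p += 16;
        size -= 16;
    }
    while (size--)                          /* tail */
        *p++ = 0;
}
```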
Win32 has "ZeroMemory()", but I forget whether that's a macro for memset or an actual 'good' implementation.

Answer by Ben Zotto

memset is generally designed to be very, very fast general-purpose setting/zeroing code. It handles all cases of different sizes and alignments, which affect the kinds of instructions you can use to do your work. Depending on what system you're on (and what vendor your stdlib comes from), the underlying implementation might be in assembler specific to that architecture to take advantage of whatever its native properties are. It might also have internal special cases to handle zeroing (versus setting some other value).

That said, if you have very specific, very performance-critical memory zeroing to do, it's certainly possible that you could beat a specific memset implementation by doing it yourself. memset and its friends in the standard library are always fun targets for one-upmanship programming. :)

Answer by Jens Gustedt

Nowadays your compiler should do all the work for you. At least as far as I know, gcc is very efficient at optimizing calls to memset away (better check the assembler, though).

Also, avoid memset if you don't have to use it:

  • use calloc for heap memory
  • use proper initialization (... = { 0 }) for stack memory

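Both points fit in a couple of lines (struct and function names here are illustrative):

```c
#include <stdlib.h>

struct point { double x, y; int tag; };

/* Heap: calloc hands back zero-filled memory, so no memset is needed. */
static struct point *make_points(size_t n)
{
    return calloc(n, sizeof(struct point));
}

/* Stack: a = { 0 } initializer zero-initializes every member. */
static struct point origin(void)
{
    struct point p = { 0 };
    return p;
}
```

(Strictly speaking, calloc gives all-bits-zero, which equals 0.0 for floating-point members on every mainstream platform, though the C standard does not guarantee that equivalence.)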
And for really large chunks, use mmap if you have it. This gets zero-initialized memory from the system "for free".


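On POSIX systems a sketch might look like this (MAP_ANONYMOUS is a widely supported extension; anonymous private pages arrive from the kernel already zero-filled):

```c
#define _GNU_SOURCE
#include <sys/mman.h>
#include <stddef.h>

/* Sketch: request zero-filled anonymous pages directly from the
 * kernel, so no memset is needed after a successful mmap. */
static void *alloc_zeroed(size_t size)
{
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return (p == MAP_FAILED) ? NULL : p;
}
```

Release the region with munmap(p, size) when done.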
Answer by Sparky

If I remember correctly (from a couple of years ago), one of the senior developers was talking about a fast way to bzero() on PowerPC (specs said we needed to zero almost all the memory on power up). It might not translate well (if at all) to x86, but it could be worth exploring.

The idea was to load a data cache line, clear that data cache line, and then write the cleared data cache line back to memory.

For what it is worth, I hope it helps.

Answer by snemarch

Unless you have specific needs or know that your compiler/stdlib is sucky, stick with memset. It's general-purpose, and should have decent performance in general. Also, compilers might have an easier time optimizing/inlining memset() because it can have intrinsic support for it.

For instance, Visual C++ will often generate inline versions of memcpy/memset that are as small as a call to the library function, thus avoiding push/call/ret overhead. And there are further possible optimizations when the size parameter can be evaluated at compile time.

That said, if you have specific needs (where the size will always be tiny or huge), you can gain speed boosts by dropping down to assembly level. For instance, using write-through operations for zeroing huge chunks of memory without polluting your L2 cache.


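One way to express those cache-bypassing stores in C is with SSE2 streaming (non-temporal) intrinsics. This is a sketch, not the answerer's code, and it assumes the pointer is 16-byte aligned and the size is a multiple of 16:

```c
#include <emmintrin.h>
#include <stdlib.h>
#include <string.h>

/* Sketch: _mm_stream_si128 writes around the cache hierarchy, so
 * clearing a huge buffer doesn't evict hot data from L2.
 * Assumes 'buff' is 16-byte aligned and 'size' is a multiple of 16. */
static void zero_stream(void *buff, size_t size)
{
    __m128i zero = _mm_setzero_si128();
    unsigned char *p = buff;
    size_t i;

    for (i = 0; i < size; i += 16)
        _mm_stream_si128((__m128i *)(p + i), zero);
    _mm_sfence();   /* order the streaming stores before later accesses */
}
```

Non-temporal stores only pay off for buffers much larger than the cache; for small buffers they are usually slower than plain memset.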
But it all depends - and for normal stuff, please stick to memset/memcpy :)

Answer by Chris

There is one fatal flaw in this otherwise great and helpful test: since memset is the first timed call, it pays a warm-up cost (most likely the first-touch page faults on the freshly allocated buffer) that makes it look extremely slow. Moving memset's timing to second place and something else to first place, or simply timing memset twice, makes memset the fastest with all compile switches!

Answer by Chris

That's an interesting question. I made this implementation, which is just slightly faster (but hardly measurably so) when compiling a 32-bit release build on VC++ 2012. It can probably be improved a lot. Adding this in your own class in a multithreaded environment would probably give you even more performance gains, since there are some reported bottleneck problems with memset() in multithreaded scenarios.

// MemsetSpeedTest.cpp : Defines the entry point for the console application.
//

#include "stdafx.h"
#include <iostream>
#include "Windows.h"
#include <time.h>

#pragma comment(lib, "Winmm.lib") 
using namespace std;

/** a signed 64-bit integer value type */
#define _INT64 __int64

/** a signed 32-bit integer value type */
#define _INT32 __int32

/** a signed 16-bit integer value type */
#define _INT16 __int16

/** a signed 8-bit integer value type */
#define _INT8 __int8

/** an unsigned 64-bit integer value type */
#define _UINT64 unsigned _INT64

/** an unsigned 32-bit integer value type */
#define _UINT32 unsigned _INT32

/** an unsigned 16-bit integer value type */
#define _UINT16 unsigned _INT16

/** an unsigned 8-bit integer value type */
#define _UINT8 unsigned _INT8

/** maximum allowed value in an unsigned 64-bit integer value type */
#define _UINT64_MAX 18446744073709551615ULL

#ifdef _WIN32

/** Use to init the clock */
#define TIMER_INIT LARGE_INTEGER frequency;LARGE_INTEGER t1, t2;double elapsedTime;QueryPerformanceFrequency(&frequency);

/** Use to start the performance timer */
#define TIMER_START QueryPerformanceCounter(&t1);

/** Use to stop the performance timer and output the result to the standard stream. Less verbose than \c TIMER_STOP_VERBOSE */
#define TIMER_STOP QueryPerformanceCounter(&t2);elapsedTime=(t2.QuadPart-t1.QuadPart)*1000.0/frequency.QuadPart;wcout<<elapsedTime<<L" ms."<<endl;
#else
/** Use to init the clock */
#define TIMER_INIT clock_t start;double diff;

/** Use to start the performance timer */
#define TIMER_START start=clock();

/** Use to stop the performance timer and output the result to the standard stream. Less verbose than \c TIMER_STOP_VERBOSE */
#define TIMER_STOP diff=(clock()-start)/(double)CLOCKS_PER_SEC;wcout<<fixed<<diff<<endl;
#endif    


void *MemSet(void *dest, _UINT8 c, size_t count)
{
    size_t blockIdx;
    size_t blocks = count >> 3;
    size_t bytesLeft = count - (blocks << 3);
    _UINT64 cUll = 
        c 
        | (((_UINT64)c) << 8 )
        | (((_UINT64)c) << 16 )
        | (((_UINT64)c) << 24 )
        | (((_UINT64)c) << 32 )
        | (((_UINT64)c) << 40 )
        | (((_UINT64)c) << 48 )
        | (((_UINT64)c) << 56 );

    _UINT64 *destPtr8 = (_UINT64*)dest;
    for (blockIdx = 0; blockIdx < blocks; blockIdx++) destPtr8[blockIdx] = cUll;

    if (!bytesLeft) return dest;

    blocks = bytesLeft >> 2;
    bytesLeft = bytesLeft - (blocks << 2);

    _UINT32 *destPtr4 = (_UINT32*)&destPtr8[blockIdx];
    for (blockIdx = 0; blockIdx < blocks; blockIdx++) destPtr4[blockIdx] = (_UINT32)cUll;

    if (!bytesLeft) return dest;

    blocks = bytesLeft >> 1;
    bytesLeft = bytesLeft - (blocks << 1);

    _UINT16 *destPtr2 = (_UINT16*)&destPtr4[blockIdx];
    for (blockIdx = 0; blockIdx < blocks; blockIdx++) destPtr2[blockIdx] = (_UINT16)cUll;

    if (!bytesLeft) return dest;

    _UINT8 *destPtr1 = (_UINT8*)&destPtr2[blockIdx];
    for (blockIdx = 0; blockIdx < bytesLeft; blockIdx++) destPtr1[blockIdx] = (_UINT8)cUll;

    return dest;
}

int _tmain(int argc, _TCHAR* argv[])
{
    TIMER_INIT

    const size_t n = 10000000;
    const _UINT64 m = 1;          // timing-loop repetition counts
    const _UINT64 o = 1;
    char* test = new char[n];     // 10 MB is far too large for the stack
    {
        cout << "memset() took:" << endl;
        TIMER_START;

        for (int i = 0; i < m ; i++)
            for (int j = 0; j < o ; j++)
                memset((void*)test, 0, n);  

        TIMER_STOP;
    }
    {
        cout << "MemSet() took:" << endl;
        TIMER_START;

        for (int i = 0; i < m ; i++)
            for (int j = 0; j < o ; j++)
                MemSet((void*)test, 0, n);

        TIMER_STOP;
    }

    cout << "Done" << endl;
    int wait;
    cin >> wait;
    return 0;
}

Output is as follows when release compiling for 32-bit systems:

memset() took:
5.569000
MemSet() took:
5.544000
Done

Output is as follows when release compiling for 64-bit systems:

memset() took:
2.781000
MemSet() took:
2.765000
Done

Here you can find the source code of Berkeley's memset(), which I think is the most common implementation.

Answer by bta

The memset function is designed to be flexible and simple, even at the expense of speed. In many implementations, it is a simple while loop that copies the specified value one byte at a time over the given number of bytes. If you want a faster memset (or memcpy, memmove, etc.), it is almost always possible to code one up yourself.

The simplest customization would be to do single-byte "set" operations until the destination address is 32- or 64-bit aligned (whatever matches your chip's architecture) and then start copying a full CPU register at a time. You may have to do a couple of single-byte "set" operations at the end if your range doesn't end on an aligned address.


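A sketch of that recipe in C, with head bytes until alignment, register-width stores for the bulk, and tail bytes at the end (aliasing caveats aside, this mirrors what hand-rolled memsets typically do):

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Sketch: byte stores until the pointer is 8-byte aligned, one
 * register-width store per iteration for the bulk, byte stores for
 * whatever remains past the last aligned word. */
static void zero_aligned_words(void *dest, size_t n)
{
    unsigned char *p = dest;

    while (n && ((uintptr_t)p & (sizeof(uint64_t) - 1))) {
        *p++ = 0;                   /* head: reach 8-byte alignment */
        n--;
    }
    while (n >= sizeof(uint64_t)) {
        *(uint64_t *)p = 0;         /* bulk: full-register store */
        p += sizeof(uint64_t);
        n -= sizeof(uint64_t);
    }
    while (n--)
        *p++ = 0;                   /* tail */
}
```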
Depending on your particular CPU, you might also have some streaming SIMD instructions that can help you out. These will typically work better on aligned addresses, so the above technique for using aligned addresses can be useful here as well.

For zeroing out large sections of memory, you may also see a speed boost by splitting the range into sections and processing each section in parallel (where the number of sections matches your number of cores/hardware threads).


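As a sketch of that idea with POSIX threads (thread count and slicing scheme are illustrative, and the gains stop once memory bandwidth saturates):

```c
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

/* Each worker zeroes its own slice of the buffer. */
struct slice { unsigned char *base; size_t len; };

static void *zero_slice(void *arg)
{
    struct slice *s = arg;
    memset(s->base, 0, s->len);
    return NULL;
}

static void zero_parallel(void *buf, size_t size, int nthreads)
{
    pthread_t tid[16];
    struct slice sl[16];
    size_t chunk;
    int i;

    if (nthreads < 1)  nthreads = 1;
    if (nthreads > 16) nthreads = 16;
    chunk = size / nthreads;

    for (i = 0; i < nthreads; i++) {
        sl[i].base = (unsigned char *)buf + (size_t)i * chunk;
        /* last slice picks up the remainder */
        sl[i].len  = (i == nthreads - 1) ? size - (size_t)i * chunk : chunk;
        pthread_create(&tid[i], NULL, zero_slice, &sl[i]);
    }
    for (i = 0; i < nthreads; i++)
        pthread_join(tid[i], NULL);
}
```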
Most importantly, there's no way to tell if any of this will help unless you try it. At a minimum, take a look at what your compiler emits for each case. See what other compilers emit for their standard 'memset' as well (their implementation might be more efficient than your compiler's).

Answer by SmugLispWeenie

memset can be inlined by the compiler as a series of efficient opcodes, unrolled for a few cycles. For very large memory blocks, like a 4000x2000 64-bit framebuffer, you can try optimizing it across several threads (which you prepare for that sole task), each setting its own part. Note that there is also bzero(), but it is more obscure and less likely to be as well optimized as memset, and the compiler will surely notice you pass 0.

What the compiler usually assumes is that you memset large blocks, so if you are initializing a large number of small objects, it would likely be more efficient to just do *(uint64_t*)p = 0.

Generally, all x86 CPUs are different (unless you compile for some standardized platform), and something you optimize for a Pentium 2 will behave differently on a Core Duo or an i486. So if you're really into it and want to squeeze out the last few bits of toothpaste, it makes sense to ship several versions of your exe, compiled and optimized for different popular CPU models. From personal experience, Clang with -march=native boosted my game's FPS from 60 to 65 compared to no -march.
