
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/4260602/


How to increase performance of memcpy

Tags: c, visual-studio, memcpy, cvi, memory-bandwidth

Asked by leecbaker

Summary:


memcpy seems unable to transfer over 2GB/sec on my system in a real or test application. What can I do to get faster memory-to-memory copies?


Full details:


As part of a data capture application (using some specialized hardware), I need to copy about 3 GB/sec from temporary buffers into main memory. To acquire data, I provide the hardware driver with a series of buffers (2MB each). The hardware DMAs data to each buffer, and then notifies my program when each buffer is full. My program empties the buffer (memcpy to another, larger block of RAM), and reposts the processed buffer to the card to be filled again. I am having issues with memcpy moving the data fast enough. It seems that the memory-to-memory copy should be fast enough to support 3GB/sec on the hardware that I am running on. Lavalys EVEREST gives me a 9337MB/sec memory copy benchmark result, but I can't get anywhere near those speeds with memcpy, even in a simple test program.


I have isolated the performance issue by adding/removing the memcpy call inside the buffer processing code. Without the memcpy, I can run the full data rate - about 3 GB/sec. With the memcpy enabled, I am limited to about 550 MB/sec (using the current compiler).


In order to benchmark memcpy on my system, I've written a separate test program that just calls memcpy on some blocks of data. (I've posted the code below) I've run this both in the compiler/IDE that I'm using (National Instruments CVI) as well as Visual Studio 2010. While I'm not currently using Visual Studio, I am willing to make the switch if it will yield the necessary performance. However, before blindly moving over, I wanted to make sure that it would solve my memcpy performance problems.


Visual C++ 2010: 1900 MB/sec


NI CVI 2009: 550 MB/sec


While I am not surprised that CVI is significantly slower than Visual Studio, I am surprised that the memcpy performance is this low. While I'm not sure if this is directly comparable, this is much lower than the EVEREST benchmark bandwidth. While I don't need quite that level of performance, a minimum of 3GB/sec is necessary. Surely the standard library implementation can't be this much worse than whatever EVEREST is using!


What, if anything, can I do to make memcpy faster in this situation?




Hardware details: AMD Magny Cours (4x octal core), 128 GB DDR3, Windows Server 2003 Enterprise X64


Test program:


#include <windows.h>
#include <stdio.h>

const size_t NUM_ELEMENTS = 2*1024 * 1024;
const size_t ITERATIONS = 10000;

int main (int argc, char *argv[])
{
    LARGE_INTEGER start, stop, frequency;

    QueryPerformanceFrequency(&frequency);

    unsigned short * src = (unsigned short *) malloc(sizeof(unsigned short) * NUM_ELEMENTS);
    unsigned short * dest = (unsigned short *) malloc(sizeof(unsigned short) * NUM_ELEMENTS);

    for(int ctr = 0; ctr < NUM_ELEMENTS; ctr++)
    {
        src[ctr] = rand();
    }

    QueryPerformanceCounter(&start);

    for(int iter = 0; iter < ITERATIONS; iter++)
        memcpy(dest, src, NUM_ELEMENTS * sizeof(unsigned short));

    QueryPerformanceCounter(&stop);

    __int64 duration = stop.QuadPart - start.QuadPart;

    double duration_d = (double)duration / (double) frequency.QuadPart;

    /* NUM_ELEMENTS * sizeof(unsigned short) bytes per copy; report MB/sec */
    double mb_per_sec = (ITERATIONS * (NUM_ELEMENTS / 1024.0 / 1024.0) * sizeof(unsigned short)) / duration_d;

    printf("Duration: %.5lfs for %d iterations, %.3lfMB/sec\n", duration_d, (int)ITERATIONS, mb_per_sec);

    free(src);
    free(dest);

    getchar();

    return 0;
}

EDIT: If you have an extra five minutes and want to contribute, can you run the above code on your machine and post your time as a comment?


Accepted answer by leecbaker

I have found a way to increase speed in this situation. I wrote a multi-threaded version of memcpy, splitting the area to be copied between threads. Here are some performance scaling numbers for a set block size, using the same timing code as found above. I had no idea that the performance, especially for this small size of block, would scale to this many threads. I suspect that this has something to do with the large number of memory controllers (16) on this machine.


Performance (10000x 4MB block memcpy):

 1 thread :  1826 MB/sec
 2 threads:  3118 MB/sec
 3 threads:  4121 MB/sec
 4 threads: 10020 MB/sec
 5 threads: 12848 MB/sec
 6 threads: 14340 MB/sec
 8 threads: 17892 MB/sec
10 threads: 21781 MB/sec
12 threads: 25721 MB/sec
14 threads: 25318 MB/sec
16 threads: 19965 MB/sec
24 threads: 13158 MB/sec
32 threads: 12497 MB/sec

I don't understand the huge performance jump between 3 and 4 threads. What would cause a jump like this?


I've included the memcpy code that I wrote below for others who may run into this same issue. Please note that there is no error checking in this code - this may need to be added for your application.


#define NUM_CPY_THREADS 4

HANDLE hCopyThreads[NUM_CPY_THREADS] = {0};
HANDLE hCopyStartSemaphores[NUM_CPY_THREADS] = {0};
HANDLE hCopyStopSemaphores[NUM_CPY_THREADS] = {0};
typedef struct
{
    int ct;
    void * src, * dest;
    size_t size;
} mt_cpy_t;

mt_cpy_t mtParamters[NUM_CPY_THREADS] = {0};

DWORD WINAPI thread_copy_proc(LPVOID param)
{
    mt_cpy_t * p = (mt_cpy_t * ) param;

    while(1)
    {
        WaitForSingleObject(hCopyStartSemaphores[p->ct], INFINITE);
        memcpy(p->dest, p->src, p->size);
        ReleaseSemaphore(hCopyStopSemaphores[p->ct], 1, NULL);
    }

    return 0;
}

int startCopyThreads()
{
    for(int ctr = 0; ctr < NUM_CPY_THREADS; ctr++)
    {
        hCopyStartSemaphores[ctr] = CreateSemaphore(NULL, 0, 1, NULL);
        hCopyStopSemaphores[ctr] = CreateSemaphore(NULL, 0, 1, NULL);
        mtParamters[ctr].ct = ctr;
        hCopyThreads[ctr] = CreateThread(0, 0, thread_copy_proc, &mtParamters[ctr], 0, NULL); 
    }

    return 0;
}

void * mt_memcpy(void * dest, void * src, size_t bytes)
{
    //set up parameters
    for(int ctr = 0; ctr < NUM_CPY_THREADS; ctr++)
    {
        mtParamters[ctr].dest = (char *) dest + ctr * bytes / NUM_CPY_THREADS;
        mtParamters[ctr].src = (char *) src + ctr * bytes / NUM_CPY_THREADS;
        mtParamters[ctr].size = (ctr + 1) * bytes / NUM_CPY_THREADS - ctr * bytes / NUM_CPY_THREADS;
    }

    //release semaphores to start computation
    for(int ctr = 0; ctr < NUM_CPY_THREADS; ctr++)
        ReleaseSemaphore(hCopyStartSemaphores[ctr], 1, NULL);

    //wait for all threads to finish
    WaitForMultipleObjects(NUM_CPY_THREADS, hCopyStopSemaphores, TRUE, INFINITE);

    return dest;
}

int stopCopyThreads()
{
    for(int ctr = 0; ctr < NUM_CPY_THREADS; ctr++)
    {
        TerminateThread(hCopyThreads[ctr], 0);
        CloseHandle(hCopyStartSemaphores[ctr]);
        CloseHandle(hCopyStopSemaphores[ctr]);
    }
    return 0;
}

Answered by onemasse

I'm not sure if it's done at run time or if you have to do it at compile time, but you should have SSE or similar extensions enabled, as the vector unit can often write 128 bits to memory, compared with 64 bits for the CPU.

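As a rough illustration of what the vector unit buys you, here is a minimal SSE2 copy loop using compiler intrinsics. This is only a sketch: it assumes x86/x64 with SSE2, both pointers 16-byte aligned, and a size that is a multiple of 16; a real implementation must also handle the misaligned head and tail.

```c
#include <assert.h>
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stddef.h>
#include <string.h>

/* Copy `bytes` bytes, 16 at a time, with SSE2 aligned load/store.
 * Assumes 16-byte-aligned pointers and bytes % 16 == 0. */
static void *sse2_memcpy(void *dest, const void *src, size_t bytes)
{
    __m128i *d = (__m128i *)dest;
    const __m128i *s = (const __m128i *)src;
    for (size_t i = 0; i < bytes / 16; i++)
        _mm_store_si128(d + i, _mm_load_si128(s + i));
    return dest;
}
```

In practice a good library memcpy already contains a loop like this, so the win comes mostly from making sure your buffers satisfy its fast-path alignment conditions.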

Try this implementation.


Yeah, and make sure that both the source and destination are aligned to 128 bits. If your source and destination are not aligned relative to each other, your memcpy() will have to do some serious magic. :)


Answered by Skizz

You have a few barriers to obtaining the required memory performance:


  1. Bandwidth - there is a limit to how quickly data can move from memory to the CPU and back again. According to this Wikipedia article, 266MHz DDR3 RAM has an upper limit of around 17GB/s. Now, with a memcpy you need to halve this to get your maximum transfer rate since the data is read and then written. From your benchmark results, it looks like you're not running the fastest possible RAM in your system. If you can afford it, upgrade the motherboard / RAM (and it won't be cheap, Overclockers in the UK currently have 3x4GB PC16000 at £400)

  2. The OS - Windows is a preemptive multitasking OS, so every so often your process will be suspended to allow other processes to have a look in and do stuff. This will clobber your caches and stall your transfer. In the worst case your entire process could be paged out to disk!

  3. The CPU - the data being moved has a long way to go: RAM -> L2 Cache -> L1 Cache -> CPU -> L1 -> L2 -> RAM. There may even be an L3 cache. If you want to involve the CPU, you really want to be loading L2 whilst copying L1. Unfortunately, modern CPUs can run through an L1 cache block quicker than the time taken to load the L1. The CPU has a memory controller that helps a lot in cases where you're streaming data into the CPU sequentially, but you're still going to have problems.

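One way to get the "load L2 whilst copying L1" overlap from point 3 is software prefetching. The sketch below is an assumption-laden illustration, not a tuned routine: block and stride sizes (4096 and 64) are guesses that would need measurement on the actual machine, and `_mm_prefetch` is only a hint (it never faults, so prefetching slightly past the buffer end is harmless).

```c
#include <assert.h>
#include <emmintrin.h>
#include <stddef.h>
#include <string.h>

/* Copy in 4KB blocks, hinting the next block into cache while the
 * current block is being moved, so RAM loads overlap the copy. */
static void *prefetch_memcpy(void *dest, const void *src, size_t bytes)
{
    const size_t BLOCK = 4096;
    const char *s = (const char *)src;
    char *d = (char *)dest;
    size_t done = 0;
    while (done + BLOCK <= bytes) {
        /* request the next block, one cache line (64B) at a time */
        for (size_t off = 0; off < BLOCK; off += 64)
            _mm_prefetch(s + done + BLOCK + off, _MM_HINT_T0);
        memcpy(d + done, s + done, BLOCK);
        done += BLOCK;
    }
    memcpy(d + done, s + done, bytes - done);  /* tail */
    return dest;
}
```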

Of course, the fastest way to do something is to not do it. Can the captured data be written anywhere in RAM, or is the buffer used at a fixed location? If you can write it anywhere, then you don't need the memcpy at all. If it's fixed, could you process the data in place and use a double-buffer type system? That is, start capturing data and when it's half full, start processing the first half of the data. When the buffer's full, start writing captured data to the start and process the second half. This requires that the algorithm can process the data faster than the capture card produces it. It also assumes that the data is discarded after processing. Effectively, this is a memcpy with a transformation as part of the copy process, so you've got:


load -> transform -> save
\--/                 \--/
 capture card        RAM
   buffer

instead of:


load -> save -> load -> transform -> save
\-----------/
memcpy from
capture card
buffer to RAM
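A minimal sketch of that double-buffered (ping-pong) scheme. The names `capture_into` and `transform` are placeholders for the card driver's fill callback and your processing step; in the real system the capture runs via DMA concurrently with the processing, rather than sequentially as shown here.

```c
#include <assert.h>
#include <stddef.h>

enum { HALF = 1024 };   /* elements per half-buffer, for illustration */

/* Placeholder: the capture hardware filling one half of the buffer. */
static void capture_into(int *half, size_t n)
{
    for (size_t i = 0; i < n; i++) half[i] = (int)i;
}

/* Placeholder processing step: consume the data in place - no copy. */
static long transform(const int *half, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++) sum += half[i];
    return sum;
}

/* Ping-pong loop: while one half is being captured, the CPU
 * processes the other half, so no memcpy is needed at all. */
static long process_ping_pong(int *buf, int rounds)
{
    long total = 0;
    for (int r = 0; r < rounds; r++) {
        int *fill = buf + (r % 2) * HALF;        /* half being captured  */
        int *work = buf + ((r + 1) % 2) * HALF;  /* half being processed */
        capture_into(fill, HALF);
        total += transform(work, HALF);
    }
    return total;
}
```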

Or get faster RAM!


EDIT: Another option is to process the data between the data source and the PC - could you put a DSP / FPGA in there at all? Custom hardware will always be faster than a general purpose CPU.


Another thought: It's been a while since I've done any high performance graphics stuff, but could you DMA the data into the graphics card and then DMA it out again? You could even take advantage of CUDA to do some of the processing. This would take the CPU out of the memory transfer loop altogether.


Answered by Michael Burr

One thing to be aware of is that your process (and hence the performance of memcpy()) is impacted by the OS scheduling of tasks - it's hard to say how much of a factor this is in your timings, but it is difficult to control. The device DMA operation isn't subject to this, since it isn't running on the CPU once it's kicked off. Since your application is an actual real-time application though, you might want to experiment with Windows' process/thread priority settings if you haven't already. Just keep in mind that you have to be careful about this because it can have a really negative impact on other processes (and the user experience on the machine).


Another thing to keep in mind is that the OS memory virtualization might have an impact here - if the memory pages you're copying to aren't actually backed by physical RAM pages, the memcpy() operation will fault to the OS to get that physical backing in place. Your DMA pages are likely to be locked into physical memory (since they have to be for the DMA operation), so the source memory to memcpy() is likely not an issue in this regard. You might consider using the Win32 VirtualAlloc() API to ensure that your destination memory for the memcpy() is committed (I think VirtualAlloc() is the right API for this, but there might be a better one that I'm forgetting - it's been a while since I've had a need to do anything like this).

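The Windows-specific details aside, the general idea - make sure the destination pages are committed and faulted in before the timed copy starts - can be sketched portably by touching one byte per page ahead of time. The 4096-byte page size below is an assumption (on Windows you would query it via `GetSystemInfo`, and `VirtualAlloc` with `MEM_COMMIT` plus a warm-up pass achieves the same end):

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Touch one byte in every page so the OS backs the whole range with
 * physical memory before the performance-critical copy begins. */
static void prefault(volatile char *p, size_t bytes, size_t page_size)
{
    for (size_t off = 0; off < bytes; off += page_size)
        p[off] = 0;
    if (bytes > 0)
        p[bytes - 1] = 0;   /* last page, if bytes isn't page-aligned */
}
```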

Finally, see if you can use the technique explained by Skizz to avoid the memcpy() altogether - that's your best bet if resources permit.


Answered by Simone

First of all, you need to check that memory is aligned on a 16-byte boundary, otherwise you get penalties. This is the most important thing.

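A quick way to verify (and request) that alignment - `aligned_alloc` is C11; on MSVC the equivalent is `_aligned_malloc`/`_aligned_free`:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Nonzero iff p sits on a 16-byte boundary. */
static int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}
```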

If you don't need a standard-compliant solution, you could check whether things improve by using some compiler-specific extension such as memcpy64 (check your compiler's docs for what's available). The fact is that memcpy must be able to deal with single-byte copies, but moving 4 or 8 bytes at a time is much faster if you don't have this restriction.

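To illustrate the point about wide moves, here is a sketch that copies 8 bytes at a time for the bulk of the data and falls back to bytes for the tail. The fixed-size inner `memcpy` calls sidestep strict-aliasing issues and compile to single 64-bit loads/stores on mainstream compilers; note that a real library memcpy typically already does this (and more) internally.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy using 64-bit moves for the bulk, byte copies for the 0-7 byte tail. */
static void *copy64(void *dest, const void *src, size_t bytes)
{
    char *d = (char *)dest;
    const char *s = (const char *)src;
    size_t i = 0;
    for (; i + 8 <= bytes; i += 8) {
        uint64_t w;                 /* one 64-bit move per iteration */
        memcpy(&w, s + i, 8);
        memcpy(d + i, &w, 8);
    }
    for (; i < bytes; i++)          /* tail */
        d[i] = s[i];
    return dest;
}
```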

Again, is it an option for you to write inline assembly code?


Answered by Stéphan Kochen

Perhaps you can explain some more about how you're processing the larger memory area?


Would it be possible within your application to simply pass ownership of the buffer, rather than copy it? This would eliminate the problem altogether.

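Passing ownership instead of copying can be as simple as exchanging which buffer the producer fills next - a sketch with hypothetical bookkeeping:

```c
#include <assert.h>
#include <stddef.h>

/* Instead of memcpy(dest, src, size), exchange the two pointers: the
 * consumer takes ownership of the full buffer and the producer gets an
 * empty one to refill. O(1) regardless of buffer size. */
static void swap_buffers(void **producer_buf, void **consumer_buf)
{
    void *tmp = *producer_buf;
    *producer_buf = *consumer_buf;
    *consumer_buf = tmp;
}
```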

Or are you using memcpy for more than just copying? Perhaps you're using the larger area of memory to build a sequential stream of data from what you've captured? Especially if you're processing one character at a time, you may be able to meet halfway. For example, it may be possible to adapt your processing code to accommodate a stream represented as 'an array of buffers', rather than 'a continuous memory area'.

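The 'array of buffers' idea amounts to iterating a list of chunks instead of one flat region - here is a sketch of a sequential pass over such a list (the `buffer_stream` structure and field names are made up for illustration):

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    char  **chunks;   /* array of capture buffers (e.g. the 2MB DMA buffers) */
    size_t *lens;     /* bytes valid in each chunk */
    size_t  count;    /* number of chunks */
} buffer_stream;

/* Visit every byte of the stream in order without ever coalescing the
 * chunks into one contiguous allocation. Returns total bytes seen. */
static size_t stream_total(const buffer_stream *bs, long *checksum)
{
    size_t total = 0;
    *checksum = 0;
    for (size_t c = 0; c < bs->count; c++) {
        for (size_t i = 0; i < bs->lens[c]; i++)
            *checksum += (unsigned char)bs->chunks[c][i];
        total += bs->lens[c];
    }
    return total;
}
```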

Answered by Christopher

You can write a better implementation of memcpy using SSE2 registers. The version in VC2010 does this already. So the question is more whether you are handing it aligned memory.


Maybe you can do better than the VC 2010 version, but it does need some understanding of how to do it.


PS: You can pass the buffer to the user mode program in an inverted call, to prevent the copy altogether.


Answered by R.. GitHub STOP HELPING ICE

One source I would recommend you read is MPlayer's fast_memcpy function. Also consider the expected usage patterns, and note that modern CPUs have special store instructions which let you inform the CPU whether or not you will need to read back the data you're writing. Using the instructions that indicate you won't be reading back the data (and thus it doesn't need to be cached) can be a huge win for large memcpy operations.

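A sketch of such a copy using SSE2 non-temporal stores - `_mm_stream_si128` writes around the cache, which helps precisely when the destination won't be read back soon. Assumptions: a 16-byte-aligned destination, a size that's a multiple of 16, and an `_mm_sfence` before other code reads the result (all of which a production version would have to enforce or work around):

```c
#include <assert.h>
#include <emmintrin.h>
#include <stddef.h>
#include <string.h>

/* Bulk copy with non-temporal (streaming) stores: the written lines
 * bypass the cache instead of evicting data the CPU still needs. */
static void *stream_memcpy(void *dest, const void *src, size_t bytes)
{
    __m128i *d = (__m128i *)dest;            /* must be 16-byte aligned */
    const __m128i *s = (const __m128i *)src;
    for (size_t i = 0; i < bytes / 16; i++)
        _mm_stream_si128(d + i, _mm_loadu_si128(s + i));
    _mm_sfence();                            /* order the streaming stores */
    return dest;
}
```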