Why is CUDA pinned memory so fast? (Linux)

Disclaimer: this page is based on a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/5736968/


Why is CUDA pinned memory so fast?

Tags: c++, c, linux, cuda

Asked by Gearoid Murphy

I observe substantial speedups when I use pinned memory for CUDA data transfers. On Linux, the underlying system call for achieving this is mlock. The man page for mlock states that locking the page prevents it from being swapped out:


mlock() locks pages in the address range starting at addr and continuing for len bytes. All pages that contain a part of the specified address range are guaranteed to be resident in RAM when the call returns successfully;


In my tests, I had a few gigs of free memory on my system, so there was never any risk of the memory pages being swapped out, yet I still observed the speedup. Can anyone explain what's really going on here? Any insight or info is much appreciated.


Accepted answer by osgx

The CUDA driver checks whether the memory range is page-locked and uses a different code path accordingly. Locked memory is kept in physical memory (RAM), so the device can fetch it without help from the CPU (DMA, a.k.a. async copy; the device only needs the list of physical pages). Non-locked memory can generate a page fault on access, and it is not necessarily in RAM (e.g. it can be in swap), so the driver has to touch every page of the non-locked memory, copy it into a pinned buffer, and hand that to DMA (a synchronous, page-by-page copy).

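The difference between the two code paths is easy to see with a small benchmark. This is only a sketch (the helper `timeCopy` is illustrative, a CUDA-capable device is assumed, and error checking is omitted for brevity):

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Time one host-to-device copy of `src` (pageable or pinned) with CUDA events.
static float timeCopy(void *dst, const void *src, size_t bytes) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    cudaMemcpy(dst, src, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main() {
    const size_t bytes = 64 << 20;          // 64 MiB
    void *d_buf, *pinned;
    cudaMalloc(&d_buf, bytes);
    void *pageable = malloc(bytes);         // ordinary, swappable pages
    cudaMallocHost(&pinned, bytes);         // page-locked host memory

    printf("pageable: %.2f ms\n", timeCopy(d_buf, pageable, bytes));
    printf("pinned:   %.2f ms\n", timeCopy(d_buf, pinned, bytes));
    // The pinned copy is typically noticeably faster: it goes straight
    // to DMA and skips the page-by-page staging copy.

    free(pageable);
    cudaFreeHost(pinned);
    cudaFree(d_buf);
    return 0;
}
```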

As described at http://forums.nvidia.com/index.php?showtopic=164661:


host memory used by the asynchronous mem copy call needs to be page locked through cudaMallocHost or cudaHostAlloc.

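A minimal usage sketch of such an asynchronous copy (the function `asyncUpload` is illustrative, not part of any API; it assumes `h_data` was allocated with cudaMallocHost or cudaHostAlloc):

```cuda
#include <cuda_runtime.h>

// `h_data` must be page-locked; with plain pageable memory the call may
// fall back to a synchronous staged copy instead of a true async DMA.
void asyncUpload(float *d_data, const float *h_data, size_t n) {
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Returns immediately; the DMA engine pulls the pinned pages while
    // the CPU is free to do other work.
    cudaMemcpyAsync(d_data, h_data, n * sizeof(float),
                    cudaMemcpyHostToDevice, stream);

    // ... overlap CPU work here ...

    cudaStreamSynchronize(stream);   // wait for the transfer to finish
    cudaStreamDestroy(stream);
}
```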

I can also recommend checking the cudaMemcpyAsync and cudaHostAlloc manuals at developer.download.nvidia.com. The cudaHostAlloc manual says that the CUDA driver can detect pinned memory:


The driver tracks the virtual memory ranges allocated with this (cudaHostAlloc) function and automatically accelerates calls to functions such as cudaMemcpy().


Answer by R.. GitHub STOP HELPING ICE

If the memory pages had not been accessed yet, they were probably never swapped in to begin with. In particular, newly allocated pages will be virtual copies of the universal "zero page" and don't have a physical instantiation until they're written to. New mappings of files on disk will likewise remain purely on disk until they're read or written.


Answer by Shen Yang

CUDA uses DMA to transfer pinned memory to the GPU. Pageable host memory cannot be used with DMA because its pages may reside on disk. If the memory is not pinned (i.e. page-locked), it is first copied to a page-locked "staging" buffer and then copied to the GPU via DMA. With pinned memory you therefore save the time of copying from pageable host memory to the page-locked staging buffer.

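Conceptually, the pageable path looks something like the sketch below (a simplified illustration of the staging idea, not the actual driver code; `pageableUpload` is a hypothetical name):

```cuda
#include <cstring>
#include <cuda_runtime.h>

// Illustration of the staging idea for a pageable source buffer: copy
// chunk-by-chunk through a small pinned buffer, then DMA each chunk.
void pageableUpload(void *d_dst, const void *h_src, size_t bytes) {
    const size_t CHUNK = 1 << 20;           // 1 MiB staging buffer
    void *staging;
    cudaMallocHost(&staging, CHUNK);        // page-locked, DMA-able

    for (size_t off = 0; off < bytes; off += CHUNK) {
        size_t n = (bytes - off < CHUNK) ? bytes - off : CHUNK;
        // CPU copy: faults pages in from swap/disk if necessary.
        memcpy(staging, (const char *)h_src + off, n);
        // cudaMemcpy blocks until the DMA completes, so the staging
        // buffer is safe to reuse on the next iteration.
        cudaMemcpy((char *)d_dst + off, staging, n, cudaMemcpyHostToDevice);
    }
    cudaFreeHost(staging);
}
```

With pinned memory the loop and the intermediate memcpy disappear: the device DMAs directly from the user's buffer.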