Linux DMA Cache Coherence Management
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me): StackOverflow.
Original source: http://stackoverflow.com/questions/7132284/
DMA cache coherence management
Asked by Michael
My question is this: how can I determine when it is safe to disable cache snooping while I am correctly using [pci_]dma_sync_single_for_{cpu,device} in my device driver?
I'm working on a device driver for a device which writes directly to RAM over PCI Express (DMA), and I am concerned about managing cache coherence. There is a control bit I can set when initiating DMA to enable or disable cache snooping during the transfer; clearly, for performance, I would like to leave cache snooping disabled if at all possible.
In the interrupt routine I call pci_dma_sync_single_for_cpu() and ..._for_device() as appropriate when switching DMA buffers, but on 32-bit Linux 2.6.18 (RHEL 5) it turns out that these commands are macros which expand to nothing ... which explains why my device returns garbage when cache snooping is disabled on this kernel!
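For reference, the buffer-swap pattern described above typically looks something like the following sketch (hedged: my_dev, my_irq_handler and the buffer fields are illustrative only, not taken from the original post, and the two-argument interrupt handler signature is the one used by later kernels rather than 2.6.18's):

    #include <linux/pci.h>
    #include <linux/interrupt.h>

    /* Hypothetical per-device state; not taken from the original post. */
    struct my_dev {
            struct pci_dev *pdev;
            void           *buf_cpu;   /* kernel virtual address of the DMA buffer */
            dma_addr_t      buf_dma;   /* bus address handed to the device         */
            size_t          buf_size;
    };

    static irqreturn_t my_irq_handler(int irq, void *data)
    {
            struct my_dev *dev = data;

            /* Give the buffer back to the CPU before reading what the device wrote. */
            pci_dma_sync_single_for_cpu(dev->pdev, dev->buf_dma,
                                        dev->buf_size, PCI_DMA_FROMDEVICE);

            /* ... process dev->buf_cpu here ... */

            /* Hand the buffer back to the device for the next transfer. */
            pci_dma_sync_single_for_device(dev->pdev, dev->buf_dma,
                                           dev->buf_size, PCI_DMA_FROMDEVICE);

            return IRQ_HANDLED;
    }

On kernels where these calls expand to no-ops, as observed above for 32-bit x86, this pattern only keeps the data correct if the hardware maintains coherency itself, i.e. with snooping enabled.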
I've trawled through the history of the kernel sources, and it seems that up until 2.6.25 only 64-bit x86 had hooks for DMA synchronisation. From 2.6.26 there seems to be a generic unified indirection mechanism for DMA synchronisation (currently in include/asm-generic/dma-mapping-common.h) via the sync_single_for_{cpu,device} fields of dma_map_ops, but so far I've failed to find any definitions of these operations.
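For illustration, the indirection mechanism mentioned here simply dispatches to whatever the architecture installs in dma_map_ops; in rough, paraphrased form (not verbatim kernel source) the generic wrapper looks like this:

    /* Paraphrased shape of the generic helper from around 2.6.26; not a
     * verbatim copy of include/asm-generic/dma-mapping-common.h.        */
    static inline void dma_sync_single_for_cpu(struct device *dev,
                                               dma_addr_t addr, size_t size,
                                               enum dma_data_direction dir)
    {
            struct dma_map_ops *ops = get_dma_ops(dev);

            if (ops->sync_single_for_cpu)
                    ops->sync_single_for_cpu(dev, addr, size, dir);
            /* If the architecture installs no hook, the call is effectively
             * a no-op, which matches the behaviour observed on x86.        */
    }

So whether the call does anything at all comes down to whether the architecture's dma_map_ops supplies a sync_single_for_cpu implementation.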
Answered by Niall Douglas
I'm really surprised no one has answered this, so here we go with a non-Linux-specific answer (I have insufficient knowledge of the Linux kernel itself to be more specific) ...
Cache snooping simply tells the DMA controller to send cache invalidation requests to all CPUs for the memory being DMAed into. This obviously adds load to the cache coherency bus, and it scales particularly badly with additional processors, as not all CPUs will have a single-hop connection with the DMA controller issuing the snoop. Therefore, the simple answer to "when is it safe to disable cache snooping" is: when the memory being DMAed into either does not exist in any CPU cache OR its cache lines are marked as invalid. In other words, any attempt to read from the DMAed region will always result in a read from main memory.
So how do you ensure reads from a DMAed region will always go to main memory?
Back in the day, before we had fancy features like DMA cache snooping, what we used to do was pipeline DMA memory by feeding it through a series of broken-up stages as follows (a rough sketch of the scheme follows the list):
Stage 1: Add "dirty" DMA memory region to the "dirty and needs to be cleaned" DMA memory list.
Stage 2: Next time the device interrupts with fresh DMA'ed data, issue an async local CPU cache invalidate for DMA segments in the "dirty and needs to be cleaned" list for all CPUs which might access those blocks (often each CPU runs its own lists made up of local memory blocks). Move said segments into a "clean" list.
Stage 3: Next DMA interrupt (which of course you're sure will not occur before the previous cache invalidate has completed), take a fresh region from the "clean" list and tell the device that its next DMA should go into that. Recycle any dirty blocks.
Stage 4: Repeat.
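A rough, conceptual C sketch of the scheme above (every name in it, such as dma_buf, cache_invalidate_async and on_dma_interrupt, is hypothetical, and it assumes, as stage 3 does, that an invalidate issued on one interrupt completes before the next interrupt arrives):

    #include <stddef.h>

    /* Conceptual sketch only: nothing here is a real kernel API. */
    struct dma_buf {
            void           *cpu_addr;   /* CPU view of the buffer               */
            unsigned long   bus_addr;   /* address programmed into the device   */
            size_t          len;
            struct dma_buf *next;
    };

    static struct dma_buf *dirty_list;  /* DMAed into; caches possibly stale     */
    static struct dma_buf *clean_list;  /* invalidates issued on an earlier pass */

    /* Stand-in for the platform's asynchronous cache-invalidate primitive. */
    static void cache_invalidate_async(struct dma_buf *buf) { (void)buf; }

    /* Called on each DMA-complete interrupt with the buffer the device just
     * filled; returns the buffer to program into the device next.           */
    static struct dma_buf *on_dma_interrupt(struct dma_buf *just_filled)
    {
            struct dma_buf *buf, *next_for_device;

            /* Stage 3: take a region whose invalidate was issued on a previous
             * pass; by assumption that invalidate has completed by now.        */
            next_for_device = clean_list;
            if (next_for_device)
                    clean_list = next_for_device->next;

            /* Stage 1: the freshly filled buffer joins the dirty list. */
            just_filled->next = dirty_list;
            dirty_list = just_filled;

            /* Stage 2: kick off async invalidates for every dirty region and
             * move them to the clean list for reuse on a later interrupt.    */
            while ((buf = dirty_list) != NULL) {
                    dirty_list = buf->next;
                    cache_invalidate_async(buf);
                    buf->next = clean_list;
                    clean_list = buf;
            }

            return next_for_device;  /* Stage 4: repeat on the next interrupt */
    }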
As much as this is more work, it has several major advantages. Firstly, you can pin DMA handling to a single CPU (typically the primary CPU0) or a single SMP node, which means only a single CPU/node need worry about cache invalidation. Secondly, you give the memory subsystem much more opportunity to hide memory latencies for you by spacing out operations over time and spreading out load on the cache coherency bus. The key for performance is generally to try and make any DMA occur on a CPU as close to the relevant DMA controller as possible and into memory as close to that CPU as possible.
If you always hand off newly-DMAed-into memory to user space and/or other CPUs, simply inject freshly acquired memory at the front of the async cache-invalidating pipeline. Some OSs (not sure about Linux) have an optimised routine for preordering zeroed memory, so the OS basically zeroes memory in the background and keeps a quick-satisfy cache around - it will pay you to keep new memory requests below that cached amount, because zeroing memory is extremely slow. I'm not aware of any platform produced in the past ten years which uses hardware-offloaded memory zeroing, so you must assume that all fresh memory may contain valid cache lines which need invalidating.
I appreciate this only answers half your question, but it's better than nothing. Good luck!
Niall
Answered by Kees-Jan
Maybe a bit overdue, but:
If you disable cache snooping, hardware will no longer take care of cache coherency. Hence, the kernel needs to do this itself. Over the past few days, I've spent some time reviewing the x86 variants of [pci_]dma_sync_single_for_{cpu,device}. I've found no indication that they make any effort to maintain coherency. This seems consistent with the fact that cache snooping is turned on by default in the PCI(e) spec.
Hence, if you are turning off cache snooping, you will have to maintain coherency yourself, in your driver. Possibly by calling clflush_cache_range() (X86) or similar?
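As a hedged sketch of that idea (my_dma_complete, buf and len are hypothetical, and whether clflush_cache_range() alone is sufficient depends on the platform and on when the device's writes become visible to the CPU):

    #include <asm/cacheflush.h>     /* clflush_cache_range() on x86 */

    /* Hypothetical completion path for a device-to-RAM transfer done with
     * cache snooping disabled: flush/invalidate the CPU cache lines that
     * cover the buffer before the driver reads the freshly DMAed data.   */
    static void my_dma_complete(void *buf, unsigned int len)
    {
            clflush_cache_range(buf, len);

            /* ... the DMAed data can now be read through 'buf' ... */
    }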
Refs: