DMA from a device to user-space memory in a Linux kernel device driver

Question

Asked by Ian Vaughan

I want to get data from a DMA enabled, PCIe hardware device into user-space as quickly as possible.

Q: How do I combine "direct I/O to user-space with/and/via a DMA transfer"?

Reading through LDD3, it seems that I need to perform a few different types of I/O operations!?
dma_alloc_coherent gives me the physical address that I can pass to the hardware device. But I would then need to set up get_user_pages and perform a copy_to_user type call when the transfer completes. This seems a waste: asking the device to DMA into kernel memory (acting as a buffer) and then transferring it again to user-space. LDD3 p453: /* Only now is it safe to access the buffer, copy to user, etc. */
What I ideally want is some memory that:
- I can use in user-space (maybe by asking the driver via an ioctl call to create a DMA'able memory buffer?)
- I can get a physical address from to pass to the device, so that all user-space has to do is perform a read on the driver
- the read method would activate the DMA transfer, block waiting for the DMA complete interrupt and release the user-space read afterwards (user-space is now safe to use/read the memory).

Do I need single-page streaming mappings, i.e. set up the mapping with user-space buffers mapped via get_user_pages and dma_map_page?

My code so far sets up get_user_pages at the given address from user-space (I call this the Direct I/O part). Then it calls dma_map_page with a page from get_user_pages. I give the device the return value from dma_map_page as the DMA physical transfer address.

I am using some kernel modules as reference: drivers_scsi_st.c and drivers-net-sh_eth.c. I would look at the infiniband code, but can't find which one is the most basic!

Many thanks in advance.

Answer 1

Accepted answer by Rakis

I'm actually working on exactly the same thing right now and I'm going the ioctl() route. The general idea is for user space to allocate the buffer which will be used for the DMA transfer, and an ioctl() will be used to pass the size and address of this buffer to the device driver. The driver will then use scatter-gather lists along with the streaming DMA API to transfer data directly between the device and the user-space buffer.

The implementation strategy I'm using is that the ioctl() in the driver enters a loop that DMAs the userspace buffer in chunks of 256k (a limit that follows from how many scatter/gather entries the hardware can handle). This is isolated inside a function that blocks until each transfer is complete (see below). When all bytes are transferred or the incremental transfer function returns an error, the ioctl() exits and returns to userspace.

Pseudo code for the ioctl():

/* Serialize all DMA transfers to/from the device */
if (mutex_lock_interruptible(&device_ptr->mtx))
    return -EINTR;

chunk_data = (unsigned long) user_space_addr;
while (*transferred < total_bytes && !ret) {
    chunk_bytes = total_bytes - *transferred;
    if (chunk_bytes > HW_DMA_MAX)
        chunk_bytes = HW_DMA_MAX; /* 256 KiB limit imposed by my device */
    ret = transfer_chunk(device_ptr, chunk_data, chunk_bytes, transferred);
    chunk_data += chunk_bytes;
}

mutex_unlock(&device_ptr->mtx);
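
The chunking logic of that loop can be tried out in user space. This is only a sketch under assumptions: transfer_all(), chunk_fn and count_chunks() are made-up names standing in for the driver's blocking per-chunk transfer, and HW_DMA_MAX is assumed to be 256 KiB as in the pseudo code.

```c
#include <stddef.h>

#define HW_DMA_MAX (256u * 1024u) /* assumed per-transfer hardware limit */

/* Hypothetical stand-in for the driver's blocking per-chunk transfer.
 * Returns 0 on success, nonzero on failure. */
typedef int (*chunk_fn)(unsigned long addr, size_t nbytes, void *ctx);

/* Walk [addr, addr + total) in chunks of at most HW_DMA_MAX bytes.
 * Stops at the first error; *transferred holds the bytes completed. */
int transfer_all(unsigned long addr, size_t total,
                 chunk_fn fn, void *ctx, size_t *transferred)
{
    int ret = 0;

    *transferred = 0;
    while (*transferred < total && ret == 0) {
        size_t chunk = total - *transferred;

        if (chunk > HW_DMA_MAX)
            chunk = HW_DMA_MAX;
        ret = fn(addr + *transferred, chunk, ctx);
        if (ret == 0)
            *transferred += chunk;
    }
    return ret;
}

/* Example callback: only counts how many chunks were requested. */
int count_chunks(unsigned long addr, size_t nbytes, void *ctx)
{
    (void)addr;
    (void)nbytes;
    ++*(unsigned int *)ctx;
    return 0;
}
```

Calling transfer_all() with a 600 KiB request and count_chunks as the callback results in three calls: two full 256 KiB chunks and one 88 KiB remainder.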

Pseudo code for incremental transfer function:

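/* Assuming the userspace pointer is passed as an unsigned long, */
/* calculate the first, last, and number of pages being transferred */

first_page = (udata & PAGE_MASK) >> PAGE_SHIFT;
last_page = ((udata + nbytes - 1) & PAGE_MASK) >> PAGE_SHIFT;
/* Offset within the first page: ~PAGE_MASK keeps the low bits, */
/* whereas PAGE_MASK would keep the page base address */
first_page_offset = udata & ~PAGE_MASK;
npages = last_page - first_page + 1;

/* Ensure that all userspace pages are locked in memory for the */
/* duration of the DMA transfer. This is the get_user_pages() */
/* argument list of older kernels; it has changed in later releases, */
/* so check the kernel you are building against. */

down_read(&current->mm->mmap_sem);
ret = get_user_pages(current,
                     current->mm,
                     udata,
                     npages,
                     is_writing_to_userspace,
                     0,
                     pages_array,
                     NULL);
up_read(&current->mm->mmap_sem);

/* get_user_pages() returns the number of pages actually pinned; */
/* treat ret != npages as an error and release whatever was pinned */

/* Map a scatter-gather list to point at the userspace pages */
sg_init_table(sglist, npages);

/*first*/
sg_set_page(&sglist[0], pages_array[0],
    min_t(size_t, nbytes, PAGE_SIZE - first_page_offset), first_page_offset);

/*middle*/
for (i = 1; i < npages - 1; i++)
    sg_set_page(&sglist[i], pages_array[i], PAGE_SIZE, 0);

/*last*/
if (npages > 1) {
    sg_set_page(&sglist[npages-1], pages_array[npages-1],
        nbytes - (PAGE_SIZE - first_page_offset) - ((npages-2)*PAGE_SIZE), 0);
}

/* Clear the completion flag before starting a new transfer */
device_ptr->flag_dma_done = 0;

/* Do the hardware specific thing to give it the scatter-gather list
   (after mapping it with dma_map_sg()) and tell it to start the DMA
   transfer */

/* Wait for the DMA transfer to complete */
ret = wait_event_interruptible_timeout(device_ptr->dma_wait,
         device_ptr->flag_dma_done, HZ*2);

/* Whatever the outcome: dma_unmap_sg() the list, mark the pages dirty
   if the device wrote to them, and release each pinned page */

if (ret == 0)
    return -ETIMEDOUT;   /* DMA operation timed out */
else if (ret == -ERESTARTSYS)
    return ret;          /* DMA operation interrupted by signal */

/* DMA success */
*transferred += nbytes;
return 0;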
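
The page and length arithmetic behind the scatter-gather setup is easy to get wrong, so here is a small user-space sketch of it. It assumes 4 KiB pages; struct seg, span_pages() and build_segs() are hypothetical names, and struct seg only stands in for the (offset, length) part of a scatterlist entry.

```c
#include <stddef.h>

#define PAGE_SHIFT 12                     /* assumed 4 KiB pages */
#define PAGE_SIZE  (1UL << PAGE_SHIFT)
#define PAGE_MASK  (~(PAGE_SIZE - 1))

struct seg {            /* hypothetical stand-in for a scatterlist entry */
    size_t offset;      /* byte offset within the page */
    size_t length;      /* bytes covered in that page */
};

/* Number of pages touched by [udata, udata + nbytes); nbytes must be > 0. */
size_t span_pages(unsigned long udata, size_t nbytes)
{
    unsigned long first = udata >> PAGE_SHIFT;
    unsigned long last = (udata + nbytes - 1) >> PAGE_SHIFT;

    return (size_t)(last - first + 1);
}

/* Fill segs[] (capacity max_segs) with one entry per page.
 * Returns the number of entries written, or 0 if nbytes is 0 or
 * segs[] is too small. */
size_t build_segs(unsigned long udata, size_t nbytes,
                  struct seg *segs, size_t max_segs)
{
    size_t npages, i, remaining = nbytes;
    size_t offset = udata & ~PAGE_MASK;

    if (nbytes == 0)
        return 0;
    npages = span_pages(udata, nbytes);
    if (npages > max_segs)
        return 0;

    for (i = 0; i < npages; i++) {
        size_t len = PAGE_SIZE - offset;

        if (len > remaining)
            len = remaining;    /* short first or last page */
        segs[i].offset = offset;
        segs[i].length = len;
        remaining -= len;
        offset = 0;             /* only the first page starts mid-page */
    }
    return npages;
}
```

Clamping every entry to the bytes still remaining also covers a transfer that starts and ends inside a single page, which a separate first/middle/last calculation has to handle as a special case.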

The interrupt handler is exceptionally brief:

/* Do hardware specific thing to make the device happy */

/* Wake the thread waiting for this DMA operation to complete */
device_ptr->flag_dma_done = 1;
wake_up_interruptible(&device_ptr->dma_wait);

Please note that this is just a general approach. I've been working on this driver for the last few weeks and have yet to actually test it... So please, don't treat this pseudo code as gospel and be sure to double check all logic and parameters ;-).

Answer 2

Answered by Roland

You basically have the right idea: in 2.1, you can just have userspace allocate any old memory. You do want it page-aligned, so posix_memalign() is a handy API to use.

Then have userspace pass in the userspace virtual address and size of this buffer somehow; ioctl() is a good quick and dirty way to do this. In the kernel, allocate an appropriately sized buffer array of struct page* (user_buf_size/PAGE_SIZE entries) and use get_user_pages() to get a list of struct page* for the userspace buffer.
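
For the user-space side, something along these lines should do. It is a sketch, not a definitive implementation: struct dma_xfer_req and alloc_dma_buffer() are made-up names, the request layout is an assumption (a real driver defines its own ioctl argument), and rounding the size up to whole pages is my own choice rather than a requirement.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L
#endif
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical request layout that a driver's ioctl() could accept. */
struct dma_xfer_req {
    uint64_t user_addr;  /* user-space virtual address of the buffer */
    uint64_t length;     /* buffer size in bytes */
};

/* Allocate a page-aligned buffer whose size is rounded up to whole pages,
 * and describe it in *req. Returns the buffer, or NULL on failure.
 * The caller releases it with free(). */
void *alloc_dma_buffer(size_t nbytes, struct dma_xfer_req *req)
{
    long page = sysconf(_SC_PAGESIZE);
    size_t psz, rounded;
    void *buf = NULL;

    if (page <= 0 || nbytes == 0)
        return NULL;
    psz = (size_t)page;
    rounded = (nbytes + psz - 1) / psz * psz;
    if (posix_memalign(&buf, psz, rounded) != 0)
        return NULL;

    req->user_addr = (uint64_t)(uintptr_t)buf;
    req->length = rounded;
    return buf;
}
```

The filled-in request is what would then be handed to the driver through ioctl(); the request code itself depends entirely on the driver.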

Once you have that, you can allocate an array of struct scatterlist that is the same size as your page array and loop through the list of pages doing sg_set_page(). After the sg list is set up, you do dma_map_sg() on the array of scatterlist and then you can get the sg_dma_address and sg_dma_len for each entry in the scatterlist (note you have to use the return value of dma_map_sg() because you may end up with fewer mapped entries, since things might get merged by the DMA mapping code).
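
To see why the returned count can be smaller than the number of pages, here is a user-space illustration of merging adjacent ranges. It is not the kernel's DMA mapping code, just a toy model under the assumption that entries whose bus address ranges touch can be folded into one; struct dma_seg and coalesce_segs() are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

struct dma_seg {         /* hypothetical (bus address, length) pair */
    uint64_t addr;
    uint32_t len;
};

/* Merge neighbouring entries whose address ranges touch, in place.
 * Returns the new entry count, which may be smaller than n. */
size_t coalesce_segs(struct dma_seg *segs, size_t n)
{
    size_t out = 0, i;

    if (n == 0)
        return 0;
    for (i = 1; i < n; i++) {
        if (segs[out].addr + segs[out].len == segs[i].addr)
            segs[out].len += segs[i].len;   /* contiguous: extend */
        else
            segs[++out] = segs[i];          /* gap: start a new entry */
    }
    return out + 1;
}
```

Three page-sized entries at 0x1000, 0x2000 and 0x5000 collapse to two here, which is the situation where programming the device with the original page count would be wrong.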

That gives you all the bus addresses to pass to your device, and then you can trigger the DMA and wait for it however you want. The read()-based scheme you have is probably fine.

You can refer to drivers/infiniband/core/umem.c, specifically ib_umem_get(), for some code that builds up this mapping, although the generality that that code needs to deal with may make it a bit confusing.

Alternatively, if your device doesn't handle scatter/gather lists too well and you want contiguous memory, you could use __get_free_pages() to allocate a physically contiguous buffer and use dma_map_page() on that. To give userspace access to that memory, your driver just needs to implement an mmap method instead of the ioctl as described above.

Answer 3

Answered by Roland

At some point I wanted to allow a user-space application to allocate DMA buffers, have them mapped to user-space and get the physical address, so that it could control my device and do DMA transactions (bus mastering) entirely from user-space, totally bypassing the Linux kernel. I used a slightly different approach though. First I started with a minimal kernel module that initialized/probed the PCIe device and created a character device. That driver then allowed a user-space application to do two things:

- Map the PCIe device's I/O BAR into user-space using the remap_pfn_range() function.
- Allocate and free DMA buffers, map them to user space and pass a physical bus address to the user-space application.

Basically, it boils down to a custom implementation of the mmap() call (through file_operations). The one for the I/O BAR is easy:

struct vm_operations_struct a2gx_bar_vma_ops = {
};

static int a2gx_cdev_mmap_bar2(struct file *filp, struct vm_area_struct *vma)
{
    struct a2gx_dev *dev;
    size_t size;

    size = vma->vm_end - vma->vm_start;
    if (size != 134217728) /* 128 MiB, presumably the size of BAR 2 */
        return -EIO;

    dev = filp->private_data;
    vma->vm_ops = &a2gx_bar_vma_ops;
    vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
    vma->vm_private_data = dev;

    if (remap_pfn_range(vma, vma->vm_start,
                        vmalloc_to_pfn(dev->bar2),
                        size, vma->vm_page_prot))
    {
        return -EAGAIN;
    }

    return 0;
}

And another one that allocates DMA buffers using pci_alloc_consistent() is a little bit more complicated:

static void a2gx_dma_vma_close(struct vm_area_struct *vma)
{
    struct a2gx_dma_buf *buf;
    struct a2gx_dev *dev;

    buf = vma->vm_private_data;
    dev = buf->priv_data;

    pci_free_consistent(dev->pci_dev, buf->size, buf->cpu_addr, buf->dma_addr);
    buf->cpu_addr = NULL; /* Mark this buffer data structure as unused/free */
}

struct vm_operations_struct a2gx_dma_vma_ops = {
    .close = a2gx_dma_vma_close
};

static int a2gx_cdev_mmap_dma(struct file *filp, struct vm_area_struct *vma)
{
    struct a2gx_dev *dev;
    struct a2gx_dma_buf *buf;
    size_t size;
    unsigned int i;

    /* Obtain a pointer to our device structure and calculate the size
       of the requested DMA buffer */
    dev = filp->private_data;
    size = vma->vm_end - vma->vm_start;

    if (size < sizeof(unsigned long))
        return -EINVAL; /* Something fishy is happening */

    /* Find a structure where we can store extra information about this
       buffer to be able to release it later. */
    for (i = 0; i < A2GX_DMA_BUF_MAX; ++i) {
        buf = &dev->dma_buf[i];
        if (buf->cpu_addr == NULL)
            break;
    }

    if (buf->cpu_addr != NULL)
        return -ENOBUFS; /* Oops, hit the limit of allowed number of
                            allocated buffers. Change A2GX_DMA_BUF_MAX and
                            recompile? */

    /* Allocate consistent memory that can be used for DMA transactions */
    buf->cpu_addr = pci_alloc_consistent(dev->pci_dev, size, &buf->dma_addr);
    if (buf->cpu_addr == NULL)
        return -ENOMEM; /* Out of juice */

    /* There is no way to pass extra information to the user. And I am too lazy
       to implement this mmap() call using ioctl(). So we simply tell the user
       the bus address of this buffer by copying it to the allocated buffer
       itself. Hacks, hacks everywhere. */
    memcpy(buf->cpu_addr, &buf->dma_addr, sizeof(buf->dma_addr));

    buf->size = size;
    buf->priv_data = dev;
    vma->vm_ops = &a2gx_dma_vma_ops;
    vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
    vma->vm_private_data = buf;

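    /*
     * Map this DMA buffer into user space. Note that vmalloc_to_pfn()
     * is only meant for vmalloc-style mappings; dma_mmap_coherent() is
     * the portable way to map coherent DMA memory.
     */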
    if (remap_pfn_range(vma, vma->vm_start,
                        vmalloc_to_pfn(buf->cpu_addr),
                        size, vma->vm_page_prot))
    {
        /* Out of luck, rollback... */
        pci_free_consistent(dev->pci_dev, buf->size, buf->cpu_addr,
                            buf->dma_addr);
        buf->cpu_addr = NULL;
        return -EAGAIN;
    }

    return 0; /* All good! */
}

Once those are in place, user space application can pretty much do everything — control the device by reading/writing from/to I/O registers, allocate and free DMA buffers of arbitrary size, and have the device perform DMA transactions. The only missing part is interrupt-handling. I was doing polling in user space, burning my CPU, and had interrupts disabled.

Hope it helps. Good Luck!

Answer 4

Answered by fbp

I'm getting confused with the direction to implement. I want to...

Consider the application when designing a driver.
What is the nature of data movement, frequency, size and what else might be going on in the system?

Is the traditional read/write API sufficient? Is direct mapping the device into user space OK? Is a reflective (semi-coherent) shared memory desirable?

Manually manipulating data (read/write) is a pretty good option if the data lends itself to being well understood. Using general-purpose VM and read/write may be sufficient with an inline copy. Directly mapping non-cacheable accesses to the peripheral is convenient, but can be clumsy. If the access is the relatively infrequent movement of large blocks, it may make sense to use regular memory and have the driver pin the pages, translate addresses, DMA and release the pages. As an optimization, the pages (maybe huge) can be pre-pinned and translated; the driver can then recognize the prepared memory and avoid the complexities of dynamic translation. If there are lots of little I/O operations, having the driver run asynchronously makes sense. If elegance is important, the VM dirty page flag can be used to automatically identify what needs to be moved and a (meta_sync()) call can be used to flush pages. Perhaps a mixture of the above works...

Too often people don't look at the larger problem, before digging into the details. Often the simplest solutions are sufficient. A little effort constructing a behavioral model can help guide what API is preferable.

Answer 5

Answered by Suman

first_page_offset = udata & PAGE_MASK;

This seems wrong: PAGE_MASK selects the page-number bits of the address, not the offset within the page. It should be either:

first_page_offset = udata & ~PAGE_MASK;

or

first_page_offset = udata & (PAGE_SIZE - 1);
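
A quick way to convince yourself, assuming 4 KiB pages and the kernel's definition of PAGE_MASK as ~(PAGE_SIZE - 1). The three helper functions are made up for this illustration.

```c
#define PAGE_SHIFT 12                    /* assumed 4 KiB pages */
#define PAGE_SIZE  (1UL << PAGE_SHIFT)
#define PAGE_MASK  (~(PAGE_SIZE - 1))

/* Address of the start of the page containing addr. */
unsigned long page_base(unsigned long addr)
{
    return addr & PAGE_MASK;
}

/* Offset of addr within its page, using the inverted mask. */
unsigned long page_offset(unsigned long addr)
{
    return addr & ~PAGE_MASK;
}

/* Same offset, written with PAGE_SIZE - 1. */
unsigned long page_offset_alt(unsigned long addr)
{
    return addr & (PAGE_SIZE - 1);
}
```

For the address 0x12345678, page_base() gives 0x12345000 while both offset variants give 0x678, so using PAGE_MASK where an offset is expected hands sg_set_page() a page base address instead.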

Answer 6

Answered by SlawekS

It is worth mentioning that a driver with Scatter-Gather DMA support and user space memory allocation is the most efficient and has the highest performance. However, if we don't need high performance, or we want to develop a driver under some simplified conditions, we can use some tricks.

Give up the zero copy design. This is worth considering when data throughput is not too high. In such a design data can be copied to the user by copy_to_user(user_buffer, kernel_dma_buffer, count); user_buffer might be, for example, the buffer argument in a character device read() system call implementation. We still need to take care of the kernel_dma_buffer allocation. It might be memory obtained from a dma_alloc_coherent() call, for example.

Another trick is to limit system memory at boot time and then use the rest as a huge contiguous DMA buffer. It is especially useful during driver and FPGA DMA controller development and is not recommended in production environments. Let's say the PC has 32GB of RAM. If we add mem=20GB to the kernel boot parameter list we can use 12GB as a huge contiguous DMA buffer. To map this memory to user space simply implement mmap() as

remap_pfn_range(vma,
    vma->vm_start,
    (0x500000000 >> PAGE_SHIFT) + vma->vm_pgoff, 
    vma->vm_end - vma->vm_start,
    vma->vm_page_prot)

Of course this 12GB is completely ignored by the OS and can be used only by a process which has mapped it into its address space. We can try to avoid this by using the Contiguous Memory Allocator (CMA).
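
As a sanity check on the constant in that mmap() call, here is a small user-space sketch. It assumes 4 KiB pages and that the unused region starts exactly at the mem= limit; reserved_base_pfn() and mapping_pfn() are made-up helper names.

```c
#include <stdint.h>

#define PAGE_SHIFT 12  /* assumed 4 KiB pages */

/* First page frame number above a mem= limit given in GiB. */
uint64_t reserved_base_pfn(uint64_t mem_limit_gib)
{
    uint64_t base = mem_limit_gib << 30;   /* GiB to bytes */

    return base >> PAGE_SHIFT;
}

/* PFN at which a mapping with page offset pgoff would start. */
uint64_t mapping_pfn(uint64_t mem_limit_gib, uint64_t pgoff)
{
    return reserved_base_pfn(mem_limit_gib) + pgoff;
}
```

With a 20 GiB limit the base address is 0x500000000, matching the value passed to remap_pfn_range() above, and vm_pgoff then selects a page within the reserved region.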

Again, the above tricks will not replace a full Scatter-Gather, zero copy DMA driver, but they are useful during development or on some lower-performance platforms.
