UDP packet drops by the Linux kernel

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must likewise follow CC BY-SA and attribute it to the original authors (not me): StackOverflow. Original question: http://stackoverflow.com/questions/10899937/

linux, udp, multicast, packet-loss

Asked by viktorgt

I have a server which sends UDP packets via multicast and a number of clients which are listening to those multicast packets. Each packet has a fixed size of 1040 bytes, and the total amount of data sent by the server is 3 GB.
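
For context, each receiver in such a setup typically opens a UDP socket, binds to the group port, and joins the multicast group. Below is a minimal sketch, assuming IPv4; the group address 239.0.0.1 and port 12345 are placeholders, not values from the question.

```cpp
// Minimal multicast receiver setup (sketch). Group address and port are
// placeholders; error handling is reduced to perror() for brevity.
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cstdio>

int open_multicast_socket() {
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0) { perror("socket"); return -1; }

    int reuse = 1;  // allow several receivers on the same host/port
    setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &reuse, sizeof(reuse));

    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(12345);                         // placeholder port
    if (bind(fd, (sockaddr*)&addr, sizeof(addr)) < 0) {
        perror("bind"); close(fd); return -1;
    }

    ip_mreq mreq{};
    mreq.imr_multiaddr.s_addr = inet_addr("239.0.0.1");   // placeholder group
    mreq.imr_interface.s_addr = htonl(INADDR_ANY);
    if (setsockopt(fd, IPPROTO_IP, IP_ADD_MEMBERSHIP, &mreq, sizeof(mreq)) < 0) {
        perror("IP_ADD_MEMBERSHIP"); close(fd); return -1;
    }
    return fd;
}
```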

My environment is as follows:

  • 1 Gbit Ethernet network
  • 40 nodes: 1 sender node and 39 receiver nodes. All nodes have the same hardware configuration: 2 AMD CPUs, each with 2 cores @ 2.6 GHz.

On the client side, one thread reads the socket and puts the data into a queue. An additional thread pops the data from the queue and does some lightweight processing.
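
That reader/worker split might look roughly like the sketch below. The names are invented for illustration; only the 1040-byte packet size comes from the question. Note that the lock taken for every packet on the receive path is exactly the kind of per-packet overhead discussed in the answers further down.

```cpp
// Sketch of the described design: a reader thread drains the socket into a
// queue, a worker thread pops and processes. Assumes an already-open fd.
#include <sys/socket.h>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

constexpr size_t kPacketSize = 1040;   // fixed packet size from the question

std::queue<std::vector<char>> packet_queue;
std::mutex queue_mutex;
std::condition_variable queue_cv;

void reader_thread(int fd) {
    std::vector<char> buf(kPacketSize);
    for (;;) {
        ssize_t n = recv(fd, buf.data(), buf.size(), 0);
        if (n <= 0) break;                                  // error or shutdown
        std::lock_guard<std::mutex> lock(queue_mutex);      // per-packet lock
        packet_queue.emplace(buf.begin(), buf.begin() + n);
        queue_cv.notify_one();
    }
}

void worker_thread() {
    for (;;) {
        std::unique_lock<std::mutex> lock(queue_mutex);
        queue_cv.wait(lock, [] { return !packet_queue.empty(); });
        std::vector<char> pkt = std::move(packet_queue.front());
        packet_queue.pop();
        lock.unlock();
        // ... lightweight processing of pkt ...
        (void)pkt;
    }
}
```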

During the multicast transmission I see a packet drop rate of 30% on the receiver nodes. Looking at the netstat -su statistics, I can say that the number of packets missed by the client application equals the RcvbufErrors value in the netstat output.
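
For reference, netstat -su reports these counters from the kernel; on Linux the UDP RcvbufErrors counter also appears in /proc/net/snmp, so it can be polled from the application to watch the drops grow during a run. A rough sketch (it parses the header line instead of assuming a fixed column order, since the set of fields varies between kernel versions):

```cpp
// Sketch: read the UDP RcvbufErrors counter from /proc/net/snmp.
// The first "Udp:" line lists the column names, the second the values.
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

long udp_rcvbuf_errors() {
    std::ifstream snmp("/proc/net/snmp");
    std::string line;
    std::vector<std::string> headers;
    while (std::getline(snmp, line)) {
        if (line.rfind("Udp:", 0) != 0) continue;
        std::istringstream iss(line);
        std::string tok;
        std::vector<std::string> fields;
        while (iss >> tok) fields.push_back(tok);
        if (headers.empty()) {
            headers = fields;                          // column names
        } else {
            for (size_t i = 0; i < headers.size() && i < fields.size(); ++i)
                if (headers[i] == "RcvbufErrors")
                    return std::stol(fields[i]);       // matching value
        }
    }
    return -1;   // counter not found
}

int main() {
    std::cout << "UDP RcvbufErrors: " << udp_rcvbuf_errors() << "\n";
}
```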

That means that all the missing packets are dropped by the OS because the socket buffer was full, but I do not understand why the capturing thread is not able to read the buffer in time. During the transmission, 2 of the 4 cores are at about 75% utilization while the rest are idle. I am the only one using these nodes, and I would assume that machines of this kind have no problem handling 1 Gbit of bandwidth. I have already done some optimization by adding g++ compiler flags for AMD CPUs, which decreased the packet drop rate to 10%, but in my opinion it is still too high.

Of course I know that UDP is not reliable; I have my own correction protocol.

I do not have any administrative permissions, so it is not possible for me to change the system parameters.

Any hints on how I can increase the performance?

EDIT: I solved this issue by using 2 threads that read the socket. The receive socket buffer still fills up sometimes, but the average drop rate is under 1%, so it is not a problem to handle.

Answered by Nikolai Fetissov

Aside from the obvious removal of everything non-essential from the socket read loop:

  • Increase the socket receive buffer with setsockopt(2),
  • Use recvmmsg(2), if your kernel supports it, to reduce the number of system calls and kernel-to-userland copies (a rough sketch of these first two points follows the list),
  • Consider a non-blocking approach with edge-triggered epoll(7),
  • See if you really need threads here; locking/synchronization is very expensive.
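
A rough sketch of the first two suggestions, assuming a UDP socket fd that is already bound and joined to the multicast group. The 4 MB buffer request and the batch size of 32 are arbitrary example values; note that the kernel caps SO_RCVBUF at net.core.rmem_max, which the asker cannot raise without admin rights.

```cpp
// Sketch: grow the receive buffer, then read datagrams in batches with
// recvmmsg(2) to cut per-packet syscall overhead. recvmmsg needs kernel
// >= 2.6.33 and _GNU_SOURCE (defined by default when compiling with g++).
#include <sys/socket.h>
#include <sys/uio.h>
#include <cstdio>
#include <vector>

// Example usage: grow_rcvbuf(fd, 4 * 1024 * 1024);
void grow_rcvbuf(int fd, int bytes) {
    // The kernel silently caps the effective size at net.core.rmem_max.
    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &bytes, sizeof(bytes)) < 0)
        perror("setsockopt(SO_RCVBUF)");
}

// Receive up to `batch` datagrams with a single system call.
// Returns the number of packets appended to `out`, or -1 on error.
int read_batch(int fd, std::vector<std::vector<char>>& out, unsigned batch = 32) {
    std::vector<std::vector<char>> bufs(batch, std::vector<char>(1040));
    std::vector<iovec> iov(batch);
    std::vector<mmsghdr> msgs(batch);
    for (unsigned i = 0; i < batch; ++i) {
        iov[i].iov_base = bufs[i].data();
        iov[i].iov_len  = bufs[i].size();
        msgs[i] = {};
        msgs[i].msg_hdr.msg_iov    = &iov[i];
        msgs[i].msg_hdr.msg_iovlen = 1;
    }
    int n = recvmmsg(fd, msgs.data(), batch, 0, nullptr);
    for (int i = 0; i < n; ++i) {
        bufs[i].resize(msgs[i].msg_len);   // actual bytes received
        out.push_back(std::move(bufs[i]));
    }
    return n;
}
```

The edge-triggered epoll(7) variant is not shown here; it mainly matters once the socket is switched to non-blocking mode and combined with a drain-until-EAGAIN loop.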

Answered by Mars Zhao

"On the client side, one thread reads the socket and put the data into a queue. " I guess the problem is in this thread. It is not receiving messages fast enough. Too much time is spent on something else, for example acquiring mutex when putting data into the queue. Try to optimize operations on the queue, such as use a lock-free queue.

“在客户端,一个线程读取套接字并将数据放入队列中。”我猜问题出在这个线程中。它接收消息的速度不够快。太多时间花在其他事情上,例如在将数据放入队列时获取互斥锁。尝试优化对队列的操作,例如使用无锁队列。
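
One way to take the mutex off the hot path, given exactly one reader thread and one worker thread, is a single-producer/single-consumer ring buffer. Below is a minimal sketch; an off-the-shelf type such as boost::lockfree::spsc_queue would serve the same purpose. The capacity and naming are illustrative.

```cpp
// Minimal single-producer/single-consumer ring buffer (sketch, C++17 for
// std::optional). The reader thread calls push(), the worker calls pop();
// no mutex is taken on either side.
#include <array>
#include <atomic>
#include <cstddef>
#include <optional>

template <typename T, std::size_t Capacity>   // Capacity must be a power of two
class SpscQueue {
public:
    bool push(const T& item) {
        std::size_t head = head_.load(std::memory_order_relaxed);
        std::size_t tail = tail_.load(std::memory_order_acquire);
        if (head - tail == Capacity) return false;       // full
        buf_[head & (Capacity - 1)] = item;
        head_.store(head + 1, std::memory_order_release);
        return true;
    }

    std::optional<T> pop() {
        std::size_t tail = tail_.load(std::memory_order_relaxed);
        std::size_t head = head_.load(std::memory_order_acquire);
        if (tail == head) return std::nullopt;            // empty
        T item = buf_[tail & (Capacity - 1)];
        tail_.store(tail + 1, std::memory_order_release);
        return item;
    }

private:
    std::array<T, Capacity> buf_{};
    std::atomic<std::size_t> head_{0};   // written only by the producer
    std::atomic<std::size_t> tail_{0};   // written only by the consumer
};
```

If push() returns false the queue is full, which is the application-level counterpart of the kernel's RcvbufErrors, so the caller has to decide whether to drop the packet or back off.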

Answered by Joe Damato

Tracking down network drops on Linux can be a bit difficult as there are many components where packet drops can happen. They can occur at the hardware level, in the network device subsystem, or in the protocol layers.

I wrote a very detailed blog post explaining how to monitor and tune each component. It's a bit hard to summarize as a succinct answer here since there are so many different components that need to be monitored and tuned.
