Causes of Linux UDP packet drops
Note: this page is a translated mirror of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same CC BY-SA license, cite the original URL, and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/5913298/
Asked by Matt
I have a Linux C++ application which receives sequenced UDP packets. Because of the sequencing, I can easily determine when a packet is lost or re-ordered, i.e. when a "gap" is encountered. The system has a recovery mechanism to handle gaps, however, it is best to avoid gaps in the first place. Using a simple libpcap-based packet sniffer, I have determined that there are no gaps in the data at the hardware level. However, I am seeing a lot of gaps in my application. This suggests the kernel is dropping packets; it is confirmed by looking at the /proc/net/snmp file. When my application encounters a gap, the Udp InErrors counter increases.
At the system level, we have increased the max receive buffer:
# sysctl net.core.rmem_max
net.core.rmem_max = 33554432
At the application level, we have increased the receive buffer size:
int sockbufsize = 33554432;
int ret = setsockopt(my_socket_fd, SOL_SOCKET, SO_RCVBUF,
                     (char *)&sockbufsize, (int)sizeof(sockbufsize));
// check return code
sockbufsize = 0;
socklen_t size = sizeof(sockbufsize);
ret = getsockopt(my_socket_fd, SOL_SOCKET, SO_RCVBUF,
                 (char *)&sockbufsize, &size);
// print sockbufsize
After the call to getsockopt(), the printed value is always 2x what it is set to (67108864 in the example above), but I believe that is to be expected.
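That is indeed expected: socket(7) documents that the kernel doubles the SO_RCVBUF value to allow space for bookkeeping overhead, and getsockopt() returns the doubled value. As a minimal sketch (not from the original question; the helper name request_rcvbuf is illustrative), one way to verify what was actually granted:

#include <sys/socket.h>
#include <cstdio>

// Request 'bytes' of receive buffer and report what the kernel actually
// granted. Linux doubles the value it stores; requests above
// net.core.rmem_max are clipped before doubling.
static int request_rcvbuf(int fd, int bytes)
{
    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &bytes, sizeof(bytes)) != 0) {
        std::perror("setsockopt(SO_RCVBUF)");
        return -1;
    }
    int granted = 0;
    socklen_t len = sizeof(granted);
    if (getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &granted, &len) != 0) {
        std::perror("getsockopt(SO_RCVBUF)");
        return -1;
    }
    std::printf("requested %d bytes, kernel granted %d\n", bytes, granted);
    return granted;
}

If the granted value is less than twice the request, the setting was clipped by net.core.rmem_max.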
I know that failure to consume data quickly enough can result in packet loss. However, all this application does is check the sequencing, then push the data into a queue; the actual processing is done in another thread. Furthermore, the machine is modern (dual Xeon X5560, 8 GB RAM) and very lightly loaded. We have literally dozens of identical applications receiving data at a much higher rate that do not experience this problem.
Besides a too-slow consuming application, are there other reasons why the Linux kernel might drop UDP packets?
FWIW, this is on CentOS 4, with kernel 2.6.9-89.0.25.ELlargesmp.
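(Not from the original question, but useful for confirming where the drops are counted: a minimal sketch that reads the Udp line of /proc/net/snmp and prints each counter by name, since the exact set of columns, e.g. whether RcvbufErrors is present, varies between kernel versions.)

#include <fstream>
#include <sstream>
#include <string>
#include <iostream>

// Print every field of the "Udp:" line in /proc/net/snmp, e.g.
// InDatagrams, NoPorts, InErrors (and RcvbufErrors on newer kernels).
int main()
{
    std::ifstream snmp("/proc/net/snmp");
    std::string header, values, line;
    while (std::getline(snmp, line)) {
        if (line.compare(0, 4, "Udp:") != 0)
            continue;
        if (header.empty())
            header = line;   // first Udp: line holds the column names
        else
            values = line;   // second Udp: line holds the counters
    }

    std::istringstream h(header), v(values);
    std::string name, value;
    h >> name;  // skip the "Udp:" prefix
    v >> value;
    while (h >> name && v >> value)
        std::cout << name << " = " << value << '\n';
    return 0;
}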
Accepted answer by racic
I had a similar problem with my program. Its task is to receive udp packets in one thread and, using a blocking queue, write them to the database with another thread.
I noticed (using vmstat 1) that when the system was experiencing heavy I/O wait operations (reads), my application didn't receive packets; they were being received by the system, though.
The problem was: when heavy I/O wait occurred, the thread that was writing to the database was being I/O starved while holding the queue mutex. This way the UDP buffer was overflowing with incoming packets, because the main thread that was receiving them was hanging on pthread_mutex_lock().
I resolved it by playing with the I/O niceness (the ionice command) of my process and the database process. Changing the I/O scheduling class to Best Effort helped. Surprisingly, I'm not able to reproduce this problem now even with the default I/O niceness.
My kernel is 2.6.32-71.el6.x86_64.
I'm still developing this app so I'll try to update my post once I know more.
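(Not from the original answer: a minimal sketch of applying the same fix from inside the process instead of via the ionice command. Older glibc has no ioprio_set wrapper, so the raw syscall is used and the constants are copied from linux/ioprio.h; verify them against your own kernel headers.)

#include <sys/syscall.h>
#include <unistd.h>
#include <cstdio>

// Constants from linux/ioprio.h (assumed; check your kernel headers).
#define IOPRIO_CLASS_SHIFT 13
#define IOPRIO_CLASS_BE    2   // best-effort scheduling class
#define IOPRIO_WHO_PROCESS 1

// Roughly equivalent to: ionice -c 2 -n <level> -p <this pid>
static int set_best_effort_io(int level /* 0 = highest, 7 = lowest */)
{
    int ioprio = (IOPRIO_CLASS_BE << IOPRIO_CLASS_SHIFT) | level;
    if (syscall(SYS_ioprio_set, IOPRIO_WHO_PROCESS, getpid(), ioprio) != 0) {
        std::perror("ioprio_set");
        return -1;
    }
    return 0;
}

int main()
{
    return set_best_effort_io(0) == 0 ? 0 : 1;
}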
Answered by dgq7
int ret = setsockopt(my_socket_fd, SOL_SOCKET, SO_RCVBUF, (char *)&sockbufsize, (int)sizeof(sockbufsize));
First of all, setsockopt takes (int, int, int, const void *, socklen_t), so no casts are required.
Using a simple libpcap-based packet sniffer, I have determined that there are no gaps in the data at the hardware level. However, I am seeing a lot of gaps in my application. This suggests the kernel is dropping packets;
It suggests that your environment is not fast enough. Packet capturing is known to be processing-intensive, and you will observe that the global rate of transmission on an interface drops once you start capture programs such as iptraf-ng or tcpdump on it.
Answered by Steve-o
If you have more threads than cores and equal thread priority between them it is likely that the receiving thread is starved for time to flush the incoming buffer. Consider running that thread at a higher priority level than the others.
Similarly, although it is often less productive, you can bind the receiving thread to one core so that you do not suffer the overhead of switching between cores and the associated cache flushes.
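(Not from the original answer: a minimal sketch of both suggestions using pthread_setschedparam and the GNU extension pthread_setaffinity_np. SCHED_FIFO needs root or CAP_SYS_NICE, and the core number to pin to is an assumption to adapt.)

#include <pthread.h>
#include <sched.h>
#include <cstdio>
#include <cstring>

// Raise the receiving thread's priority and pin it to one core. Call this
// from the thread that reads from the socket. Compile with g++ and -pthread;
// g++ defines _GNU_SOURCE, which pthread_setaffinity_np needs.
static void tune_receiver_thread(int core)
{
    // SCHED_FIFO puts this thread above all normal (SCHED_OTHER) threads;
    // it requires root or CAP_SYS_NICE.
    sched_param sp = {};
    sp.sched_priority = sched_get_priority_min(SCHED_FIFO);
    int rc = pthread_setschedparam(pthread_self(), SCHED_FIFO, &sp);
    if (rc != 0)
        std::fprintf(stderr, "pthread_setschedparam: %s\n", std::strerror(rc));

    // Pin to one core to avoid migrations and the cache flushes they cause.
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    rc = pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    if (rc != 0)
        std::fprintf(stderr, "pthread_setaffinity_np: %s\n", std::strerror(rc));
}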
Answered by dfreese
I don't have enough reputation to comment, but similar to @racic, I had a program where I had one receive thread, and one processing thread with a blocking queue between them. I noticed the same issue with dropping packets because the receiving thread was waiting for a lock on the blocking queue.
To resolve this I added a smaller local buffer to the receiving thread, and had it only push data into the queue when it wasn't locked (using std::mutex::try_lock).
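(Not from the original answer: a minimal sketch of that pattern. The names SharedQueue and Receiver are illustrative; the receive path never blocks on the shared mutex, and datagrams accumulate in a thread-local buffer whenever try_lock fails.)

#include <deque>
#include <mutex>
#include <string>
#include <utility>

// Queue shared between the receiving thread and the processing thread.
struct SharedQueue {
    std::mutex mtx;
    std::deque<std::string> items;
};

class Receiver {
public:
    explicit Receiver(SharedQueue& q) : queue_(q) {}

    // Called for each datagram read from the socket.
    void on_datagram(std::string payload)
    {
        local_.push_back(std::move(payload));

        // Only touch the shared queue if that cannot block; otherwise keep
        // buffering locally and try again on the next datagram.
        if (queue_.mtx.try_lock()) {
            for (std::string& p : local_)
                queue_.items.push_back(std::move(p));
            local_.clear();
            queue_.mtx.unlock();
        }
    }

private:
    SharedQueue& queue_;
    std::deque<std::string> local_;  // the smaller local buffer from the answer
};

In a real blocking queue the receiver would also notify a condition variable after a successful flush so the processing thread wakes up promptly.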