Java: What can cause TCP/IP to drop packets without dropping the connection?
Note: this page is a translated copy of a popular StackOverFlow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverFlow
Original URL: http://stackoverflow.com/questions/787415/
Asked by Eddie
I have a web-based application and a client, both written in Java. For what it's worth, the client and server are both on Windows. The client issues HTTP GETs via Apache HttpClient. The server blocks for up to a minute and if no messages have arrived for the client within that minute, the server returns HTTP 204 No Content. Otherwise, as soon as a message is ready for the client, it is returned with the body of an HTTP 200 OK.
Here is what has me puzzled: Intermittently for a specific subset of clients -- always clients with demonstrably flaky network connections -- the client issues a GET, the server receives and processes the GET, but the client waits forever. Enabling debugging logs for the client, I see that HttpClient is still waiting for the very first line of the response.
There is no Exception thrown on the server, at least nothing logged anywhere, not by Tomcat, not by my webapp. According to debugging logs, there is every sign that the server successfully responded to the client. However, the client shows no sign of having received anything. The client hangs indefinitely in HttpClient.executeMethod. This becomes obvious after the session times out and the client takes action that causes another Thread to issue an HTTP POST. Of course, the POST fails because the session has expired. In some cases, hours have elapsed between the session expiring and the client issuing a POST and discovering this fact. For this entire time, executeMethod is still waiting for the HTTP response line.
When I use WireShark to see what is really going on at the wire level, this failure does not occur. That is, this failure will occur within a few hours for specific clients, but when WireShark is running at both ends, these same clients will run overnight, 14 hours, without a failure.
Has anyone else encountered something like this? What in the world can cause it? I thought that TCP/IP guaranteed packet delivery even across short term network glitches. If I set an SO_TIMEOUT and immediately retry the request upon timeout, the retry always succeeds. (Of course, I first abort the timed-out request and release the connection to ensure that a new socket will be used.)
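To make the SO_TIMEOUT behaviour described above concrete, here is a minimal sketch using plain JDK sockets rather than Apache HttpClient. The class and method names (`ReadTimeoutSketch`, `timesOutOnSilentPeer`) are made up for illustration, and the "server" is just a local listening socket that accepts the connection at the TCP level but never sends a byte, which roughly imitates a response that never arrives.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

final class ReadTimeoutSketch {

    /**
     * Connects to a local peer that completes the TCP handshake but never
     * replies, then reads with SO_TIMEOUT set.
     * Returns true if the read gave up with a SocketTimeoutException.
     */
    static boolean timesOutOnSilentPeer(int timeoutMillis) {
        try (ServerSocket silentPeer = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client = new Socket(silentPeer.getInetAddress(), silentPeer.getLocalPort())) {
            // A timeout of 0 (the default) means "block forever" on read.
            client.setSoTimeout(timeoutMillis);
            InputStream in = client.getInputStream();
            try {
                in.read(); // without SO_TIMEOUT this call would never return
                return false;
            } catch (SocketTimeoutException expected) {
                // The socket itself is still open at this point; the caller
                // decides whether to close it and retry on a new connection.
                return true;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("timed out: " + timesOutOnSilentPeer(500));
    }
}
```

The point of the sketch is only that a blocked read has no way of noticing a lost response by itself; the timeout is what turns "hang forever" into an exception the application can act on.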
Thoughts? Ideas? Is there some TCP/IP setting available to Java or a registry setting in Windows that will enable more aggressive TCP/IP retries on lost packets?
Accepted answer by Gary
Are you absolutely sure that the server has successfully sent the response to the clients that seem to fail? By this I mean the server has sent the response and the client has ack'ed that response back to the server. You should see this using wireshark on the server side. If you are sure this has occurred on the server side and the client still does not see anything, you need to look further up the chain from the server. Are there any proxy/reverse proxy servers or NAT involved?
TCP is considered a reliable transport protocol, but it does not guarantee delivery. The TCP/IP stack of your OS will try pretty hard to get packets to the other end using TCP retransmissions. You should see these in wireshark on the server side if this is happening. If you see excessive TCP retransmissions, it is usually a network infrastructure issue, i.e. bad or misconfigured hardware/interfaces. TCP retransmission works well for short network interruptions, but performs poorly on a network with a longer interruption. This is because the TCP/IP stack will only send a retransmission after a timer expires, and this timer typically doubles after each unsuccessful retransmission. This is by design, to avoid flooding an already problematic network with retransmissions. As you might imagine, this usually causes applications all sorts of timeout issues.
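The doubling timer is easier to appreciate with numbers. The sketch below is a simplified model, not how any particular TCP stack is implemented: real stacks derive the initial timeout from measured round-trip times and cap the maximum. The class name, method name, and the values 3000 ms and 5 retransmissions are illustrative assumptions only.

```java
import java.util.ArrayList;
import java.util.List;

final class RetransmitBackoffSketch {

    /**
     * Returns the elapsed time (milliseconds since the original send) at which
     * each retransmission would fire if the timer doubles after every failure.
     */
    static List<Long> retransmitTimes(long initialTimeoutMillis, int maxRetransmits) {
        List<Long> times = new ArrayList<>();
        long timeout = initialTimeoutMillis;
        long elapsed = 0;
        for (int i = 0; i < maxRetransmits; i++) {
            elapsed += timeout; // timer expires, the segment is sent again
            times.add(elapsed);
            timeout *= 2;       // back off before the next attempt
        }
        return times;
    }

    public static void main(String[] args) {
        // Illustrative values only: 3 s initial timeout, 5 retransmissions.
        System.out.println(retransmitTimes(3000, 5));
        // prints [3000, 9000, 21000, 45000, 93000]
    }
}
```

Under these assumed values the fifth retransmission only goes out about a minute and a half after the original send, which is why an outage of more than a few seconds can easily outlast an application's patience.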
Depending on your network topology, you may also need to place probes/wireshark/tcpdump at other intermediate locations in the network. This will probably take some time to find out where the packets have gone.
If I were you I would keep monitoring with wireshark on all ends until the problem re-occurs. It most likely will. But it sounds like what you will ultimately find is what you already mentioned: flaky hardware. If fixing the flaky hardware is out of the question, you may need to just build in extra application-level timeouts and retries to attempt to deal with the issue in software. It sounds like you have started going down this path.
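One possible shape for such application-level retries is sketched below. It is deliberately independent of any HTTP library: `IoRequest` and `withRetries` are hypothetical names, and the "request" in `main` is a stand-in that simulates two read timeouts before succeeding. In a real client each attempt would need to abort the timed-out request and open a fresh connection, as the question already does.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.SocketTimeoutException;

final class RetrySketch {

    /** Hypothetical abstraction of "send one request and return its result". */
    @FunctionalInterface
    interface IoRequest<T> {
        T send() throws IOException;
    }

    /**
     * Runs the request up to maxAttempts times, retrying only on IOException
     * (SocketTimeoutException is a subclass). If every attempt fails, the
     * last failure is rethrown wrapped in an UncheckedIOException.
     */
    static <T> T withRetries(IoRequest<T> request, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        IOException lastFailure = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return request.send();
            } catch (IOException e) {
                lastFailure = e; // remember it and try again
            }
        }
        throw new UncheckedIOException("all " + maxAttempts + " attempts failed", lastFailure);
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Simulated request: times out twice, then succeeds.
        String result = withRetries(() -> {
            if (++calls[0] < 3) {
                throw new SocketTimeoutException("simulated read timeout");
            }
            return "200 OK";
        }, 5);
        System.out.println(result + " after " + calls[0] + " attempts");
        // prints "200 OK after 3 attempts"
    }
}
```

Retrying blindly is only safe when the request can be repeated without side effects; for a long-poll GET that hands out messages, the server may also need a way to redeliver a message whose response was lost.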
Answer by Simeon Pilgrim
If you are using long running GETs, you should timeout on the client side at twice the server timeout, as you have discovered.
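As a rough illustration of the "client timeout is twice the server hold time" rule, the sketch below runs a tiny long-poll round trip on the local machine. It uses the JDK's built-in `com.sun.net.httpserver.HttpServer` and `HttpURLConnection` instead of Tomcat and Apache HttpClient, and the names `LongPollTimeoutSketch`, `pollOnce`, and the `/poll` path are invented for the example. The server holds the request for a while and then answers 204 No Content, as in the question.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URI;
import java.net.URL;

final class LongPollTimeoutSketch {

    /**
     * Starts a local server that holds each request for serverHoldMillis and
     * then returns 204, polls it once with a read timeout of twice that hold
     * time, and returns the HTTP status code the client saw.
     */
    static int pollOnce(int serverHoldMillis) {
        try {
            HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
            server.createContext("/poll", exchange -> {
                try {
                    Thread.sleep(serverHoldMillis); // stand-in for "wait for a message"
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                exchange.sendResponseHeaders(204, -1); // -1: no response body
                exchange.close();
            });
            server.start();
            try {
                URL url = URI.create(
                        "http://127.0.0.1:" + server.getAddress().getPort() + "/poll").toURL();
                HttpURLConnection conn = (HttpURLConnection) url.openConnection();
                conn.setConnectTimeout(1000);
                // Client waits twice as long as the server is expected to hold.
                conn.setReadTimeout(2 * serverHoldMillis);
                try {
                    return conn.getResponseCode();
                } finally {
                    conn.disconnect();
                }
            } finally {
                server.stop(0);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("status: " + pollOnce(200));
    }
}
```

With the one-minute server hold from the question, the same rule would give a client read timeout of about two minutes: long enough that a healthy long poll never trips it, short enough that a lost response is noticed quickly.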
On a TCP connection where the client sends a message and expects a response, if the server were to crash and restart (let's say, for the sake of example), then the client would still be waiting on the socket for a response from the server, yet the server is no longer listening on that socket.
The client will only discover the socket is closed on the server end once it sends more data on that socket, and the server rejects this new data, and closes the socket.
This is why you should have client side time-outs on requests.
But your server is not crashing. If the server is multi-threaded and the thread handling that client closes its socket while the client is in a connectivity outage (lasting, say, minutes), then the closing handshake may be lost; and since you are not sending any more data from the client to the server, your client is once again left hanging. This would tie in with your observation about flaky connections.
Answer by Lawrence Dol
Forgetting to flush or close the socket on the host side can intermittently have this effect for short responses, depending on timing, which could in turn be affected by the presence of any monitoring mechanism.
Especially forgetting to close will leave the socket dangling until GC gets around to reclaiming it and calls finalize().
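For code that writes to raw sockets (as opposed to a servlet, where the container owns the connection), one way to rule this out is to flush explicitly and let try-with-resources close the socket. The following is a minimal loopback sketch; `FlushAndCloseSketch`, `respond`, and `roundTrip` are names chosen for the example, and the single-threaded round trip only works because the response is short enough to fit in the socket buffers.

```java
import java.io.BufferedOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

final class FlushAndCloseSketch {

    /** Writes a short response, then flushes and closes deterministically. */
    static void respond(Socket connection, String body) throws IOException {
        try (Socket s = connection;
             OutputStream out = new BufferedOutputStream(s.getOutputStream())) {
            out.write(body.getBytes(StandardCharsets.UTF_8));
            out.flush(); // push buffered bytes to the socket now
        } // stream and socket are closed here, not whenever GC gets to them
    }

    /** Sends body over a loopback connection and returns what the client read. */
    static String roundTrip(String body) {
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client = new Socket(server.getInetAddress(), server.getLocalPort())) {
            client.setSoTimeout(2000); // never hang the demo on a missing close
            respond(server.accept(), body);
            ByteArrayOutputStream received = new ByteArrayOutputStream();
            InputStream in = client.getInputStream();
            byte[] buf = new byte[256];
            // Reading until end-of-stream only terminates because the
            // responder closed its side of the connection.
            for (int n; (n = in.read(buf)) != -1; ) {
                received.write(buf, 0, n);
            }
            return new String(received.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("HTTP/1.1 204 No Content\r\n\r\n").trim());
    }
}
```

If the `flush()` and the close were omitted, the bytes could sit in the `BufferedOutputStream` and the reader would wait until its timeout instead of seeing end-of-stream.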
Answer by D.Shawley
I haven't seen this one per se, but I have seen similar problems with large UDP datagrams causing IP fragmentation, which led to congestion and ultimately dropped Ethernet frames. Since this is TCP/IP, I wouldn't expect IP fragmentation to be a large issue, since TCP is a stream-based protocol.
One thing that I will note is that TCP does not guarantee delivery! It can't. What it does guarantee is that if you send byte A followed by byte B, then you will never receive byte B before you have received byte A.
With that said, I would connect the client machine and a monitoring machine to a hub. Run Wireshark on the monitoring machine and you should be able to see what is going on. I did run into problems related to both whitespace handling between HTTP requests and incorrect HTTP chunk sizes. Both issues were due to a hand-written HTTP stack, so this is only a problem if you are using a flaky stack.
Answer by Peter Lawrey
If you are losing data, it is most likely due to a software bug, either in the reading or writing library.
Answer by BarrettJ
Could these computers have a virus/malware installed? Installing wireshark also installs winpcap (http://www.winpcap.org/), which may be overriding the changes the malware made (or the malware may simply detect that it is being monitored and not attempt anything fishy).

