Java High-load NIO TCP Server

Disclaimer: this page is an edited English version of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/17556901/

Date: 2020-11-01 02:18:36 · Source: igfitidea

Java High-load NIO TCP server

Tags: java, tcp, nio, high-load

Asked by Juriy

As a part of my research I'm writing a high-load TCP/IP echo server in Java. I want to serve about 3-4k clients and see the maximum possible messages per second that I can squeeze out of it. Message size is quite small - up to 100 bytes. This work doesn't have any practical purpose - it's only research.

According to numerous presentations that I've seen (HornetQ benchmarks, LMAX Disruptor talks, etc.), real-world high-load systems tend to serve millions of transactions per second (I believe the Disruptor talk mentioned about 6 million, and HornetQ about 8.5). For example, this post states that it is possible to achieve up to 40M MPS. So I took it as a rough estimate of what modern hardware should be capable of.

I wrote the simplest single-threaded NIO server and launched a load test. I was a little surprised that I could get only about 100k MPS on localhost and 25k over actual networking. The numbers look quite small. I was testing on Win7 x64, Core i7. Looking at CPU load, only one core is busy (which is expected for a single-threaded app) while the rest sit idle. However, even if I loaded all 8 cores (including virtual ones) I would have no more than 800k MPS - not even close to 40 million :)

My question is: what is a typical pattern for serving massive numbers of messages to clients? Should I distribute the networking load over several different sockets inside a single JVM and use some sort of load balancer like HAProxy to distribute the load to multiple cores? Or should I look towards using multiple Selectors in my NIO code? Or maybe even distribute the load between multiple JVMs and use Chronicle to build inter-process communication between them? Will testing on a proper server-side OS like CentOS make a big difference (maybe it is Windows that slows things down)?

Below is the sample code of my server. It always answers "ok" to any incoming data. I know that in the real world I'd need to track the size of the message and be prepared for one message to be split between multiple reads; however, I'd like to keep things super-simple for now.

import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.StandardSocketOptions;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.nio.channels.spi.SelectorProvider;
import java.util.Iterator;

public class EchoServer {

private static final int BUFFER_SIZE = 1024;
private final static int DEFAULT_PORT = 9090;

// The buffer into which we'll read data when it's available
private ByteBuffer readBuffer = ByteBuffer.allocate(BUFFER_SIZE);

private InetAddress hostAddress = null;

private int port;
private Selector selector;

private long loopTime;
private long numMessages = 0;

public EchoServer() throws IOException {
    this(DEFAULT_PORT);
}

public EchoServer(int port) throws IOException {
    this.port = port;
    selector = initSelector();
    loop();
}

private void loop() {
    while (true) {
        try{
            selector.select();
            Iterator<SelectionKey> selectedKeys = selector.selectedKeys().iterator();
            while (selectedKeys.hasNext()) {
                SelectionKey key = selectedKeys.next();
                selectedKeys.remove();

                if (!key.isValid()) {
                    continue;
                }

                // Check what event is available and deal with it
                if (key.isAcceptable()) {
                    accept(key);
                } else if (key.isReadable()) {
                    read(key);
                } else if (key.isWritable()) {
                    write(key);
                }
            }

        } catch (Exception e) {
            e.printStackTrace();
            System.exit(1);
        }
    }
}

private void accept(SelectionKey key) throws IOException {
    ServerSocketChannel serverSocketChannel = (ServerSocketChannel) key.channel();

    SocketChannel socketChannel = serverSocketChannel.accept();
    socketChannel.configureBlocking(false);
    socketChannel.setOption(StandardSocketOptions.SO_KEEPALIVE, true);
    socketChannel.setOption(StandardSocketOptions.TCP_NODELAY, true);
    socketChannel.register(selector, SelectionKey.OP_READ);

    System.out.println("Client is connected");
}

private void read(SelectionKey key) throws IOException {
    SocketChannel socketChannel = (SocketChannel) key.channel();

    // Clear out our read buffer so it's ready for new data
    readBuffer.clear();

    // Attempt to read off the channel
    int numRead;
    try {
        numRead = socketChannel.read(readBuffer);
    } catch (IOException e) {
        key.cancel();
        socketChannel.close();

        System.out.println("Forceful shutdown");
        return;
    }

    if (numRead == -1) {
        System.out.println("Graceful shutdown");
        key.channel().close();
        key.cancel();

        return;
    }

    socketChannel.register(selector, SelectionKey.OP_WRITE);

    numMessages++;
    if (numMessages%100000 == 0) {
        long elapsed = System.currentTimeMillis() - loopTime;
        loopTime = System.currentTimeMillis();
        System.out.println(elapsed);
    }
}

private void write(SelectionKey key) throws IOException {
    SocketChannel socketChannel = (SocketChannel) key.channel();
    ByteBuffer dummyResponse = ByteBuffer.wrap("ok".getBytes("UTF-8"));

    socketChannel.write(dummyResponse);
    if (dummyResponse.remaining() > 0) {
        System.err.print("Filled UP");
    }

    key.interestOps(SelectionKey.OP_READ);
}

private Selector initSelector() throws IOException {
    Selector socketSelector = SelectorProvider.provider().openSelector();

    ServerSocketChannel serverChannel = ServerSocketChannel.open();
    serverChannel.configureBlocking(false);

    InetSocketAddress isa = new InetSocketAddress(hostAddress, port);
    serverChannel.socket().bind(isa);
    serverChannel.register(socketSelector, SelectionKey.OP_ACCEPT);
    return socketSelector;
}

public static void main(String[] args) throws IOException {
    System.out.println("Starting echo server");
    new EchoServer();
}
}

Answer by Rajiv

what is a typical pattern for serving massive amounts of messages to clients?

There are many possible patterns: An easy way to utilize all cores without going through multiple jvms is:

  1. Have a single thread accept connections and read using a selector.
  2. Once you have enough bytes to constitute a single message, pass it on to another core using a construct like a ring buffer. The Disruptor Java framework is a good match for this. This is a good pattern if the processing needed to recognize a complete message is lightweight. For example, if you have a length-prefixed protocol you could wait till you get the expected number of bytes and then send it to another thread. If the parsing of the protocol is very heavy then you might overwhelm this single thread, preventing it from accepting connections or reading bytes off the network.
  3. On your worker thread(s), which receive data from a ring buffer, do the actual processing.
  4. You write out the responses either on your worker threads or through some other aggregator thread.
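The handoff in steps 1-2 can be sketched with a bounded queue standing in for the ring buffer (a real build would swap in the Disruptor; the capacity and the poison-pill shutdown are illustrative assumptions):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class HandoffSketch {
    // Bounded queue standing in for a Disruptor ring buffer (capacity is illustrative).
    static final BlockingQueue<byte[]> ring = new ArrayBlockingQueue<>(1 << 16);

    public static void main(String[] args) throws InterruptedException {
        // Worker thread: drains complete messages and does the actual processing,
        // keeping the selector thread free to accept connections and read bytes.
        Thread worker = new Thread(() -> {
            try {
                while (true) {
                    byte[] msg = ring.take();
                    if (msg.length == 0) break; // empty message used as a poison pill
                    // ... process msg here, off the selector thread
                }
            } catch (InterruptedException ignored) { }
        });
        worker.start();

        // Selector-thread side: once a complete message has been assembled,
        // hand it off instead of processing it inline.
        ring.put("hello".getBytes());
        ring.put(new byte[0]); // shut the worker down
        worker.join();
        System.out.println("handoff done"); // prints once the worker has drained the queue
    }
}
```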

That's the gist of it. There are many more possibilities here and the answer really depends on the type of application you are writing. A few examples are:

  1. A CPU-heavy stateless application, say an image processing application. The amount of CPU/GPU work done per request will probably be significantly higher than the overhead generated by a very naive inter-thread communication solution. In this case an easy solution is a bunch of worker threads pulling work from a single queue. Notice how this is a single queue instead of one queue per worker. The advantage is that this is inherently load balanced. Each worker finishes its work and then just polls the single-producer multiple-consumer queue. Even though this is a source of contention, the image-processing work (seconds?) should be far more expensive than any synchronization alternative.
  2. A pure IO application, e.g. a stats server which just increments some counters for each request: here you do almost no CPU-heavy work. Most of the work is just reading bytes and writing bytes. A multi-threaded application might not give you a significant benefit here. In fact it might even slow things down if the time it takes to queue items is more than the time it takes to process them. A single-threaded Java server should be able to saturate a 1G link easily.
  3. Stateful applications which require moderate amounts of processing, e.g. a typical business application: here every client has some state that determines how each request is handled. Assuming we go multi-threaded since the processing is non-trivial, we could affinitize clients to certain threads. This is a variant of the actor architecture:

    i) When a client first connects, hash it to a worker. You might want to do this with some client id, so that if it disconnects and reconnects it is still assigned to the same worker/actor.

    ii) When the reader thread reads a complete request put it on the ring-buffer for the right worker/actor. Since the same worker always processes a particular client all the state should be thread local making all the processing logic simple and single-threaded.

    iii) The worker thread can write requests out. Always attempt to just do a write(). If all your data could not be written out, only then do you register for OP_WRITE. The worker thread only needs to make select calls if there is actually something outstanding. Most writes should just succeed, making this unnecessary. The trick here is balancing between select calls and polling the ring buffer for more requests. You could also employ a single writer thread whose only responsibility is to write requests out. Each worker thread can put its responses on a ring buffer connecting it to this single writer thread. The single writer thread round-robin polls each incoming ring buffer and writes out the data to clients. Again, the caveat about attempting the write before select applies, as does the trick about balancing between multiple ring buffers and select calls.

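Step i) boils down to a stable mapping from client id to worker index; a minimal sketch (the worker count and string-typed client id are assumptions):

```java
public class WorkerAffinity {
    static final int WORKERS = 4; // assumed worker/actor count

    // Stable hash: the same client id always lands on the same worker,
    // so per-client state stays thread-local to that worker.
    static int workerFor(String clientId) {
        return Math.floorMod(clientId.hashCode(), WORKERS);
    }

    public static void main(String[] args) {
        int w = workerFor("client-42");
        // A reconnecting client maps back to the same worker.
        System.out.println(w == workerFor("client-42")); // true
    }
}
```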

As you point out there are many other options:

Should I distribute networking load over several different sockets inside a single JVM and use some sort of load balancer like HAProxy to distribute load to multiple cores?

You can do this, but IMHO it is not the best use for a load balancer. It does buy you independent JVMs that can fail on their own, but it will probably be slower than writing a single multi-threaded JVM app. The application itself might be easier to write, though, since it will be single-threaded.

Or should I look towards using multiple Selectors in my NIO code?

You can do this too. Look at the Nginx architecture for some hints on how to do this.

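As a rough sketch of the multiple-selectors idea (the names and the registration-queue mechanics are assumptions, not Nginx's actual design): run one selector/event loop per core and have the accept thread hand each new connection to a worker loop. Note the `wakeup()` call - `register()` from another thread would otherwise block against a selector parked in `select()`:

```java
import java.io.IOException;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.SocketChannel;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// One selector/event loop per core; the accept thread hands each new
// connection to a worker loop round-robin (hypothetical skeleton).
public class WorkerLoop implements Runnable {
    private final Selector selector;
    private final Queue<SocketChannel> pending = new ConcurrentLinkedQueue<>();

    public WorkerLoop() throws IOException {
        selector = Selector.open();
    }

    // Called from the accept thread. The wakeup() breaks the selector out of
    // select() so the new channel can be registered on the loop's own thread.
    void adopt(SocketChannel ch) {
        pending.add(ch);
        selector.wakeup();
    }

    @Override
    public void run() {
        try {
            while (!Thread.currentThread().isInterrupted()) {
                selector.select();
                // Register any freshly adopted channels on this loop's thread.
                SocketChannel ch;
                while ((ch = pending.poll()) != null) {
                    ch.configureBlocking(false);
                    ch.register(selector, SelectionKey.OP_READ);
                }
                // ... iterate selectedKeys() and read/write as in EchoServer
                selector.selectedKeys().clear();
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
```

The accept thread would keep an array of these loops and call `adopt()` on `loops[next++ % loops.length]` for each accepted channel.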

Or maybe even distribute the load between multiple JVMs and use Chronicle to build inter-process communication between them?

This is also an option. Chronicle gives you the advantage that memory-mapped files are more resilient to a process quitting in the middle. You still get plenty of performance since all communication is done through shared memory.

Will testing on a proper serverside OS like CentOS make a big difference (maybe it is Windows that slows things down)?

I don't know about this. Unlikely. If Java uses the native Windows APIs to the fullest, it shouldn't matter as much. I am highly doubtful of the 40 million transactions/sec figure (without a user-space networking stack + UDP), but the architectures I listed should do pretty well.

These architectures tend to do well since they are single-writer architectures that use bounded, array-based data structures for inter-thread communication. Determine whether multi-threading is even the answer: in many cases it is not needed and can lead to slowdown.

Another area to look into is memory allocation schemes. Specifically, the strategy used to allocate and reuse buffers could lead to significant benefits. The right buffer reuse strategy is application-dependent. Look at schemes like buddy memory allocation, arena allocation, etc. to see if they can benefit you. The JVM GC does fine for most workloads, though, so always measure before you go down this route.

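As one illustration of buffer reuse (a deliberately simple free-list, not a buddy or arena allocator; the size and coarse synchronization are assumptions):

```java
import java.nio.ByteBuffer;
import java.util.ArrayDeque;

public class BufferPool {
    private final ArrayDeque<ByteBuffer> free = new ArrayDeque<>();
    private final int bufferSize;

    BufferPool(int bufferSize) {
        this.bufferSize = bufferSize;
    }

    // Reuse a buffer if one is free, otherwise allocate; this avoids
    // producing per-message garbage on the hot path.
    synchronized ByteBuffer acquire() {
        ByteBuffer b = free.poll();
        return b != null ? b : ByteBuffer.allocateDirect(bufferSize);
    }

    synchronized void release(ByteBuffer b) {
        b.clear();
        free.push(b);
    }

    public static void main(String[] args) {
        BufferPool pool = new BufferPool(1024);
        ByteBuffer a = pool.acquire();
        pool.release(a);
        System.out.println(a == pool.acquire()); // true: the buffer was reused
    }
}
```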

Protocol design has a big effect on performance too. I tend to prefer length-prefixed protocols because they let you allocate buffers of the right size, avoiding lists of buffers and/or buffer merging. Length-prefixed protocols also make it easy to decide when to hand over a request - just check num bytes == expected. The actual parsing can be done by the worker threads. Serialization and deserialization extend beyond length-prefixed protocols. Patterns like flyweights over buffers, instead of allocations, help here. Look at SBE for some of these principles.

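The num bytes == expected check for a length-prefixed protocol can be sketched like this (a 4-byte big-endian length prefix is assumed):

```java
import java.nio.ByteBuffer;

public class LengthPrefixFraming {
    // Returns the payload if the buffer holds a complete [int length][payload]
    // frame, otherwise null; the position only advances on a complete frame.
    static byte[] tryReadFrame(ByteBuffer buf) {
        if (buf.remaining() < 4) return null;       // length prefix not yet here
        int len = buf.getInt(buf.position());       // peek without consuming
        if (buf.remaining() < 4 + len) return null; // payload still incomplete
        buf.getInt();                               // consume the prefix
        byte[] payload = new byte[len];
        buf.get(payload);
        return payload;
    }

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.allocate(64);
        buf.putInt(2).put((byte) 'o').put((byte) 'k');
        buf.flip();
        System.out.println(new String(tryReadFrame(buf))); // ok
    }
}
```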

As you can imagine, an entire treatise could be written here. This should set you in the right direction. Warning: always measure and make sure you need more performance than the simplest option gives you. It's easy to get sucked into a never-ending black hole of performance improvements.

Answer by user207421

Your logic around writing is faulty. You should attempt the write immediately when you have data to write. If the write() returns zero, it is then time to register for OP_WRITE, retry the write when the channel becomes writable, and deregister OP_WRITE when the write has succeeded. You're adding a massive amount of latency here. You're adding even more latency by deregistering OP_READ while you're doing all that.

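A sketch of the write path this answer describes (the `send` helper and the use of a `Pipe` in `main` are illustrative, not the answerer's code): attempt the write immediately, register OP_WRITE only when the channel could not take everything, and clear it once the data is flushed:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.Pipe;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.WritableByteChannel;

public class WriteFirst {
    // Write immediately; fall back to OP_WRITE only if the kernel buffer is
    // full (partial or zero write), and deregister OP_WRITE once drained.
    static void send(SelectionKey key, ByteBuffer data) throws IOException {
        WritableByteChannel ch = (WritableByteChannel) key.channel();
        ch.write(data);
        if (data.hasRemaining()) {
            key.interestOps(key.interestOps() | SelectionKey.OP_WRITE);  // retry when writable
        } else {
            key.interestOps(key.interestOps() & ~SelectionKey.OP_WRITE); // nothing pending
        }
    }

    public static void main(String[] args) throws IOException {
        // A Pipe stands in for a client connection so the sketch is self-contained.
        Pipe pipe = Pipe.open();
        pipe.sink().configureBlocking(false);
        Selector selector = Selector.open();
        SelectionKey key = pipe.sink().register(selector, 0);

        send(key, ByteBuffer.wrap("ok".getBytes()));
        // The small write succeeded immediately, so OP_WRITE never became necessary.
        System.out.println((key.interestOps() & SelectionKey.OP_WRITE) == 0); // true
        selector.close();
    }
}
```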

Answer by Alexander Torstling

You will achieve at most a few hundred thousand requests per second with regular hardware. At least that is my experience trying to build similar solutions, and the TechEmpower Web Framework Benchmarks seem to agree as well.

The best approach, generally, depends on whether you have IO-bound or CPU-bound loads.

For IO-bound loads (high latency), you need to do async IO with many threads. For best performance you should try to avoid handoffs between threads as much as possible. So having a dedicated selector thread and a separate thread pool for processing is slower than having a thread pool where every thread does either selection or processing, so that a request gets handled by a single thread in the best case (if IO is immediately available). This type of setup is more complicated to code but fast, and I don't believe any async web framework exploits this fully.

For CPU-bound loads, one thread per request is usually the fastest, since you avoid context switches.
