Notice: this page reproduces a popular StackOverflow question under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/3126154/

Date: 2020-09-10 01:09:05  Source: igfitidea

Multithreading: What is the point of more threads than cores?

Tags: multithreading, hardware, cpu-cores

Asked by Nick Heiner

I thought the point of a multi-core computer is that it could run multiple threads simultaneously. In that case, if you have a quad-core machine, what's the point of having more than 4 threads running at a time? Wouldn't they just be stealing time from each other?

Answer by David

The answer revolves around the purpose of threads, which is parallelism: to run several separate lines of execution at once. In an 'ideal' system, you would have one thread executing per core: no interruption. In reality this isn't the case. Even if you have four cores and four working threads, your process and its threads will constantly be switched out for other processes and threads. If you are running any modern OS, every process has at least one thread, and many have more. All these processes are running at once. You probably have several hundred threads all running on your machine right now. You won't ever get a situation where a thread runs without having time 'stolen' from it. (Well, you might if it's running real-time: if you're using a real-time OS or, even on Windows, a real-time thread priority. But it's rare.)

With that as background, the answer: Yes, more than four threads on a true four-core machine may give you a situation where they 'steal time from each other', but only if each individual thread needs 100% CPU. If a thread is not working 100% (as a UI thread might not be, or a thread doing a small amount of work or waiting on something else) then another thread being scheduled is actually a good situation.

It's actually more complicated than that:

  • What if you have five bits of work that all need to be done at once? It makes more sense to run them all at once, than to run four of them and then run the fifth later.

  • It's rare for a thread to genuinely need 100% CPU. The moment it uses disk or network I/O, for example, it may spend time waiting and doing nothing useful. This is a very common situation.

  • If you have work that needs to be run, one common mechanism is to use a threadpool. It might seem to make sense to have the same number of threads as cores, yet the .NET threadpool has up to 250 threads available per processor. I'm not certain why they do this, but my guess is that it has to do with the size of the tasks that are given to run on the threads.

So: stealing time isn't a bad thing (and isn't really theft, either: it's how the system is supposed to work.) Write your multithreaded programs based on the kind of work the threads will do, which may not be CPU-bound. Figure out the number of threads you need based on profiling and measurement. You may find it more useful to think in terms of tasks or jobs, rather than threads: write objects of work and give them to a pool to be run. Finally, unless your program is truly performance-critical, don't worry too much :)
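
As a rough illustration of handing "objects of work" to a pool, here is a minimal Python sketch. The names `fake_io_task` and `run_with_pool` are made up for this example, and `time.sleep` is only a stand-in for a disk or network wait. Because the tasks spend their time blocked rather than computing, a pool larger than the core count should finish sooner than a small one; the actual timings depend on your machine, so measure rather than trust this sketch.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_io_task(task_id: int) -> int:
    """Hypothetical task: pretend to wait on disk or network, then return."""
    time.sleep(0.1)  # the thread is blocked here and uses essentially no CPU
    return task_id * 2


def run_with_pool(n_tasks: int, n_threads: int):
    """Run n_tasks on a pool of n_threads; return (results, elapsed seconds)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        results = list(pool.map(fake_io_task, range(n_tasks)))
    return results, time.perf_counter() - start


if __name__ == "__main__":
    # Compare pool sizes by measurement, as the answer recommends.
    for n_threads in (1, 4, 16):
        _, elapsed = run_with_pool(16, n_threads)
        print(f"{n_threads:2d} threads: {elapsed:.2f} s")
```

The work is described once (`fake_io_task`) and the pool size is a separate knob, which makes it easy to profile different thread counts without touching the task code.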

Answer by Amber

Just because a thread exists doesn't always mean it's actively running. Many applications of threads involve some of the threads going to sleep until it's time for them to do something - for instance, user input triggering threads to wake up, do some processing, and go back to sleep.
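
A small Python sketch of a thread that exists but is not running, assuming a `threading.Event` as the wake-up signal (the names `wake_up`, `handled`, and `worker` are invented for this example):

```python
import threading

wake_up = threading.Event()   # hypothetical signal standing in for "user input arrived"
handled = []


def worker() -> None:
    wake_up.wait()            # the thread sleeps here, using no CPU while it waits
    handled.append("processed input")


t = threading.Thread(target=worker)
t.start()
# At this point the thread exists but is parked on the event, not running.
wake_up.set()                 # simulate the user input that wakes the thread
t.join()
```

Until `wake_up.set()` is called, the worker thread takes no time away from anything else on the machine.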

Essentially, threads are individual tasks that can operate independently of one another, with no need to be aware of the progress of another task. It's quite possible to have more of these than you have ability to run simultaneously; they're still useful for convenience even if they sometimes have to wait in line behind one another.

Answer by JustJeff

The point is that, despite not getting any real speedup when thread count exceeds core count, you can use threads to disentangle pieces of logic that should not have to be interdependent.

In even a moderately complex application, using a single thread to try to do everything quickly makes a hash of the 'flow' of your code. The single thread spends most of its time polling this, checking on that, conditionally calling routines as needed, and it becomes hard to see anything but a morass of minutiae.

Contrast this with the case where you can dedicate threads to tasks so that, looking at any individual thread, you can see what that thread is doing. For instance, one thread might block waiting on input from a socket, parse the stream into messages, filter messages, and when a valid message comes along, pass it off to some other worker thread. The worker thread can work on inputs from a number of other sources. The code for each of these will exhibit a clean, purposeful flow, without having to make explicit checks that there isn't something else to do.

Partitioning the work this way allows your application to rely on the operating system to schedule what to do next with the cpu, so you don't have to make explicit conditional checks everywhere in your application about what might block and what's ready to process.

Answer by IceArdor

If a thread is waiting for a resource (such as loading a value from RAM into a register, disk I/O, network access, launching a new process, querying a database, or waiting for user input), the processor can work on a different thread, and return to the first thread once the resource is available. This reduces the time the CPU spends idle, as the CPU can perform millions of operations instead of sitting idle.

Consider a thread that needs to read data off a hard drive. In 2014, a typical processor core operates at 2.5 GHz and may be able to execute 4 instructions per cycle. With a cycle time of 0.4 ns, the processor can execute 10 instructions per nanosecond. With typical mechanical hard drive seek times of around 10 milliseconds, the processor is capable of executing 100 million instructions in the time it takes to read a value from the hard drive. There may be significant performance improvements with hard drives that have a small cache (4 MB buffer) and hybrid drives with a few GB of storage, as data latency for sequential reads or reads from the hybrid section may be several orders of magnitude lower.
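
The arithmetic above can be checked with a few lines of Python. The figures are the ones quoted in this answer (2.5 GHz, 4 instructions per cycle, 10 ms seek); the variable names are just for this sketch, and integers are used so the result is exact.

```python
clock_hz = 2_500_000_000        # 2.5 GHz, as quoted above
instructions_per_cycle = 4      # figure quoted above
seek_time_ms = 10               # typical mechanical seek time quoted above

cycle_time_ns = 1e9 / clock_hz                                   # about 0.4 ns
instructions_per_second = clock_hz * instructions_per_cycle      # 10 per nanosecond
instructions_per_seek = instructions_per_second * seek_time_ms // 1000

print(f"cycle time: {cycle_time_ns} ns")
print(f"instructions per seek: {instructions_per_seek:,}")
```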

A processor core can switch between threads (the cost of pausing and resuming a thread is around 100 clock cycles) while the first thread waits for a high-latency input (anything more expensive than registers (1 clock) and RAM (5 nanoseconds)). These include disk I/O, network access (latency of 250 ms), reading data off a CD or a slow bus, or a database call. Having more threads than cores means useful work can be done while high-latency tasks are resolved.

The operating system has a thread scheduler that assigns priority to each thread, and allows a thread to sleep, then resume after a predetermined time. It is the thread scheduler's job to reduce thrashing, which would occur if each thread executed just 100 instructions before being put to sleep again. The overhead of switching threads would reduce the total useful throughput of the processor core.

For this reason, you may want to break up your problem into a reasonable number of threads. If you were writing code to perform matrix multiplication, creating one thread per cell in the output matrix might be excessive, whereas one thread per row or per n rows in the output matrix might reduce the overhead cost of creating, pausing, and resuming threads.
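
To make the granularity point concrete, here is a hedged Python sketch that splits a matrix product into one task per block of rows instead of one per cell. `multiply_rows`, `threaded_matmul`, and the `rows_per_task` parameter are names invented for this example. Note that in CPython the global interpreter lock keeps pure-Python threads from speeding up CPU-bound work, so this only illustrates how the work is partitioned, not a performance gain.

```python
from concurrent.futures import ThreadPoolExecutor


def multiply_rows(a, b, row_start, row_end):
    """Compute rows [row_start, row_end) of the product of matrices a and b."""
    cols = list(zip(*b))  # columns of b
    return [[sum(x * y for x, y in zip(row, col)) for col in cols]
            for row in a[row_start:row_end]]


def threaded_matmul(a, b, rows_per_task=2, n_threads=4):
    """One task per block of rows: far fewer tasks than one per output cell."""
    starts = range(0, len(a), rows_per_task)
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        chunks = pool.map(
            lambda s: multiply_rows(a, b, s, min(s + rows_per_task, len(a))),
            starts,
        )
        # pool.map yields chunks in submission order, so rows stay in order.
        return [row for chunk in chunks for row in chunk]


if __name__ == "__main__":
    print(threaded_matmul([[1, 2], [3, 4], [5, 6]], [[7, 8], [9, 10]]))
```

Raising `rows_per_task` trades parallelism for lower per-task overhead; the right value is something to measure for a given workload.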

This is also why branch prediction is important. If you have an if statement that requires loading a value from RAM but the body of the if and else statements use values already loaded into registers, the processor may execute one or both branches before the condition has been evaluated. Once the condition returns, the processor will apply the result of the corresponding branch and discard the other. Performing potentially useless work here is probably better than switching to a different thread, which could lead to thrashing.

As we have moved away from high clock-speed single-core processors to multi-core processors, chip design has focused on cramming more cores per die, improving on-chip resource sharing between cores, better branch prediction algorithms, lower thread-switching overhead, and better thread scheduling.

Answer by JUST MY correct OPINION

Most of the answers above talk about performance and simultaneous operation. I'm going to approach this from a different angle.

Let's take the case of, say, a simplistic terminal emulation program. You have to do the following things:

  • watch for incoming characters from the remote system and display them
  • watch for stuff coming from the keyboard and send it to the remote system

(Real terminal emulators do more, including potentially echoing the stuff you type onto the display as well, but we'll pass over that for now.)

Now the loop for reading from the remote is simple, as per the following pseudocode:

while get-character-from-remote:
    print-to-screen character

The loop for monitoring the keyboard and sending is also simple:

while get-character-from-keyboard:
    send-to-remote character

The problem, though, is that you have to do this simultaneously. The code now has to look more like this if you don't have threading:

loop:
    check-for-remote-character
    if remote-character-is-ready:
        print-to-screen character
    check-for-keyboard-entry
    if keyboard-is-ready:
        send-to-remote character

The logic, even in this deliberately simplified example that doesn't take into account real-world complexity of communications, is quite obfuscated. With threading, however, even on a single core, the two pseudocode loops can exist independently without interlacing their logic. Since both threads will be mostly I/O-bound, they don't put a heavy load on the CPU, even though they are, strictly speaking, more wasteful of CPU resources than the integrated loop would be.
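
For comparison, the threaded version of the two loops might look roughly like this in Python. This is only a sketch: `queue.Queue` objects stand in for the remote connection and the keyboard, plain lists stand in for the screen and the outgoing link, and `None` is an assumed end-of-input marker. None of these names come from the pseudocode above.

```python
import queue
import threading


def pump(source: queue.Queue, sink: list) -> None:
    """One simple loop: block for the next character, then forward it."""
    while True:
        ch = source.get()      # blocks until a character arrives; no polling
        if ch is None:         # assumed sentinel marking end of input
            break
        sink.append(ch)


remote_in, keyboard_in = queue.Queue(), queue.Queue()
screen, remote_out = [], []

threads = [
    threading.Thread(target=pump, args=(remote_in, screen)),       # remote -> screen
    threading.Thread(target=pump, args=(keyboard_in, remote_out)), # keyboard -> remote
]
for t in threads:
    t.start()

# Simulate some traffic in both directions, then end both inputs.
for ch in "hi":
    remote_in.put(ch)
for ch in "ls":
    keyboard_in.put(ch)
remote_in.put(None)
keyboard_in.put(None)
for t in threads:
    t.join()
```

Each thread runs the same two-line loop shape as the original pseudocode, and neither needs to know the other exists.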

Now of course real-world usage is more complicated than the above. But the complexity of the integrated loop goes up exponentially as you add more concerns to the application. The logic gets ever more fragmented and you have to start using techniques like state machines, coroutines, and so on to keep things manageable. Manageable, but not readable. Threading keeps the code more readable.

So why would you not use threading?

Well, if your tasks are CPU-bound instead of I/O-bound, threading actually slows your system down. Performance will suffer. A lot, in many cases. ("Thrashing" is a common problem if you start too many CPU-bound threads. You wind up spending more time changing the active threads than you do running the contents of the threads themselves.) Also, one of the reasons the logic above is so simple is that I've very deliberately chosen a simplistic (and unrealistic) example. If you wanted to echo what was typed to the screen then you've got a new world of hurt as you introduce locking of shared resources. With only one shared resource this isn't so much a problem, but it does start to become a bigger and bigger problem as you have more resources to share.

So in the end, threading is about many things. For example, it's about making I/O-bound processes more responsive (even if less efficient overall), as some have already said. It's also about making logic easier to follow (but only if you minimize shared state). It's about a lot of stuff, and you have to decide if its advantages outweigh its disadvantages on a case-by-case basis.

Answer by fishtoprecords

I strongly disagree with @kyoryu's assertion that the ideal number is one thread per CPU.

Think about it this way: why do we have multi-processing operating systems? For most of computer history, nearly all computers had one CPU. Yet from the 1960s on, all "real" computers had multi-processing (aka multi-tasking) operating systems.

You run multiple programs so that one can run while others are blocked for things like IO.

Let's set aside arguments about whether Windows versions before NT were multi-tasking. Since then, every real OS has had multi-tasking. Some don't expose it to users, but it's there anyway, doing things like listening to the cellphone radio, talking to the GPS chip, accepting mouse input, etc.

Threads are just tasks that are a bit more efficient. There is no fundamental difference between a task, process, and thread.

A CPU is a terrible thing to waste, so have lots of things ready to use it when you can.

I will agree that with most procedural languages (C, C++, Java, etc.), writing proper thread-safe code is a lot of work. With 6-core CPUs on the market today, and 16-core CPUs not far away, I expect that folks will move away from these old languages, as multi-threading is more and more of a critical requirement.

Disagreement with @kyoryu is just IMHO, the rest is fact.

Answer by Cam

Although you can certainly use threads for speeding up calculations depending on your hardware, one of their main uses is to do more than one thing at a time for user-friendliness reasons.

For example, if you have to do some processing in the background and also remain responsive to UI input, you can use threads. Without threads, the user interface would hang every time you tried to do any heavy processing.
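
A toy Python sketch of that idea, with no real GUI toolkit involved: `heavy_processing` and `run_ui` are hypothetical names, `time.sleep` stands in for the slow background work, and the counter stands in for handling clicks and repaints.

```python
import threading
import time


def heavy_processing(result: dict) -> None:
    """Hypothetical background job; the sleep is a stand-in for slow work."""
    time.sleep(0.2)
    result["value"] = 42


def run_ui():
    """Keep 'handling UI events' while the background thread works."""
    result: dict = {}
    worker = threading.Thread(target=heavy_processing, args=(result,))
    worker.start()
    ui_events_handled = 0
    while worker.is_alive():
        ui_events_handled += 1   # stand-in for processing a click or a repaint
        time.sleep(0.01)
    worker.join()
    return result["value"], ui_events_handled


if __name__ == "__main__":
    value, events = run_ui()
    print(f"result={value}, UI events handled while waiting: {events}")
```

Without the worker thread, the loop that services the UI could not run until `heavy_processing` returned.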

Also see this related question: Practical uses for threads

Answer by tobiw

Imagine a Web server that has to serve an arbitrary number of requests. You have to serve the requests in parallel because otherwise each new request has to wait until all the other requests have been completed (including sending the response over the Internet). In this case, most web servers have far fewer cores than the number of requests they usually serve.

It also makes it easier for the developer of the server: You only have to write a thread program that serves a request, you don't have to think about storing multiple requests, the order you serve them, and so on.
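
A minimal thread-per-request sketch using Python's standard `socketserver` module. It is a line-echo server rather than a real web server, and `EchoHandler`, `start_server`, and `ask` are names made up for this example. The handler reads like straight-line code for a single client; `ThreadingTCPServer` runs it in a new thread for each connection.

```python
import socket
import socketserver
import threading


class EchoHandler(socketserver.StreamRequestHandler):
    """Code for exactly one request; no bookkeeping about other clients."""

    def handle(self) -> None:
        line = self.rfile.readline()        # may block; only this thread waits
        self.wfile.write(b"echo: " + line)


def start_server() -> socketserver.ThreadingTCPServer:
    # Port 0 asks the OS for any free port on the loopback interface.
    server = socketserver.ThreadingTCPServer(("127.0.0.1", 0), EchoHandler)
    server.daemon_threads = True
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def ask(port: int, text: str) -> str:
    """Tiny client used to exercise the server."""
    with socket.create_connection(("127.0.0.1", port)) as conn:
        conn.sendall(text.encode() + b"\n")
        return conn.makefile().readline().strip()


if __name__ == "__main__":
    server = start_server()
    print(ask(server.server_address[1], "hello"))
    server.shutdown()
    server.server_close()
```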

Answer by Puppy

Many threads will be asleep, waiting for user input, I/O, and other events.

Answer by Anon

Threads can help with responsiveness in UI applications. Additionally, you can use threads to get more work out of your cores. For instance, on a single core, you can have one thread doing IO and another doing some computation. If it were single threaded, the core could essentially be idle waiting for the IO to complete. That's a pretty high level example, but threads can definitely be used to pound your cpu a bit harder.