How to split a program to fully utilize multi-CPU, multi-core and hyper-threading?

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license and attribute it to the original authors (not me): StackOverflow, original at http://stackoverflow.com/questions/4743260/

Date: 2020-09-10 01:12:48  Source: igfitidea


Tags: multithreading, multicore

Asked by teloon

I have a bunch of commands to execute for gene sequencing. For example:

msclle_program -in 1.txt
msclle_program -in 2.txt
msclle_program -in 3.txt
      .........
msclle_program -in 10.txt
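Since the commands are independent, one minimal way to run them concurrently is to background each one and wait for them all. In the sketch below, `echo` stands in for `msclle_program` so it can be run as-is; drop the `echo` to launch the real jobs (file names 1.txt..10.txt are from the question):

```shell
# Launch all ten independent jobs in the background, then wait for
# every one of them to finish before the script continues.
for i in $(seq 1 10); do
  echo msclle_program -in "$i.txt" &
done
wait
```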

These commands are independent of each other.

The environment is a Linux desktop: Intel i7 (4 cores / 8 threads) × 7, with 12 GB of memory.

I can split these commands into different n.sh programs and run them simultaneously.
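That splitting step could look like the following hedged sketch, assuming 10 input files and N=4 script files (both numbers are illustrative, as is the `job_` naming): distribute the commands round-robin over the scripts, then launch each generated script in the background.

```shell
# Split the 10 commands round-robin into N script files
# job_1.sh .. job_N.sh.
N=4
rm -f job_*.sh
for i in $(seq 1 10); do
  idx=$(( (i - 1) % N + 1 ))
  echo "msclle_program -in $i.txt" >> "job_$idx.sh"
done
# To run them simultaneously:
#   for s in job_*.sh; do sh "$s" & done; wait
```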

My question is: how can I fully utilize multi-CPU, multi-core and hyper-threading to make the program run faster?

More specifically, how many program files should I split the commands into?

My own understanding is:

  1. Split the commands into 7 program files, so each CPU will run one program at 100%.
  2. Within one CPU, the CPU will utilize its multiple cores and hardware threads on its own.

Is this true?

Many thanks for your comments.

Answered by

The answer is not simple or straightforward: splitting the task into one program per CPU is likely to be non-optimal, and may indeed be poor or even extremely poor.

First, as I understand it, you have seven quad-core CPUs (presumably there are eight, but you're saving one for the OS?). If you run a single threaded process on each CPU, you will be using a single thread on a single core. The other three cores and all of the hyperthreads will not be used.

The hardware and OS cannot split a single thread over multiple cores.

You could however run four single-threaded processes per CPU (one per core), or even eight (one per hyperthread). Whether or not this is optimal depends on the work being done by the processes; in particular, their working set size and memory access patterns, and upon the hardware cache arrangements; the number of levels of cache, their sizes and their sharing. Also the NUMA arrangement of the cores needs to be considered.

Basically speaking, an extra thread has to give you quite a bit of speed-up to outweigh what it can cost you in cache utilization, main memory accesses and the disruption of pre-fetching.

Furthermore, because the effects of the working set exceeding certain caching limits are profound, what seems fine for, say, one or two cores may be appalling for four or eight, so you can't even experiment with one core and assume the results carry over to eight.

Having a quick look, I see the i7 has a small L2 cache and a huge L3 cache. Given your data set, I wouldn't be surprised if there's a lot of data being processed. The question is whether or not it is sequentially processed (e.g. whether prefetching will be effective). If the data is not sequentially processed, you may do better by reducing the number of concurrent processes, so that their working sets tend to fit inside the L3 cache. I suspect if you run eight or sixteen processes, the L3 cache will be hammered, i.e. overflowed. On the other hand, if your data access is non-sequential, the L3 cache probably isn't going to save you anyway.
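Since the best concurrency level is hard to predict from cache sizes alone, one way to settle it is empirical: time the same batch at a few levels of parallelism and pick the fastest. A hedged sketch (the process counts are illustrative, and `echo` stands in for the real program so the sketch runs anywhere):

```shell
# Time the batch at several degrees of parallelism; xargs -P caps the
# number of concurrent processes at $p for each run.
for p in 1 2 4 8; do
  echo "processes=$p"
  time ( seq 1 10 | xargs -P "$p" -I{} echo msclle_program -in {}.txt > /dev/null )
done
```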

Answered by Raghuram

You can spawn multiple processes and then assign each process to one CPU. You can use taskset -c to do this.

Keep a rolling counter and increment it to specify the processor number.
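A sketch of that rolling counter, assuming 8 logical CPUs numbered 0-7 as on the questioner's i7. Here `echo` prints the pinned commands so the sketch runs anywhere; drop the `echo` (and the quoted `"&"`) to actually pin and launch the jobs:

```shell
# Assign each job to the next logical CPU, wrapping around after 7.
cpu=0
for i in $(seq 1 10); do
  echo taskset -c "$cpu" msclle_program -in "$i.txt" "&"
  cpu=$(( (cpu + 1) % 8 ))
done
```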

Answered by Joonas Pulakka

split into 7 program files. So each CPU will 100% run one program.

This is approximately correct: if you have 7 single-threaded programs and 7 processing units, then each of them has one thread to run. This is optimal: with fewer programs, some processing units would be idle; with more, time would be wasted alternating between them. However, if you have 7 quad-core processors, then the optimum number of threads (from a "CPU-bound perspective") would be 28. This is simplified, since in reality there will be other programs around sharing the CPU.
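The "28 threads for 7 quad-core CPUs" figure generalizes to: run as many CPU-bound jobs as there are hardware threads, which `nproc` reports on Linux. A sketch under that assumption, again with `echo` standing in for the real command:

```shell
# Run one job per hardware thread; xargs -P caps the number of
# concurrent processes at the count nproc reports.
n=$(nproc)
seq 1 10 | xargs -P "$n" -I{} echo msclle_program -in {}.txt
```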

With one CPU, the CPU will utilize its multi-core and multi-thread by its own.

No. Whether or not all cores are in the single CPU or not makes little difference (it does make some difference in caching, though). Anyway, the processor won't do any multithreading by its own. It's the programmer's job. That's why making programs faster has become very challenging nowadays: until about 2005 or so it was free ride, as the clock frequencies were steadily rising, but now the limit has been reached, and speeding up programs requires splitting them into the growing number of processing units. It's one of the major reasons for the renaissance of functional programming.

Answered by Olof Forshell

Why run them as separate processes? Consider running multiple threads in one process instead which would make both the memory footprint much smaller and lower the amount of process scheduling required.

You could look at it this way (a bit over-simplified but still):

Consider dividing up your work into processable units (PU). You then want two or more cores to each process one PU at a time such that they don't interfere with each other and the more cores the more PUs you can process.

The work involved in processing one PU is input + processing + output (I+P+O). Since the PUs probably come from large memory structures containing perhaps millions of elements or more, the input and output mostly involve memory. With one core this is not a problem, because no other core interferes with the memory accesses. With multiple cores the problem basically moves to the nearest common resource, in this case the L3 cache, giving cache input (CI) and cache output (CO). With two cores you would want CI+CO to equal P/2 or less, because then the two cores could take turns accessing the nearest common resource (the L3 cache) without interfering with each other. With three cores CI+CO would need to be P/3, and with four or eight cores you would need CI+CO to equal P/4 or P/8.

So the trick is to make the processing required for a PU reside completely inside a core and its own caches (L1 and L2). The more cores you have the larger the PUs should be (in relation to the I/O required) such that the PU stays isolated inside its core as long as possible and with all the data it needs available in its local caches.

To sum it up you want the cores to do as much meaningful and efficient processing as possible while impacting the L3 cache as little as possible because the L3 cache is the bottleneck. It's a challenge to achieve such a balance but by no means impossible.

As you understand, the cores executing "traditional" multi-threaded administrative or web applications (where no care whatsoever is taken to economize on L3 accesses) will constantly be colliding with each other for access to the L3 cache and resources further out. It is not uncommon for multi-threaded programs running on multiple cores to be slower than if they'd been running on single cores.

Also, don't forget that OS work impacts the cache (a lot) as well. If you divide the problem into separate processes (as mentioned above), you'll be calling in the OS to referee much more often than is absolutely necessary.

My experience is that the existence of this problem, and its dos and don'ts, are mostly unknown or not understood.