Notice: this page is based on a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/10403201/

Time: 2020-09-10 01:19:06  Source: igfitidea

8 logical threads at 4 cores will at a maximum run 4 times faster in parallel?

Tags: multithreading, openmp, multicore

Asked by Cisum Inas

I'm benchmarking software which executes 4x faster on an Intel 2670QM than my serial version when using all 8 of my 'logical' threads. I would like some community feedback on my interpretation of the benchmarking results.

When I use 4 threads on 4 cores I get a speedup of 4x; the entire algorithm is executed in parallel. This seems logical to me since Amdahl's law predicts it. Windows Task Manager tells me I'm using 50% of the CPU.

However, if I execute the same software on all 8 threads, I once again get a speedup of 4x and not a speedup of 8x.

If I have understood this correctly: my CPU has 4 cores, each with a frequency of 2.2 GHz, but the frequency is divided down to 1.1 GHz when applied to 8 'logical' threads, and the same applies to the other components, such as the cache memory? If this is true, then why does the task manager claim only 50% of my CPU is being used?

#define NumberOfFiles 8
...
char startLetter ='a';
#pragma omp parallel for shared(startLetter)
for(int f=0; f<NumberOfFiles; f++){
    ...
}

I am not including the time spent on disk I/O. I am only interested in the time an STL call takes (STL sort), not the disk I/O.
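
For readers who want to reproduce this kind of measurement, here is a minimal sketch of timing only the sort calls inside an OpenMP loop like the one above. It is not the asker's code: the helper names `make_inputs` and `timed_parallel_sort` are hypothetical, and random in-memory vectors stand in for the file contents so that no disk I/O is timed.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical stand-in for the file contents: one vector of random ints per "file".
std::vector<std::vector<int>> make_inputs(int number_of_files, std::size_t items_per_file) {
    assert(number_of_files > 0);
    std::mt19937 rng(42);  // fixed seed so repeated runs sort the same data
    std::uniform_int_distribution<int> dist(0, 1000000);
    std::vector<std::vector<int>> inputs(static_cast<std::size_t>(number_of_files),
                                         std::vector<int>(items_per_file));
    for (auto& file : inputs)
        for (auto& value : file)
            value = dist(rng);
    return inputs;
}

// Sorts every vector and returns the elapsed wall-clock time in milliseconds.
// Only the std::sort calls are timed; building the inputs is not.
// With OpenMP enabled (e.g. -fopenmp on GCC/Clang) the loop iterations are
// distributed over threads; without it the pragma is ignored and the loop is serial.
double timed_parallel_sort(std::vector<std::vector<int>>& inputs) {
    const int count = static_cast<int>(inputs.size());
    const auto start = std::chrono::steady_clock::now();
    #pragma omp parallel for
    for (int f = 0; f < count; ++f) {
        std::sort(inputs[f].begin(), inputs[f].end());
    }
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Calling `timed_parallel_sort` on the result of `make_inputs(8, n)` with the `OMP_NUM_THREADS` environment variable set to 1, 4 and then 8 gives the three timings being compared in the question.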

Accepted answer by Nys

An i7-2670QM processor has 4 cores, but it can run 8 threads in parallel. This means that it has only 4 processing units (cores) but has hardware support for running 8 threads in parallel. At most four jobs run on the cores at any moment; if one of the jobs stalls, for example because of a memory access, another thread can start executing on the free core very quickly and with very little penalty. Read more on Hyper-Threading. In reality there are few scenarios where Hyper-Threading gives a large performance gain. More modern processors handle Hyper-Threading better than older processors.
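
As a small illustration of the "4 cores, 8 threads" distinction, the sketch below asks the C++ runtime how many hardware threads it sees. The helper name `logical_thread_count` is hypothetical. `std::thread::hardware_concurrency()` is only a hint and usually counts logical processors rather than physical cores, so on a 4-core CPU with Hyper-Threading enabled it would typically report 8 (an assumption, not something measured here).

```cpp
#include <cassert>
#include <thread>

// Hypothetical helper: returns the number of hardware threads the runtime
// reports, or `fallback` when the value is unknown (the standard allows
// hardware_concurrency() to return 0 in that case).
unsigned logical_thread_count(unsigned fallback = 1) {
    assert(fallback >= 1);
    const unsigned reported = std::thread::hardware_concurrency();
    return reported != 0 ? reported : fallback;
}
```

Because the value counts logical processors, sizing a CPU-bound thread pool with it may create twice as many threads as there are physical cores.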

Your benchmark showed that the workload was CPU bound, i.e. there were few stalls in the pipeline that would have given Hyper-Threading an advantage. 50% CPU is correct, as the 4 cores are working and the 4 extra logical processors are not doing anything. Turn off Hyper-Threading in the BIOS and you will see 100% CPU.

Answer by Andrew Brock

This is a quick summary of Hyperthreading/HyperTransport

Thread switching is slow, having to stop execution, copy a bunch of values into memory, copy a bunch of values out of memory into the CPU, then start things going again with the new thread.

This is where your 4 virtual cores come in. You have 4 cores, that is it, but what hyperthreading allows the CPU to do is have 2 threads on a single core.

Only 1 thread can execute at a time; however, when 1 thread needs to stop for a memory access, disk access or anything else that is going to take some time, the core can switch in the other thread and run it for a bit. Older processors basically just slept during this time.

So your quad core has 4 cores, which can do 1 thing at a time each, but can have a 2nd job on standby as soon as they need to wait on another part of the computer.

If your task has a lot of memory usage and a lot of CPU usage, you should see a slight decrease in total execution time, but if you are almost entirely CPU bound you will be better off sticking with just 4 threads.

Answer by sergico

The important piece of information to understand here is the difference between physical and logical threads.
If you have 4 physical cores on your CPU, you have the physical resources to execute 4 distinct threads of execution in parallel. So, if your threads do not contend for data, you can normally measure a x4 performance increase compared to the speed of a single thread.
I'm also assuming that the OS (or you :)) sets the thread affinity correctly, so that each thread runs on its own physical core.
When you enable HT (Hyper-Threading) on your CPU the core frequency is not modified. :)
What happens is that part of the hardware pipeline (inside the core and around it: uncore, cache, etc.) is duplicated, but part of it is still shared between the logical threads. That is why you do not measure a x8 performance increase. In my experience, with all logical cores enabled you can get a x1.5 - x1.7 performance improvement per physical core, depending on the code you are executing, cache usage (remember that the L1 cache is shared between the two logical cores of one physical core, for instance), thread affinity, and so on. Hope this helps.
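
To make the arithmetic behind that rule of thumb explicit, here is a tiny sketch. The function name `estimated_ht_speedup` is hypothetical, and it assumes perfect scaling across physical cores, which real workloads may not reach.

```cpp
#include <cassert>

// Hypothetical helper: aggregate speedup over one thread, assuming each
// physical core scales perfectly and Hyper-Threading multiplies the
// throughput of every core by `gain_per_core` (x1.5 - x1.7 in the answer above).
double estimated_ht_speedup(int physical_cores, double gain_per_core) {
    assert(physical_cores > 0 && gain_per_core >= 1.0);
    return physical_cores * gain_per_core;
}
```

For 4 physical cores this gives roughly x6 to x6.8 at best, which is still short of x8; a fully CPU-bound workload such as the one in the question may see a gain per core much closer to 1.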

Answer by Martin James

Some actual numbers:

CPU-intensive task on my i7, (adding numbers from 1-1000000000 into an int var, 16 times), averaged over 8 tests:

Summary, threads/ticks:

1/26414
4/8923
8/6659
12/6592
16/6719
64/6811
128/6778
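
The summary is easier to read as speedups. A minimal sketch, with hypothetical helper names `speedup` and `efficiency`, that turns the averaged tick counts above into ratios:

```cpp
#include <cassert>

// Hypothetical helper: speedup of a parallel run relative to the 1-thread run.
double speedup(double serial_ticks, double parallel_ticks) {
    assert(serial_ticks > 0 && parallel_ticks > 0);
    return serial_ticks / parallel_ticks;
}

// Hypothetical helper: parallel efficiency, i.e. speedup per worker thread.
double efficiency(double serial_ticks, double parallel_ticks, int workers) {
    assert(workers > 0);
    return speedup(serial_ticks, parallel_ticks) / workers;
}
```

With the averages above, `speedup(26414, 8923)` is about 2.96 for 4 threads and `speedup(26414, 6659)` is about 3.97 for 8 threads, so doubling the thread count from 4 to 8 buys roughly x1.34 rather than x2, and adding more threads beyond 8 changes little.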

Note that in the 'using X threads' line in the reports below, X is one greater than the number of threads available to do the tasks - one thread submits the tasks and waits on a countdown-latch event for their completion - it processes none of the CPU-heavy tasks and uses no CPU.

8 tests,
16 tasks,
counting to 1000000000,
using 2 threads:
Ticks: 26286
Ticks: 26380
Ticks: 26317
Ticks: 26474
Ticks: 26442
Ticks: 26426
Ticks: 26474
Ticks: 26520
Average: 26414 ms

8 tests,
16 tasks,
counting to 1000000000,
using 5 threads:
Ticks: 8799
Ticks: 9157
Ticks: 8829
Ticks: 9002
Ticks: 9173
Ticks: 8720
Ticks: 8830
Ticks: 8876
Average: 8923 ms

8 tests,
16 tasks,
counting to 1000000000,
using 9 threads:
Ticks: 6615
Ticks: 6583
Ticks: 6630
Ticks: 6599
Ticks: 6521
Ticks: 6895
Ticks: 6848
Ticks: 6583
Average: 6659 ms

8 tests,
16 tasks,
counting to 1000000000,
using 13 threads:
Ticks: 6661
Ticks: 6599
Ticks: 6552
Ticks: 6630
Ticks: 6583
Ticks: 6583
Ticks: 6568
Ticks: 6567
Average: 6592 ms

8 tests,
16 tasks,
counting to 1000000000,
using 17 threads:
Ticks: 6739
Ticks: 6864
Ticks: 6599
Ticks: 6693
Ticks: 6676
Ticks: 6864
Ticks: 6646
Ticks: 6677
Average: 6719 ms

8 tests,
16 tasks,
counting to 1000000000,
using 65 threads:
Ticks: 7223
Ticks: 6552
Ticks: 6879
Ticks: 6677
Ticks: 6833
Ticks: 6786
Ticks: 6739
Ticks: 6802
Average: 6811 ms

8 tests,
16 tasks,
counting to 1000000000,
using 129 threads:
Ticks: 6771
Ticks: 6677
Ticks: 6755
Ticks: 6692
Ticks: 6864
Ticks: 6817
Ticks: 6849
Ticks: 6801
Average: 6778 ms

Answer by Hristo Iliev

HT is called SMT (Simultaneous MultiThreading) or HTT (HyperThreading Technology) in most BIOSes. The efficiency of HT depends on the so-called compute-to-fetch ratio, that is, how many in-core (or register/cache) operations your code performs before it fetches from or stores to the slow main memory or I/O memory. For highly cache-efficient and CPU-bound codes, HT gives almost no noticeable performance increase. For more memory-bound codes, HT can really benefit execution thanks to so-called "latency hiding". That's why most non-x86 server CPUs provide 4 (e.g. IBM POWER7) to 8 (e.g. UltraSPARC T4) hardware threads per core. These CPUs are usually used in database and transaction processing systems, where many concurrent memory-bound requests are serviced at once.

By the way, Amdahl's law states that the upper limit of the parallel speedup is one over the serial fraction of the code. Usually the serial fraction increases with the number of processing elements if there is (possibly hidden in the runtime) communication or other synchronisation between the threads, although sometimes cache effects can lead to superlinear speedup and sometimes cache thrashing can reduce performance drastically.
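
For reference, the usual form of Amdahl's law can be sketched as a one-line function. The name `amdahl_speedup` is hypothetical, and the formula assumes the parallel part scales perfectly over identical workers, which is exactly what logical Hyper-Threading threads do not provide.

```cpp
#include <cassert>

// Hypothetical helper: predicted speedup for a program whose serial part takes
// `serial_fraction` of the single-threaded run time, executed on `workers`
// identical processing elements. As `workers` grows, the result approaches
// 1 / serial_fraction, the upper limit mentioned above.
double amdahl_speedup(double serial_fraction, int workers) {
    assert(serial_fraction >= 0.0 && serial_fraction <= 1.0);
    assert(workers >= 1);
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / workers);
}
```

With no serial part, `amdahl_speedup(0.0, 4)` is 4, matching the measurement on 4 cores. With a 10% serial part, `amdahl_speedup(0.1, 8)` is about 4.7 and the result never exceeds 10 however many workers are added.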
