Original URL: http://stackoverflow.com/questions/9370754/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
OpenMP threads executing on the same cpu core
Question by Grizzly
I'm currently parallelizing a program using OpenMP on a 4-core Phenom II. However, I noticed that my parallelization does nothing for performance. Naturally, I assumed I had missed something (false sharing, serialization through locks, ...), but I was unable to find anything like that. Furthermore, judging from the CPU utilization, it seemed like the program was executed on only one core. From what I found, sched_getcpu() should give me the ID of the core on which the calling thread is currently scheduled. So I wrote the following test program:
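#include <cstddef>   // std::size_t
#include <iostream>
#include <random>
#include <sstream>
#include <omp.h>
#include <sched.h>   // sched_getcpu(); glibc declares it when _GNU_SOURCE is defined (g++ does this by default)

int main() {
    #pragma omp parallel
    {
        // Each thread has its own generator and its own partial sum.
        std::default_random_engine rand;
        int num = 0;
        // Distribute the loop iterations among the threads of the team.
        #pragma omp for
        for (std::size_t i = 0; i < 1000000000; ++i) num += rand();
        // Ask the kernel which CPU this thread is running on right now.
        int cpu = sched_getcpu();
        // Build the whole line first so that output from different threads
        // is less likely to interleave.
        std::ostringstream os;
        os << "Thread " << omp_get_thread_num() << " on cpu " << cpu
           << " num " << num << '\n';
        std::cout << os.str() << std::flush;
    }
}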
On my machine, this gives the following output (the random numbers will of course vary):
Thread 2 on cpu 0 num 127392776
Thread 0 on cpu 0 num 1980891664
Thread 3 on cpu 0 num 431821313
Thread 1 on cpu 0 num -1976497224
From this I assume that all threads execute on the same core (the one with ID 0). To be more certain, I also tried the approach from this answer. The results were the same. Additionally, using #pragma omp parallel num_threads(1) didn't make the execution slower (slightly faster, in fact), lending credibility to the theory that all threads use the same CPU; however, the fact that the CPU is always displayed as 0 makes me somewhat suspicious. I also checked GOMP_CPU_AFFINITY, which was initially not set, so I tried setting it to 0 1 2 3, which, from what I understand, should bind each thread to a different core. However, that didn't make a difference.
Since I develop on a Windows system, I use Linux in VirtualBox for my development. So I thought that maybe the virtual machine couldn't access all cores. However, the VirtualBox settings showed that the virtual machine should get all 4 cores, and running my test program 4 times at the same time seems to use all 4 cores, judging from the CPU utilization (and from the fact that the system became very unresponsive).
So my question is basically: what exactly is going on here? More to the point: is my deduction that all threads use the same core correct? If it is, what could be the reasons for that behavior?
Accepted answer by Grizzly
After some experimentation, I found out that the problem was that I was starting my program from inside the Eclipse IDE, which seemingly set the affinity to use only one core. I thought I had gotten the same problem when starting the program from outside the IDE, but a repeated test showed that it works just fine when started from the terminal instead of from inside the IDE.
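A process inherits its CPU affinity mask from whatever launched it, so one way to confirm this kind of problem is to print the mask from inside the program. The sketch below is an illustration, not part of the original answer; it assumes Linux with glibc, where sched_getaffinity() and the CPU_* macros from <sched.h> are available (g++ defines _GNU_SOURCE by default). The helper name allowed_cpus is hypothetical.

```cpp
#include <sched.h>   // sched_getaffinity, cpu_set_t, CPU_ZERO, CPU_ISSET (glibc, _GNU_SOURCE)
#include <vector>

// Hypothetical helper: returns the IDs of the CPUs the calling thread is
// allowed to run on, or an empty vector if the query fails.
std::vector<int> allowed_cpus() {
    cpu_set_t set;
    CPU_ZERO(&set);
    // A pid of 0 means "the calling thread".
    if (sched_getaffinity(0, sizeof(set), &set) != 0) return {};
    std::vector<int> cpus;
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu)
        if (CPU_ISSET(cpu, &set)) cpus.push_back(cpu);
    return cpus;
}
```

If the launcher had restricted the mask to one core, as the accepted answer suspects of Eclipse, calling this at the start of main would list only a single CPU, whereas a launch from the terminal should list all the cores the machine exposes.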
Answer by Nav
You should use #pragma omp parallel for.
And yes, you're right about not needing OMP_NUM_THREADS; calling omp_set_num_threads(4) should also have worked fine.
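As an illustration of the combined construct, here is a minimal sketch (not from the original answer) that writes a summation loop as a single #pragma omp parallel for with a reduction clause instead of per-thread partial sums. It sums the integers 0..n-1 rather than random numbers so that the result can be checked; the function name parallel_sum is hypothetical.

```cpp
#include <cstdint>

// Sum 0 + 1 + ... + (n - 1) using the combined "parallel for" construct.
// reduction(+:sum) gives each thread a private copy of sum and adds the
// copies together when the loop ends. Without -fopenmp the pragma is
// ignored and the loop runs serially with the same result.
std::int64_t parallel_sum(std::int64_t n) {
    std::int64_t sum = 0;
    #pragma omp parallel for reduction(+:sum)
    for (std::int64_t i = 0; i < n; ++i) sum += i;
    return sum;
}
```

Calling omp_set_num_threads(4) before this function, or setting OMP_NUM_THREADS=4 in the environment, would request four threads for the loop.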
Answer by krishnaraj
If you are running on Windows, try this:
c:\windows\system32\cmd.exe /C start /affinity F path\to\your\program.exe
/affinity 1 uses CPU0
/affinity 2 uses CPU1
/affinity 3 uses CPU0 and CPU1
/affinity 4 uses CPU2
/affinity F uses all 4 cores
The argument is a hexadecimal bitmask: write it in binary, and each set bit, counted from the right (the rightmost bit is CPU0), marks a core to be used.
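To make the mask arithmetic concrete, the following sketch (an illustration, not part of the original answer) decodes a hexadecimal mask string into the list of CPU indices it selects, using the same convention as the table above: the rightmost bit is CPU0. The function name cpus_in_mask is hypothetical.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Hypothetical helper: lists the CPU indices selected by a hexadecimal
// affinity mask such as the argument to "start /affinity".
// Bit 0 (the rightmost bit) is CPU0, bit 1 is CPU1, and so on.
// Returns an empty vector if the string contains a non-hex character.
std::vector<int> cpus_in_mask(const std::string& hex) {
    std::vector<int> cpus;
    int bit = 0;
    // Walk the digits from right (least significant) to left.
    for (auto it = hex.rbegin(); it != hex.rend(); ++it) {
        unsigned char c = static_cast<unsigned char>(*it);
        if (!std::isxdigit(c)) return {};
        int value = std::isdigit(c) ? c - '0' : std::tolower(c) - 'a' + 10;
        // Each hex digit covers four consecutive CPUs.
        for (int k = 0; k < 4; ++k, ++bit)
            if (value & (1 << k)) cpus.push_back(bit);
    }
    // Bits are visited from least significant upward, so cpus is ascending.
    return cpus;
}
```

For example, "3" is binary 0011 and selects CPU0 and CPU1, while "F" (1111) selects all four cores, matching the entries listed above.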
You can verify the affinity while the program is running by using Task Manager.
Answer by baol
I compiled your program using g++ 4.6 on Linux:
g++ --std=c++0x -fopenmp test.cc -o test
The output was, unsurprisingly:
Thread 2 on cpu 2
Thread 3 on cpu 1
910270973
Thread 1 on cpu 3
910270973
Thread 0 on cpu 0
910270973910270973
The fact that 4 threads are started (if you have not set the number of threads in any way, e.g. using OMP_NUM_THREADS) should imply that the program is able to see 4 usable CPUs. I cannot guess why it is not using them but I suspect a problem in your hardware/software setting, in some environment variable, or in the compiler options.
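Following that reasoning, a program can ask the OpenMP runtime directly how many processors it sees and how many threads it would start by default. This is a minimal sketch, not from the original answer, assuming a compiler that defines _OPENMP when OpenMP is enabled (as -fopenmp does); the OmpView struct and omp_view() function are hypothetical names.

```cpp
#ifdef _OPENMP
#include <omp.h>
#endif
#include <thread>

// Hypothetical summary of what the OpenMP runtime has to work with.
struct OmpView {
    int procs;        // processors the runtime reports as available
    int max_threads;  // threads a parallel region would use by default
};

OmpView omp_view() {
#ifdef _OPENMP
    return {omp_get_num_procs(), omp_get_max_threads()};
#else
    // Built without OpenMP: fall back to the standard library's estimate
    // of hardware threads and report that only one thread would run.
    unsigned hw = std::thread::hardware_concurrency();  // 0 if unknown
    return {hw == 0 ? 1 : static_cast<int>(hw), 1};
#endif
}
```

Printing both numbers when the program is started from the IDE and from a terminal would show whether the launch environment changes what the runtime sees.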