C++ 为什么 Opencv GPU 代码比 CPU 慢?
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/12074281/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
Why is OpenCV GPU code slower than CPU code?
提问by David Ding
I'm using OpenCV 2.4.2 + VS2010 on a notebook.
I tried some simple tests of the GPU module in OpenCV, but they showed the GPU was 100 times slower than the CPU code.
In this code, I just convert a color image to a grayscale image using the cvtColor function.
我正在笔记本上使用 OpenCV 2.4.2 + VS2010。
我尝试对 OpenCV 的 GPU 模块做一些简单测试,但结果表明 GPU 比 CPU 代码慢 100 倍。在这段代码中,我只是使用 cvtColor 函数将彩色图像转为灰度图像。
Here is my code: PART1 is the CPU code (testing CPU RGB2GRAY), PART2 uploads the image to the GPU, PART3 is the GPU RGB2GRAY, and PART4 is the CPU RGB2GRAY again. Three things puzzle me:
这是我的代码:PART1 是 CPU 代码(测试 CPU 的 RGB2GRAY),PART2 是把图像上传到 GPU,PART3 是 GPU 的 RGB2GRAY,PART4 是再跑一次 CPU 的 RGB2GRAY。有 3 件事让我很疑惑:
1. In my code, part1 takes 0.3 ms, while part4 (which is exactly the same as part1) takes 40 ms!!!
2. Part2, which uploads the image to the GPU, takes 6000 ms!!!
3. Part3 (the GPU code) takes 11 ms, which is very slow for such a simple image!
1. 在我的代码中,part1 是 0.3ms,而 part4(与 part1 完全相同)却是 40ms!!!
2. 把图像上传到 GPU 的 part2 是 6000ms!!!
3. part3(GPU 代码)是 11ms,对这么简单的图像来说太慢了!
#include "StdAfx.h"
#include <iostream>
#include "opencv2/opencv.hpp"
#include "opencv2/gpu/gpu.hpp"
#include "opencv2/gpu/gpumat.hpp"
#include "opencv2/core/core.hpp"
#include "opencv2/highgui/highgui.hpp"
#include <cuda.h>
#include <cuda_runtime_api.h>
#include <ctime>
#include <windows.h>
using namespace std;
using namespace cv;
using namespace cv::gpu;
int main()
{
    LARGE_INTEGER freq;
    LONGLONG QPart1, QPart6;
    double dfMinus, dfFreq, dfTim;
    QueryPerformanceFrequency(&freq);
    dfFreq = (double)freq.QuadPart;

    cout << getCudaEnabledDeviceCount() << endl;

    // Backslashes in string literals must be escaped, otherwise
    // "\C" and "\t" are (mis)interpreted as escape sequences.
    Mat img_src = imread("d:\\CUDA\\train.png", 1);

    // PART1 CPU code~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    // From color image to grayscale image.
    QueryPerformanceCounter(&freq);
    QPart1 = freq.QuadPart;
    Mat img_gray;
    cvtColor(img_src, img_gray, CV_BGR2GRAY);
    QueryPerformanceCounter(&freq);
    QPart6 = freq.QuadPart;
    dfMinus = (double)(QPart6 - QPart1);
    dfTim = 1000 * dfMinus / dfFreq;
    printf("CPU RGB2GRAY running time is %.2f ms\n\n", dfTim);

    // PART2 GPU upload image~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    GpuMat gimg_src;
    QueryPerformanceCounter(&freq);
    QPart1 = freq.QuadPart;
    gimg_src.upload(img_src);
    QueryPerformanceCounter(&freq);
    QPart6 = freq.QuadPart;
    dfMinus = (double)(QPart6 - QPart1);
    dfTim = 1000 * dfMinus / dfFreq;
    printf("Read image running time is %.2f ms\n\n", dfTim);

    GpuMat dst1;
    QueryPerformanceCounter(&freq);
    QPart1 = freq.QuadPart;
    /*dst.upload(src_host);*/
    dst1.upload(imread("d:\\CUDA\\train.png", 1));
    QueryPerformanceCounter(&freq);
    QPart6 = freq.QuadPart;
    dfMinus = (double)(QPart6 - QPart1);
    dfTim = 1000 * dfMinus / dfFreq;
    printf("Read image running time 2 is %.2f ms\n\n", dfTim);

    // PART3 GPU code~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    // gpuimage: from color image to grayscale image.
    QueryPerformanceCounter(&freq);
    QPart1 = freq.QuadPart;
    GpuMat gimg_gray;
    gpu::cvtColor(gimg_src, gimg_gray, CV_BGR2GRAY);
    QueryPerformanceCounter(&freq);
    QPart6 = freq.QuadPart;
    dfMinus = (double)(QPart6 - QPart1);
    dfTim = 1000 * dfMinus / dfFreq;
    printf("GPU RGB2GRAY running time is %.2f ms\n\n", dfTim);

    // PART4 CPU code (again)~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    // From color image to grayscale image.
    QueryPerformanceCounter(&freq);
    QPart1 = freq.QuadPart;
    Mat img_gray2;
    cvtColor(img_src, img_gray2, CV_BGR2GRAY);
    BOOL i_test = QueryPerformanceCounter(&freq);
    printf("%d \n", i_test);
    QPart6 = freq.QuadPart;
    dfMinus = (double)(QPart6 - QPart1);
    dfTim = 1000 * dfMinus / dfFreq;
    printf("CPU RGB2GRAY running time is %.2f ms\n\n", dfTim);

    cvWaitKey();
    getchar();
    return 0;
}
回答by Martin Beckett
cvtColor isn't doing very much work; to make a pixel grey, all you have to do is average three numbers.
cvtColor 并没有做多少工作:要把一个像素变灰,你只需要对三个数取平均。
The cvtColor code on the CPU uses SSE2 instructions to process up to 8 pixels at once, and if you have TBB it uses all the cores/hyperthreads. The CPU also runs at roughly 10x the clock speed of the GPU, and finally you don't have to copy the data onto the GPU and back.
CPU 上的 cvtColor 代码使用 SSE2 指令一次最多处理 8 个像素;如果你有 TBB,它会用上所有核心/超线程。此外 CPU 的时钟频率大约是 GPU 的 10 倍,而且你也不必把数据复制到 GPU 再拷回来。
回答by TimZaman
Most of the answers above are actually wrong. The reason it is slower by a factor of 20,000 is of course not that 'the CPU clock speed is faster' or that 'it has to copy the data to the GPU' (the accepted answer). These are factors, but by saying that you omit the fact that you have vastly more computing power for a problem that is embarrassingly parallel. Saying a 20,000x performance difference is because of the latter is just plain ridiculous. The author here knew something was wrong that isn't straightforward. Solution:
上面的大多数答案实际上是错误的。它慢了 20,000 倍的原因当然不是“CPU 时钟速度更快”和“必须把数据复制到 GPU”(已接受的答案)。这些都是因素,但这么说忽略了一个事实:对于一个高度并行(embarrassingly parallel)的问题,GPU 拥有多得多的计算能力。把 20,000 倍的性能差异归因于后者实在太荒谬了。提问者知道有些不对劲,而且原因并不显而易见。解决方案:
Your problem is that CUDA needs to initialize! It will always initialize on the first image, and that generally takes between 1 and 10 seconds, depending on the alignment of Jupiter and Mars. Now try this: do the computation twice and then time both runs. You will probably see that in that case the speeds are within the same order of magnitude, not 20,000x apart. Can you do something about this initialization? Nope, not that I know of. It's a snag.
你的问题是 CUDA 需要初始化!它总会在处理第一张图像时进行初始化,通常需要 1 到 10 秒,具体取决于木星和火星的排列。现在试试这个:把计算做两次,然后分别计时。你很可能会看到这时两者的速度在同一个数量级内,而不是相差 20,000 倍。你能对这个初始化做些什么吗?据我所知不能。这是个坑。
edit: I just re-read the post. You say you're running on a notebook. Those often have shabby GPUs, and CPUs with a decent turbo.
编辑:我刚刚重新读了帖子。你说你是在笔记本上运行的。笔记本通常 GPU 较弱,而 CPU 的睿频还不错。
回答by Tae
try to run more than once....
尝试运行不止一次....
-----------excerpt from http://opencv.willowgarage.com/wiki/OpenCV%20GPU%20FAQPerfomance
-----------摘自 http://opencv.willowgarage.com/wiki/OpenCV%20GPU%20FAQPerfomance
Why is the first function call slow?
为什么第一个函数调用很慢?
That is because of initialization overhead. On the first GPU function call, the CUDA Runtime API is initialized implicitly. Also, some GPU code is compiled (just-in-time compilation) for your video card on first use. So for performance measurement, it is necessary to make a dummy function call first and only then perform the timing tests.
那是因为初始化开销。第一次调用 GPU 函数时,CUDA Runtime API 会被隐式初始化。此外,首次使用时还会为你的显卡即时编译(JIT)一些 GPU 代码。所以进行性能测量时,需要先做一次空调用,然后再进行计时测试。
If it is critical for an application that runs GPU code only once, it is possible to use a compilation cache that persists over multiple runs. Please read the nvcc documentation for details (the CUDA_DEVCODE_CACHE environment variable).
如果应用程序只运行一次 GPU 代码,而这一点又很关键,则可以使用在多次运行之间持久保存的编译缓存。详情请阅读 nvcc 文档(CUDA_DEVCODE_CACHE 环境变量)。
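A sketch of what enabling that cache might look like in a Unix-like shell. The variable name comes from the nvcc documentation cited above; the cache directory path is an arbitrary choice, and the application binary is a placeholder:

```shell
# Point the persistent JIT compilation cache at a writable directory,
# then run the application; later runs reuse the cached device code
# instead of JIT-compiling again.
export CUDA_DEVCODE_CACHE="$HOME/.cuda_devcode_cache"
mkdir -p "$CUDA_DEVCODE_CACHE"
echo "JIT cache at: $CUDA_DEVCODE_CACHE"
# ./my_gpu_app   # placeholder for your own binary
```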
回答by 1''
cvtColor is a small operation, and any performance boost you get from doing it on the GPU is vastly outweighed by the memory transfer time between the host (CPU) and the device (GPU). Minimizing the latency of this memory transfer is a primary challenge of all GPU computing.
cvtColor 是一个很小的操作,在 GPU 上执行它带来的任何性能提升,都远远抵不过主机(CPU)和设备(GPU)之间的内存传输时间。尽量减小这种内存传输的延迟是所有 GPU 计算的首要挑战。
回答by mrgloom
What GPU do you have?
你有什么 GPU?
Check the compute capability; maybe that's the reason.
检查计算能力(compute capability),也许这就是原因。
https://developer.nvidia.com/cuda-gpus
https://developer.nvidia.com/cuda-gpus
This means that for devices with CC 1.3 and 2.0, binary images are ready to run. For all newer platforms, the PTX code for 1.3 is JIT'ed to a binary image. For devices with CC 1.1 and 1.2, the PTX for 1.1 is JIT'ed. For devices with CC 1.0, no code is available and the functions throw an exception. For platforms where JIT compilation is performed first, the run is slow.
这意味着对于 CC 1.3 和 2.0 的设备,二进制映像可以直接运行。对于所有更新的平台,1.3 的 PTX 代码会被 JIT 编译成二进制映像。对于 CC 1.1 和 1.2 的设备,会 JIT 编译 1.1 的 PTX。对于 CC 1.0 的设备,没有可用的代码,函数会抛出异常。对于需要先进行 JIT 编译的平台,这次运行会很慢。