Simple and fast matrix-vector multiplication in C / C++
Original URL: http://stackoverflow.com/questions/12289235/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow.
Question by Serg
I need to call matrix_vector_mult() frequently. It multiplies a matrix by a vector, and its implementation is below.
Question: Is there a simple way to make it significantly faster, at least twice as fast?
Remarks: 1) The size of the matrix is about 300x50. It doesn't change during the run. 2) It must work on both Windows and Linux.
```c
/* Dot product of two vectors of length n. */
double vectors_dot_prod(const double *x, const double *y, int n)
{
    double res = 0.0;
    int i;
    for (i = 0; i < n; i++)
    {
        res += x[i] * y[i];
    }
    return res;
}

/* mat is an array of `rows` row pointers, each pointing to `cols` doubles. */
void matrix_vector_mult(const double **mat, const double *vec, double *result, int rows, int cols)
{ // in matrix form: result = mat * vec;
    int i;
    for (i = 0; i < rows; i++)
    {
        result[i] = vectors_dot_prod(mat[i], vec, cols);
    }
}
```
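For comparison, here is a sketch of the same computation over a single contiguous row-major buffer. This layout is an assumption, since the question uses an array of row pointers, and `matrix_vector_mult_flat` is a hypothetical name. A flat `const double *` parameter saves one pointer load per row. It also avoids a C/C++ wart: a `double **` does not convert implicitly to `const double **`, so calling the original function with a `double **` needs a cast.

```c
#include <assert.h>
#include <stddef.h>

/* Plain dot product of two length-n vectors. */
static double dot_prod(const double *x, const double *y, int n)
{
    double res = 0.0;
    for (int i = 0; i < n; i++)
        res += x[i] * y[i];
    return res;
}

/* Hypothetical variant: element (r, c) of the matrix is mat[r * cols + c]. */
void matrix_vector_mult_flat(const double *mat, const double *vec,
                             double *result, int rows, int cols)
{
    assert(rows >= 0 && cols >= 0);
    for (int r = 0; r < rows; r++)
        result[r] = dot_prod(mat + (size_t)r * cols, vec, cols);
}
```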
Answer by 6502
In theory a good compiler should do this by itself, but I tried it on my system (g++ 4.6.3). By manually unrolling the loop four multiplications at a time, I got about twice the speed on a 300x50 matrix (about 18 µs per matrix instead of 34 µs):
```c
double vectors_dot_prod2(const double *x, const double *y, int n)
{
    double res = 0.0;
    int i = 0;
    /* main loop: four multiplications per iteration */
    for (; i <= n-4; i+=4)
    {
        res += (x[i] * y[i] +
                x[i+1] * y[i+1] +
                x[i+2] * y[i+2] +
                x[i+3] * y[i+3]);
    }
    /* remaining 0-3 elements when n is not a multiple of 4 */
    for (; i < n; i++)
    {
        res += x[i] * y[i];
    }
    return res;
}
```
However, I expect the results of this level of micro-optimization to vary widely between systems.
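The unrolled loop in this answer still feeds a single accumulator, `res`. A common variation, not part of the answer, keeps four independent partial sums. Each sum then forms its own dependency chain, and the CPU can overlap the four chains. The name `vectors_dot_prod4acc` is hypothetical. The additions happen in a different order than in the plain loop, so for general floating-point data the result can differ in the last bits.

```c
#include <assert.h>

/* Variation on manual unrolling: four separate running sums. */
double vectors_dot_prod4acc(const double *x, const double *y, int n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    int i = 0;
    assert(n >= 0);
    for (; i <= n - 4; i += 4) {
        s0 += x[i]     * y[i];
        s1 += x[i + 1] * y[i + 1];
        s2 += x[i + 2] * y[i + 2];
        s3 += x[i + 3] * y[i + 3];
    }
    for (; i < n; i++)          /* leftover 0-3 elements */
        s0 += x[i] * y[i];
    return (s0 + s1) + (s2 + s3);
}
```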
Answer by Useless
As Zhenya says, just use a good BLAS or matrix math library.
If for some reason you can't do that, see if your compiler can unroll and/or vectorize your loops. Making sure rows and cols are both constants at the call site may help, assuming the functions you posted are available for inlining.
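Here is a minimal sketch of the inlining advice, assuming a flat row-major matrix rather than the question's row pointers. The kernels keep the `(rows, cols)` parameters but are declared `static inline`. The hypothetical wrapper `matvec_300x50` passes literal sizes, so once the kernels are inlined the compiler sees constant trip counts. Whether it then unrolls or vectorizes depends on the compiler and its flags. With GCC or Clang, vectorizing a floating-point sum usually also requires permission to reorder additions, for example `-ffast-math`.

```c
#include <assert.h>

static inline double dot_inl(const double *x, const double *y, int n)
{
    double res = 0.0;
    for (int i = 0; i < n; i++)
        res += x[i] * y[i];
    return res;
}

/* result = mat * vec for a flat row-major rows x cols matrix. */
static inline void matvec_inl(const double *mat, const double *vec,
                              double *result, int rows, int cols)
{
    assert(rows >= 0 && cols >= 0);
    for (int r = 0; r < rows; r++)
        result[r] = dot_inl(mat + r * cols, vec, cols);
}

/* Hypothetical call site for the 300x50 case: after inlining,
 * rows and cols are the compile-time constants 300 and 50. */
void matvec_300x50(const double *mat, const double *vec, double *result)
{
    matvec_inl(mat, vec, result, 300, 50);
}
```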
If you still can't get the speedup you need, you're looking at manual unrolling, and at vectorizing with compiler extensions or inline assembler.
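Here is a sketch of manual vectorization using SSE2 intrinsics from `<emmintrin.h>`, processing two doubles per instruction. GCC, Clang and MSVC all provide these intrinsics on x86, so this fits the Windows/Linux requirement on that architecture. On targets without SSE2, such as ARM, the code falls back to the plain loop. The function name `vectors_dot_prod_sse2` is hypothetical, and unaligned loads are used so no alignment is assumed.

```c
#include <assert.h>

#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>   /* SSE2 intrinsics */
#define HAVE_SSE2 1
#endif

double vectors_dot_prod_sse2(const double *x, const double *y, int n)
{
    int i = 0;
    double res;
    assert(n >= 0);
#ifdef HAVE_SSE2
    __m128d acc = _mm_setzero_pd();           /* two running sums */
    for (; i <= n - 2; i += 2) {
        __m128d a = _mm_loadu_pd(x + i);      /* unaligned load of x[i], x[i+1] */
        __m128d b = _mm_loadu_pd(y + i);
        acc = _mm_add_pd(acc, _mm_mul_pd(a, b));
    }
    /* horizontal add: lane 0 + lane 1 */
    __m128d hi = _mm_unpackhi_pd(acc, acc);
    res = _mm_cvtsd_f64(_mm_add_sd(acc, hi));
#else
    res = 0.0;
#endif
    for (; i < n; i++)   /* odd leftover element, or everything without SSE2 */
        res += x[i] * y[i];
    return res;
}
```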
Answer by djechlin
If the size is constant and known in advance, pass it in as a preprocessor constant (a macro), which lets the compiler optimize more fully.
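Here is a sketch of the preprocessor approach, using hypothetical macro names `MAT_ROWS` and `MAT_COLS`. They default to the question's 300x50, and you can override them at build time, for example `-DMAT_ROWS=300 -DMAT_COLS=50` with GCC/Clang or `/DMAT_ROWS=300 /DMAT_COLS=50` with MSVC. Both loop bounds are then compile-time constants. The sketch assumes a flat row-major matrix. `result` must not alias `vec`, because `vec` is still being read while `result` is written.

```c
#include <assert.h>

#ifndef MAT_ROWS
#define MAT_ROWS 300
#endif
#ifndef MAT_COLS
#define MAT_COLS 50
#endif

/* result = mat * vec with sizes fixed by the preprocessor. */
void matvec_pp(const double *mat, const double *vec, double *result)
{
    assert((const double *)result != vec);   /* no in-place update */
    for (int r = 0; r < MAT_ROWS; r++) {
        const double *row = mat + r * MAT_COLS;
        double res = 0.0;
        for (int c = 0; c < MAT_COLS; c++)   /* constant trip count */
            res += row[c] * vec[c];
        result[r] = res;
    }
}
```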