Simple and fast matrix-vector multiplication in C/C++

Question

Asked by Serg

I frequently call matrix_vector_mult(), which multiplies a matrix by a vector. Its implementation is below.

Question: Is there a simple way to make it significantly faster, by at least a factor of two?

Remarks: 1) The size of the matrix is about 300x50. It doesn't change during the run. 2) It must work on both Windows and Linux.

double vectors_dot_prod(const double *x, const double *y, int n)
{
    double res = 0.0;
    int i;
    for (i = 0; i < n; i++)
    {
        res += x[i] * y[i];
    }
    return res;
}

void matrix_vector_mult(const double **mat, const double *vec, double *result, int rows, int cols)
{ // in matrix form: result = mat * vec;
    int i;
    for (i = 0; i < rows; i++)
    {
        result[i] = vectors_dot_prod(mat[i], vec, cols);
    }
}

Answer 1

Answered by 6502

In theory a good compiler should do this by itself. However, I tried it on my system (g++ 4.6.3) and got about twice the speed on a 300x50 matrix by hand-unrolling 4 multiplications (about 18 µs per matrix instead of 34 µs per matrix):

double vectors_dot_prod2(const double *x, const double *y, int n)
{
    double res = 0.0;
    int i = 0;
    for (; i <= n-4; i+=4)
    {
        res += (x[i] * y[i] +
                x[i+1] * y[i+1] +
                x[i+2] * y[i+2] +
                x[i+3] * y[i+3]);
    }
    for (; i < n; i++)
    {
        res += x[i] * y[i];
    }
    return res;
}

However, I expect the results of this level of micro-optimization to vary wildly between systems.
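Since the gain depends on the compiler, its flags, and the CPU, it is worth measuring on your own machine. The following is only a minimal timing sketch. It assumes the standard clock() is precise enough once the call is repeated many times. The names dot_fn and time_dot_prod, and the sink parameter, are hypothetical and not part of the original answer. The plain dot product is copied from the question so the block is self-contained.

```c
#include <time.h>

/* Signature shared by vectors_dot_prod and vectors_dot_prod2. */
typedef double (*dot_fn)(const double *x, const double *y, int n);

/* Plain version, copied from the question. */
double vectors_dot_prod(const double *x, const double *y, int n)
{
    double res = 0.0;
    int i;
    for (i = 0; i < n; i++)
    {
        res += x[i] * y[i];
    }
    return res;
}

/* Hypothetical helper: average CPU seconds per call of f over reps calls.
   The sum of all results is stored in *sink so the work is not discarded. */
double time_dot_prod(dot_fn f, const double *x, const double *y, int n,
                     int reps, double *sink)
{
    double total = 0.0;
    clock_t start, stop;
    int r;

    start = clock();
    for (r = 0; r < reps; r++)
    {
        total += f(x, y, n);
    }
    stop = clock();

    *sink = total;
    return (double)(stop - start) / CLOCKS_PER_SEC / reps;
}
```

Passing vectors_dot_prod and then vectors_dot_prod2 with the same inputs and a large reps value gives two per-call times that can be compared directly.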

Answer 2

Answered by Useless

As Zhenya says, just use a good BLAS or matrix math library.

If for some reason you can't do that, see if your compiler can unroll and/or vectorize your loops; making sure rows and cols are both constants at the call site may help, assuming the functions you posted are available for inlining.
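As a rough illustration of that idea, the sketch below marks both functions static inline and calls them with literal sizes, so an optimizing compiler can see the trip counts. The names dot_inline, matvec_inline, and multiply_300x50 are hypothetical, and whether the loops actually get unrolled or vectorized depends on the compiler and flags. With GCC, for example, vectorizing a floating-point sum like this one generally needs -O3 together with -ffast-math, because it reorders the additions.

```c
/* Same loops as in the question, but visible for inlining. */
static inline double dot_inline(const double *x, const double *y, int n)
{
    double res = 0.0;
    int i;
    for (i = 0; i < n; i++)
    {
        res += x[i] * y[i];
    }
    return res;
}

static inline void matvec_inline(const double **mat, const double *vec,
                                 double *result, int rows, int cols)
{
    int i;
    for (i = 0; i < rows; i++)
    {
        result[i] = dot_inline(mat[i], vec, cols);
    }
}

/* Call site: rows and cols are literal constants (the 300x50 size from
   the question), so after inlining both loop bounds are known. */
void multiply_300x50(const double **mat, const double *vec, double *result)
{
    matvec_inline(mat, vec, result, 300, 50);
}
```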

If you still can't get the speedup you need, you're looking at manual unrolling, and vectorizing using extensions or inline assembler.
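One possible form of "vectorizing using extensions" is SSE2 intrinsics, which are available with both GCC and MSVC on x86. The sketch below is an assumption about what such a version could look like, not the answerer's code. The function name vectors_dot_prod_sse2 and the HAVE_SSE2 macro are hypothetical, and on targets without SSE2 the function falls back to the plain scalar loop. Note that the two-lane accumulation changes the order of the additions, so the result can differ from the scalar version in the last bits.

```c
#if defined(__SSE2__) || defined(_M_X64) || \
    (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>
#define HAVE_SSE2 1
#endif

double vectors_dot_prod_sse2(const double *x, const double *y, int n)
{
    double res = 0.0;
    int i = 0;
#ifdef HAVE_SSE2
    double lanes[2];
    __m128d acc = _mm_setzero_pd();   /* two partial sums */

    /* Process two doubles per iteration; unaligned loads are used
       because nothing is known about the alignment of x and y. */
    for (; i <= n - 2; i += 2)
    {
        __m128d a = _mm_loadu_pd(x + i);
        __m128d b = _mm_loadu_pd(y + i);
        acc = _mm_add_pd(acc, _mm_mul_pd(a, b));
    }

    /* Combine the two partial sums. */
    _mm_storeu_pd(lanes, acc);
    res = lanes[0] + lanes[1];
#endif
    /* Remaining element (or the whole vector without SSE2). */
    for (; i < n; i++)
    {
        res += x[i] * y[i];
    }
    return res;
}
```

It has the same signature as vectors_dot_prod, so it can be dropped into matrix_vector_mult unchanged.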

Answer 3

Answered by djechlin

If the size is constant and known in advance, pass it in as a preprocessor macro (a compile-time constant), which lets the compiler optimize more fully.
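A minimal sketch of that suggestion, assuming the dimensions are supplied on the compiler command line (for example -DMAT_ROWS=300 -DMAT_COLS=50 with GCC or Clang, /DMAT_ROWS=300 /DMAT_COLS=50 with MSVC). The macro names MAT_ROWS and MAT_COLS and the function name matrix_vector_mult_fixed are hypothetical. The defaults are the size mentioned in the question.

```c
/* Dimensions fixed at compile time; override them from the build. */
#ifndef MAT_ROWS
#define MAT_ROWS 300
#endif
#ifndef MAT_COLS
#define MAT_COLS 50
#endif

/* Same computation as matrix_vector_mult, but both loop bounds are
   compile-time constants instead of function arguments. */
void matrix_vector_mult_fixed(const double **mat, const double *vec,
                              double *result)
{
    int i, j;
    for (i = 0; i < MAT_ROWS; i++)
    {
        double res = 0.0;
        for (j = 0; j < MAT_COLS; j++)
        {
            res += mat[i][j] * vec[j];
        }
        result[i] = res;
    }
}
```

The trade-off is that a different matrix size requires recompiling.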
