
Note: this content comes from Stack Overflow and is provided under the CC BY-SA 4.0 license; if you reuse or share it, you must attribute the original authors. Original question: http://stackoverflow.com/questions/16737298/

What is the fastest way to transpose a matrix in C++?

Tags: c++, algorithm, matrix, transpose

Asked by mans

I have a (relatively big) matrix that I need to transpose. For example, assume that my matrix is


a b c d e f
g h i j k l
m n o p q r 

I want the result be as follows:


a g m
b h n
c i o
d j p
e k q
f l r

What is the fastest way to do this?


Answered by

This is a good question. There are many reasons you would want to actually transpose the matrix in memory rather than just swapping coordinates, e.g. in matrix multiplication and Gaussian smearing.


First let me list one of the functions I use for the transpose (EDIT: please see the end of my answer, where I found a much faster solution):


// Transpose src (N x M, row-major) into dst (M x N, row-major)
void transpose(float *src, float *dst, const int N, const int M) {
    #pragma omp parallel for
    for(int n = 0; n<N*M; n++) {
        int i = n/N;           // row of dst
        int j = n%N;           // column of dst
        dst[n] = src[M*j + i]; // dst[i][j] = src[j][i]
    }
}
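
A quick usage sketch (my addition, not part of the answer): src is an N x M row-major matrix and dst receives the M x N transpose.

#include <vector>

const int N = 1000, M = 500; // example dimensions
std::vector<float> src(N * M), dst(M * N);
// ... fill src ...
transpose(src.data(), dst.data(), N, M); // dst is now the M x N transpose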

Now let's see why the transpose is useful. Consider the matrix multiplication C = A*B, where A is N x M and B is M x K. We could do it this way:


// Naive product: C (N x K) = A (N x M) * B (M x K), all row-major
for(int i=0; i<N; i++) {
    for(int j=0; j<K; j++) {
        float tmp = 0;
        for(int l=0; l<M; l++) {
            tmp += A[M*i+l]*B[K*l+j]; // B is read column-wise: stride-K accesses
        }
        C[K*i + j] = tmp;
    }
}

That way, however, is going to have a lot of cache misses, because the inner loop strides through B column-wise. A much faster solution is to take the transpose of B first:


transpose(B); // (schematic: transpose B in place; B is now K x M)
for(int i=0; i<N; i++) {
    for(int j=0; j<K; j++) {
        float tmp = 0;
        for(int l=0; l<M; l++) {
            tmp += A[M*i+l]*B[M*j+l]; // row j of transposed B has length M; both A and B are read sequentially
        }
        C[K*i + j] = tmp;
    }
}
transpose(B); // transpose back so B is unchanged for the caller

Matrix multiplication is O(n^3) and the transpose is O(n^2), so taking the transpose should have a negligible effect on the computation time (for large n). In matrix multiplication, loop tiling is even more effective than taking the transpose, but that's much more complicated.

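To make the loop-tiling remark concrete, here is a minimal sketch (my code, not the author's): the same C = A*B product, computed over TILE-sized blocks so the working set stays in cache. It assumes N, K and M are multiples of TILE and that C is zero-initialized.

#define TILE 32
for(int i0 = 0; i0 < N; i0 += TILE)
    for(int j0 = 0; j0 < K; j0 += TILE)
        for(int l0 = 0; l0 < M; l0 += TILE)
            // multiply one TILE x TILE block of A with one of B
            for(int i = i0; i < i0 + TILE; i++)
                for(int j = j0; j < j0 + TILE; j++) {
                    float tmp = C[K*i + j];
                    for(int l = l0; l < l0 + TILE; l++)
                        tmp += A[M*i + l] * B[K*l + j];
                    C[K*i + j] = tmp;
                }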

I wish I knew a faster way to do the transpose (Edit: I found a faster solution, see the end of my answer). When Haswell/AVX2 comes out in a few weeks it will have a gather function. I don't know if that will be helpful in this case, but I could imagine gathering a column and writing out a row. Maybe it will make the transpose unnecessary.

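For illustration, here is a hedged sketch of that gather idea, written after the fact (my code; the function name and layout assumptions are mine): gather one column of a row-major matrix and write it out as a contiguous row. Note that gather throughput on real hardware is modest, so this is not necessarily faster than the shuffle-based transposes below.

#include <immintrin.h>

// Copy column `col` of a row-major matrix with row stride M into dst[0..7]
// (assumes at least 8 rows)
void gather_column_avx2(const float *src, float *dst, int col, int M) {
    __m256i idx = _mm256_setr_epi32(0, M, 2*M, 3*M, 4*M, 5*M, 6*M, 7*M);
    __m256 column = _mm256_i32gather_ps(src + col, idx, 4); // scale is in bytes
    _mm256_storeu_ps(dst, column); // one row of the transpose
}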

For Gaussian smearing, what you do is smear horizontally and then smear vertically. But smearing vertically has the cache problem, so what you do instead is the following (a short sketch in code follows the steps):


Smear image horizontally
transpose output 
Smear output horizontally
transpose output
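
Put together, the pipeline looks roughly like this (my sketch; smear_horizontal is a hypothetical row-blur routine, and transpose is the function from the top of this answer):

smear_horizontal(img, tmp, h, w); // 1. smear the h x w image horizontally
transpose(tmp, out, h, w);        // 2. transpose the output (now w x h)
smear_horizontal(out, tmp, w, h); // 3. smear the former columns, now rows
transpose(tmp, out, w, h);        // 4. transpose back to h x w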

Here is a paper by Intel explaining this: http://software.intel.com/en-us/articles/iir-gaussian-blur-filter-implementation-using-intel-advanced-vector-extensions


Lastly, what I actually do in matrix multiplication (and in Gaussian smearing) is not take exactly the transpose, but take the transpose in widths of a certain vector size (e.g. 4 or 8 for SSE/AVX). Here is the function I use:


// Reorder A (N x M, row-major) into B as vertical panels of width vec_size:
// panel p holds columns [p*vec_size, (p+1)*vec_size) of A, stored row by row
void reorder_matrix(const float* A, float* B, const int N, const int M, const int vec_size) {
    #pragma omp parallel for
    for(int n=0; n<M*N; n++) {
        int k = vec_size*(n/N/vec_size); // first column of the current panel
        int i = (n/vec_size)%N;          // row of A within the panel
        int j = n%vec_size;              // column offset within the panel
        B[n] = A[M*i + k + j];
    }
}

EDIT:


I tried several functions to find the fastest transpose for large matrices. In the end the fastest result is to use loop blocking with block_size=16 (Edit: I found a faster solution using SSE and loop blocking - see below). This code works for any NxM matrix (i.e. the matrix does not have to be square).


inline void transpose_scalar_block(float *A, float *B, const int lda, const int ldb, const int block_size) {
    // transpose a single block_size x block_size tile
    // (parallelization is left to the caller, transpose_block)
    for(int i=0; i<block_size; i++) {
        for(int j=0; j<block_size; j++) {
            B[j*ldb + i] = A[i*lda +j];
        }
    }
}

// Transpose an n x m matrix A (row stride lda) into B (row stride ldb),
// one block_size x block_size tile at a time
inline void transpose_block(float *A, float *B, const int n, const int m, const int lda, const int ldb, const int block_size) {
    #pragma omp parallel for
    for(int i=0; i<n; i+=block_size) {
        for(int j=0; j<m; j+=block_size) {
            transpose_scalar_block(&A[i*lda +j], &B[j*ldb + i], lda, ldb, block_size);
        }
    }
}

The values lda and ldb are the row strides (leading dimensions) of the matrices. These need to be multiples of the block size. To find the values and allocate the memory for e.g. a 3000x1001 matrix, I do something like this:


#include <xmmintrin.h> // for _mm_malloc

#define ROUND_UP(x, s) (((x)+((s)-1)) & -(s)) // round x up to a multiple of s (s a power of two)
const int n = 3000;
const int m = 1001;
int lda = ROUND_UP(m, 16); // padded row stride of the source
int ldb = ROUND_UP(n, 16); // padded row stride of the destination

float *A = (float*)_mm_malloc(sizeof(float)*lda*ldb, 64); // 64-byte aligned
float *B = (float*)_mm_malloc(sizeof(float)*lda*ldb, 64);

For 3000x1001 this returns ldb = 3008 and lda = 1008.

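With those padded buffers in place, the blocked transpose above can be called like this (my usage sketch):

transpose_block(A, B, n, m, lda, ldb, 16);          // scalar blocked version
// transpose_block_SSE4x4(A, B, n, m, lda, ldb, 16); // SSE version shown below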

Edit:


I found an even faster solution using SSE intrinsics:


#include <xmmintrin.h> // SSE intrinsics and _MM_TRANSPOSE4_PS

// Transpose one 4x4 tile of floats
inline void transpose4x4_SSE(float *A, float *B, const int lda, const int ldb) {
    __m128 row1 = _mm_load_ps(&A[0*lda]); // aligned loads: rows must be 16-byte aligned
    __m128 row2 = _mm_load_ps(&A[1*lda]);
    __m128 row3 = _mm_load_ps(&A[2*lda]);
    __m128 row4 = _mm_load_ps(&A[3*lda]);
     _MM_TRANSPOSE4_PS(row1, row2, row3, row4); // 4x4 in-register transpose
     _mm_store_ps(&B[0*ldb], row1);
     _mm_store_ps(&B[1*ldb], row2);
     _mm_store_ps(&B[2*ldb], row3);
     _mm_store_ps(&B[3*ldb], row4);
}

// Blocked transpose built from 4x4 SSE tiles; n and m are assumed to be
// multiples of 4, and lda/ldb multiples of block_size (see the padding above)
inline void transpose_block_SSE4x4(float *A, float *B, const int n, const int m, const int lda, const int ldb ,const int block_size) {
    #pragma omp parallel for
    for(int i=0; i<n; i+=block_size) {
        for(int j=0; j<m; j+=block_size) {
            int max_i2 = i+block_size < n ? i + block_size : n; // clip the last block
            int max_j2 = j+block_size < m ? j + block_size : m;
            for(int i2=i; i2<max_i2; i2+=4) {
                for(int j2=j; j2<max_j2; j2+=4) {
                    transpose4x4_SSE(&A[i2*lda +j2], &B[j2*ldb + i2], lda, ldb);
                }
            }
        }
    }
}

Answered by Shafik Yaghmour

This is going to depend on your application, but in general the fastest way to transpose a matrix is to invert your coordinates when you do a lookup; then you do not have to actually move any data.

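A minimal sketch of that idea (my code, not the answerer's): wrap the data in a "transposed view" whose accessor swaps the indices, so no element ever moves.

struct TransposedView {
    const float *data; // row-major source matrix
    int rowlen;        // number of columns in the source
    // element (i, j) of the transpose is element (j, i) of the source
    float operator()(int i, int j) const { return data[j * rowlen + i]; }
};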

Answered by Z boson

Some details about transposing 4x4 square float matrices (I will discuss 32-bit integers later) with x86 hardware. It's helpful to start here in order to transpose larger square matrices such as 8x8 or 16x16.


_MM_TRANSPOSE4_PS(r0, r1, r2, r3) is implemented differently by different compilers. GCC and ICC (I have not checked Clang) use unpcklps, unpckhps, unpcklpd, unpckhpd, whereas MSVC uses only shufps. We can actually combine these two approaches like this:


t0 = _mm_unpacklo_ps(r0, r1); // a0 b0 a1 b1
t1 = _mm_unpackhi_ps(r0, r1); // a2 b2 a3 b3
t2 = _mm_unpacklo_ps(r2, r3); // c0 d0 c1 d1
t3 = _mm_unpackhi_ps(r2, r3); // c2 d2 c3 d3

r0 = _mm_shuffle_ps(t0,t2, 0x44); // a0 b0 c0 d0
r1 = _mm_shuffle_ps(t0,t2, 0xEE); // a1 b1 c1 d1
r2 = _mm_shuffle_ps(t1,t3, 0x44); // a2 b2 c2 d2
r3 = _mm_shuffle_ps(t1,t3, 0xEE); // a3 b3 c3 d3

One interesting observation is that two shuffles can be converted to one shuffle and two blends (SSE4.1), like this:


t0 = _mm_unpacklo_ps(r0, r1); // a0 b0 a1 b1
t1 = _mm_unpackhi_ps(r0, r1); // a2 b2 a3 b3
t2 = _mm_unpacklo_ps(r2, r3); // c0 d0 c1 d1
t3 = _mm_unpackhi_ps(r2, r3); // c2 d2 c3 d3

v  = _mm_shuffle_ps(t0,t2, 0x4E); // a1 b1 c0 d0
r0 = _mm_blend_ps(t0,v, 0xC);     // a0 b0 c0 d0
r1 = _mm_blend_ps(t2,v, 0x3);     // a1 b1 c1 d1
v  = _mm_shuffle_ps(t1,t3, 0x4E); // a3 b3 c2 d2
r2 = _mm_blend_ps(t1,v, 0xC);     // a2 b2 c2 d2
r3 = _mm_blend_ps(t3,v, 0x3);     // a3 b3 c3 d3

This effectively converts 4 shuffles into 2 shuffles and 4 blends. It uses 2 more instructions than the implementations in GCC, ICC, and MSVC. The advantage is that it reduces port pressure, which may be a benefit in some circumstances: currently, all shuffles and unpacks can go to only one particular port, whereas the blends can go to either of two different ports.


I tried using 8 shuffles like MSVC and converting that into 4 shuffles + 8 blends, but it did not work. I still had to use 4 unpacks.


I used this same technique for an 8x8 float transpose (see towards the end of that answer): https://stackoverflow.com/a/25627536/2542702. In that answer I still had to use 8 unpacks, but I managed to convert the 8 shuffles into 4 shuffles and 8 blends.


For 32-bit integers there is nothing like shufps (except for 128-bit shuffles with AVX512), so it can only be implemented with unpacks, which I don't think can be converted to blends (efficiently). With AVX512, vshufi32x4 acts effectively like shufps except on 128-bit lanes of 4 integers instead of 32-bit floats, so this same technique might be possible with vshufi32x4 in some cases. With Knights Landing, shuffles are four times slower (in throughput) than blends.


Answered by Khaled.K

Consider each row as a column, and each column as a row: use j,i instead of i,j.


demo: http://ideone.com/lvsxKZ


#include <iostream> 
using namespace std;

int main ()
{
    char A [3][3] =
    {
        { 'a', 'b', 'c' },
        { 'd', 'e', 'f' },
        { 'g', 'h', 'i' }
    };

    cout << "A = " << endl << endl;

    // print matrix A
    for (int i=0; i<3; i++)
    {
        for (int j=0; j<3; j++) cout << A[i][j];
        cout << endl;
    }

    cout << endl << "A transpose = " << endl << endl;

    // print A transpose
    for (int i=0; i<3; i++)
    {
        for (int j=0; j<3; j++) cout << A[j][i];
        cout << endl;
    }

    return 0;
}

Answered by Rachel Gallen

#include <vector>

// a is width x height; b must be pre-sized to height x width and passed by
// reference so the caller sees the result
template <class T>
void transpose( const std::vector< std::vector<T> > &a,
                std::vector< std::vector<T> > &b,
                int width, int height)
{
    for (int i = 0; i < width; i++)
    {
        for (int j = 0; j < height; j++)
        {
            b[j][i] = a[i][j];
        }
    }
}
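
A usage sketch (my addition): b must be allocated with the transposed shape before the call, since the function only assigns to existing elements.

std::vector< std::vector<float> > a(width, std::vector<float>(height));
std::vector< std::vector<float> > b(height, std::vector<float>(width));
transpose(a, b, width, height);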

Answered by Reza Baram

Transposing without any overhead (the class is not complete):


#include <utility> // std::swap

class Matrix{
   double *data; // suppose this will point to the elements
   double _get1(int i, int j){return data[i*M+j];} // normal (row-major) access
   double _get2(int i, int j){return data[j*N+i];} // used when transposed

   public:
   int M, N; // dimensions
   double (Matrix::*get_p)(int, int); // pointer to member: current access function
   Matrix(int _M,int _N):M(_M), N(_N){
     // allocate data here
     get_p=&Matrix::_get1; // initialised with normal access
     }

   double get(int i, int j){
     // the indirect call through get_p is inline-friendly; the compiler should
     // be intelligent enough to remove the extra call
     return (this->*get_p)(i,j);
    }
   void transpose(){ // transposing twice gives back the original
     if(get_p==&Matrix::_get1) get_p=&Matrix::_get2;
     else get_p=&Matrix::_get1;
     std::swap(M,N);
     }
};

It can be used like this:


Matrix M(100,200);
double x=M.get(17,45);
M.transpose();
x=M.get(17,45); // = original M(45,17)

Of course I didn't bother with the memory management here, which is crucial but a different topic.


Answered by Sandeep K V

If the size of the arrays is known beforehand, then we could use a union to help us, like this:


#include <bits/stdc++.h>
using namespace std;

// Note: the union only reinterprets the same row-major storage with a
// different shape (2x3 as 3x2); it does not reorder the elements, so brr
// is a reshape rather than a true transpose, and reading the union member
// that was not most recently written is undefined behavior in C++
union ua{
    int arr[2][3];
    int brr[3][2];
};

int main() {
    union ua uav;
    int karr[2][3] = {{1,2,3},{4,5,6}};
    memcpy(uav.arr,karr,sizeof(karr));
    for (int i=0;i<3;i++)
    {
        for (int j=0;j<2;j++)
            cout<<uav.brr[i][j]<<" ";
        cout<<'\n';
    }

    return 0;
}

Answered by Jorge Bellon

Modern linear algebra libraries include optimized versions of the most common operations. Many of them include dynamic CPU dispatch, which chooses the best implementation for the hardware at program execution time (without compromising on portability).


This is commonly a better alternative to performing manual optimization of your functions via vector-extension intrinsics. The latter ties your implementation to a particular hardware vendor and model: if you decide to move to a different vendor (e.g. Power, ARM) or to a newer vector extension (e.g. AVX512), you will need to re-implement it to get the most out of them.


MKL, for example, includes the BLAS extension function imatcopy for transposition. You can find it in other implementations, such as OpenBLAS, as well:


#include <mkl.h>

// In-place transpose of an n x m row-major matrix via the BLAS extension
void transpose( float* a, int n, int m ) {
    const char row_major = 'R';
    const char transpose = 'T';
    const float alpha = 1.0f; // scaling factor applied during the copy
    mkl_simatcopy (row_major, transpose, n, m, alpha, a, n, n);
}

For a C++ project, you can make use of Armadillo:


#include <armadillo>

void transpose( arma::mat &matrix ) {
    arma::inplace_trans(matrix);
}
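
Eigen is another common choice in the same spirit (my addition, not part of the original answer):

#include <Eigen/Dense>

void transpose( Eigen::MatrixXf &matrix ) {
    matrix.transposeInPlace(); // resizes as needed for non-square dynamic matrices
}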

Answered by Gennady.F

Intel MKL offers in-place and out-of-place transposition/copying of matrices. Here is the link to the documentation. I would recommend trying the out-of-place implementation, as it is faster than in-place, and note that the documentation of the latest version of MKL contains some mistakes.

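A hedged sketch of the out-of-place variant (my code; mkl_somatcopy is the single-precision BLAS-extension routine): it writes the transpose of an n x m row-major matrix A into a separate m x n buffer B, leaving A untouched.

#include <mkl.h>

void transpose_out_of_place( const float *A, float *B, int n, int m ) {
    // 'R' = row-major, 'T' = transpose; lda = m (source), ldb = n (destination)
    mkl_somatcopy('R', 'T', n, m, 1.0f, A, m, B, n);
}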

Answered by Fayez Abdlrazaq Deab

I think the fastest way should not take more than O(n^2), and this way you can use just O(1) extra space: the way to do that is to swap in pairs, because when you transpose a matrix, M[i][j] and M[j][i] exchange places. So store M[i][j] in temp, then M[i][j] = M[j][i], and as the last step M[j][i] = temp. This can be done in one pass, so it should take O(n^2). (Note that this in-place pairwise swap applies to square matrices; rectangular matrices need a different in-place scheme.)

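A sketch of that in-place pairwise swap for a square n x n row-major matrix (my code, following the description above):

#include <utility>

void transpose_in_place(float *M, int n) {
    for (int i = 0; i < n; i++)
        for (int j = i + 1; j < n; j++)        // visit each pair once
            std::swap(M[i*n + j], M[j*n + i]); // M[i][j] <-> M[j][i]
}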