CUDA: Tiled matrix-matrix multiplication with shared memory and matrix size which is non-multiple of the block size

Disclaimer: this content is taken from a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must do so under the same license and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/18815489/
Asked by Mike
I'm trying to familiarize myself with CUDA programming, and having a pretty fun time of it. I'm currently looking at this pdf, which deals with matrix multiplication done with and without shared memory. Full code for both versions can be found here. This code is almost exactly the same as what's in the CUDA matrix multiplication samples. Although the non-shared memory version has the capability to run at any matrix size, regardless of block size, the shared memory version must work with matrices that are a multiple of the block size (which I set to 4; the default was originally 16).
One of the problems suggested at the end of the pdf is to change it so that the shared memory version can also work with non-multiples of the block size. I thought this would be a simple index check, like in the non-shared version:
int row = blockIdx.y * blockDim.y + threadIdx.y;
int col = blockIdx.x * blockDim.x + threadIdx.x;
if(row > A.height || col > B.width) return;
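For reference, here is a minimal sketch of the non-shared-memory kernel in which this kind of guard does work, paraphrased from the CUDA programming guide sample (the kernel name and the >= form of the guard are illustrative, not copied from the linked code):

__global__ void SimpleMatMulKernel(Matrix A, Matrix B, Matrix C) {
    // Each thread computes one element of C independently, so threads
    // that map outside C can simply exit before touching any memory
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= A.height || col >= B.width) return;

    float Cvalue = 0.0f;
    for (int e = 0; e < A.width; ++e)
        Cvalue += A.elements[row * A.width + e] * B.elements[e * B.width + col];
    C.elements[row * C.width + col] = Cvalue;
}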
But this doesn't work. Here's the full code, minus the main method (a bit of a mess, sorry), which has been modified somewhat by me:
void MatMul(const Matrix A, const Matrix B, Matrix C) {
    // Load A and B to device memory
    Matrix d_A;
    d_A.width = d_A.stride = A.width;
    d_A.height = A.height;
    size_t size = A.width * A.height * sizeof(float);
    cudaError_t err = cudaMalloc(&d_A.elements, size);
    printf("CUDA malloc A: %s\n", cudaGetErrorString(err));
    err = cudaMemcpy(d_A.elements, A.elements, size, cudaMemcpyHostToDevice);
    printf("Copy A to device: %s\n", cudaGetErrorString(err));

    Matrix d_B;
    d_B.width = d_B.stride = B.width;
    d_B.height = B.height;
    size = B.width * B.height * sizeof(float);
    err = cudaMalloc(&d_B.elements, size);
    printf("CUDA malloc B: %s\n", cudaGetErrorString(err));
    err = cudaMemcpy(d_B.elements, B.elements, size, cudaMemcpyHostToDevice);
    printf("Copy B to device: %s\n", cudaGetErrorString(err));

    Matrix d_C;
    d_C.width = d_C.stride = C.width;
    d_C.height = C.height;
    size = C.width * C.height * sizeof(float);
    err = cudaMalloc(&d_C.elements, size);
    printf("CUDA malloc C: %s\n", cudaGetErrorString(err));

    dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
    dim3 dimGrid((B.width + dimBlock.x - 1) / dimBlock.x, (A.height + dimBlock.y - 1) / dimBlock.y);
    MatMulKernel<<<dimGrid, dimBlock>>>(d_A, d_B, d_C);
    err = cudaThreadSynchronize();
    printf("Run kernel: %s\n", cudaGetErrorString(err));

    // Read C from device memory
    err = cudaMemcpy(C.elements, d_C.elements, size, cudaMemcpyDeviceToHost);
    printf("Copy C off of device: %s\n", cudaGetErrorString(err));

    // Free device memory
    cudaFree(d_A.elements);
    cudaFree(d_B.elements);
    cudaFree(d_C.elements);
}
// Get a matrix element
__device__ float GetElement(const Matrix A, int row, int col) {
    return A.elements[row * A.stride + col];
}

// Set a matrix element
__device__ void SetElement(Matrix A, int row, int col, float value) {
    A.elements[row * A.stride + col] = value;
}

// Get the BLOCK_SIZE x BLOCK_SIZE sub-matrix Asub of A that is
// located col sub-matrices to the right and row sub-matrices down
// from the upper-left corner of A
__device__ Matrix GetSubMatrix(Matrix A, int row, int col) {
    Matrix Asub;
    Asub.width = BLOCK_SIZE;
    Asub.height = BLOCK_SIZE;
    Asub.stride = A.stride;
    Asub.elements = &A.elements[A.stride * BLOCK_SIZE * row + BLOCK_SIZE * col];
    return Asub;
}
// Matrix multiplication kernel called by MatMul()
__global__ void MatMulKernel(Matrix A, Matrix B, Matrix C) {
    // Block row and column
    int blockRow = blockIdx.y;
    int blockCol = blockIdx.x;

    int rowTest = blockIdx.y * blockDim.y + threadIdx.y;
    int colTest = blockIdx.x * blockDim.x + threadIdx.x;
    if (rowTest > A.height || colTest > B.width)
        return;

    // Each thread block computes one sub-matrix Csub of C
    Matrix Csub = GetSubMatrix(C, blockRow, blockCol);

    // Each thread computes one element of Csub
    // by accumulating results into Cvalue
    float Cvalue = 0.0;

    // Thread row and column within Csub
    int row = threadIdx.y;
    int col = threadIdx.x;

    // Loop over all the sub-matrices of A and B that are
    // required to compute Csub.
    // Multiply each pair of sub-matrices together
    // and accumulate the results
    for (int m = 0; m < (BLOCK_SIZE + A.width - 1) / BLOCK_SIZE; ++m) {
        // Get sub-matrix Asub of A
        Matrix Asub = GetSubMatrix(A, blockRow, m);
        // Get sub-matrix Bsub of B
        Matrix Bsub = GetSubMatrix(B, m, blockCol);

        // Shared memory used to store Asub and Bsub respectively
        __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
        __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

        // Load Asub and Bsub from device memory to shared memory
        // Each thread loads one element of each sub-matrix
        As[row][col] = GetElement(Asub, row, col);
        Bs[row][col] = GetElement(Bsub, row, col);

        // Synchronize to make sure the sub-matrices are loaded
        // before starting the computation
        __syncthreads();

        // Multiply Asub and Bsub together
        for (int e = 0; e < BLOCK_SIZE; ++e) {
            Cvalue += As[row][e] * Bs[e][col];
        }

        // Synchronize to make sure that the preceding
        // computation is done before loading two new
        // sub-matrices of A and B in the next iteration
        __syncthreads();
    }

    // Write Csub to device memory
    // Each thread writes one element
    SetElement(Csub, row, col, Cvalue);
}
Notable things which I changed: I added a check in MatMulKernel that checks whether the current thread is trying to work on a spot in C that doesn't exist. This doesn't seem to work. Although it does change the result, the changes don't seem to have any pattern other than that later (higher x or y value) entries seem to be more affected (and I get a lot more non-integer results). I also changed the given dimGrid calculation method and the loop condition for m in MatMulKernel (before, it was just width or height divided by block size, which seemed wrong).
Even the solutions guide that I found for this guide seems to suggest it should just be a simple index check, so I think I'm missing something really fundamental.
Answered by JackOLantern
When the matrix dimensions are not multiples of the tile dimensions, it can happen that some tiles cover the matrices only partially. The tile elements falling outside the matrices should be properly zeroed. So, extending your code to arbitrarily sized matrices is easy, but it does not amount to a simple index check. Below, I'm copying and pasting my version of the tiled matrix-matrix multiplication kernel for arbitrarily sized matrices.
__global__ void MatMul(float* A, float* B, float* C, int ARows, int ACols, int BRows,
                       int BCols, int CRows, int CCols)
{
    float CValue = 0;

    int Row = blockIdx.y * TILE_DIM + threadIdx.y;
    int Col = blockIdx.x * TILE_DIM + threadIdx.x;

    __shared__ float As[TILE_DIM][TILE_DIM];
    __shared__ float Bs[TILE_DIM][TILE_DIM];

    // Note: no early return. Every thread of the block must take part in
    // loading the tiles and must reach the __syncthreads() calls, even if
    // it does not map to a valid element of C.
    // (TILE_DIM + ACols - 1)/TILE_DIM is the ceiling of ACols/TILE_DIM.
    for (int k = 0; k < (TILE_DIM + ACols - 1) / TILE_DIM; k++) {

        // Load one element of the current tile of A, zeroing the
        // positions of the tile that fall outside the matrix
        if (k * TILE_DIM + threadIdx.x < ACols && Row < ARows)
            As[threadIdx.y][threadIdx.x] = A[Row * ACols + k * TILE_DIM + threadIdx.x];
        else
            As[threadIdx.y][threadIdx.x] = 0.0;

        // Same for the current tile of B
        if (k * TILE_DIM + threadIdx.y < BRows && Col < BCols)
            Bs[threadIdx.y][threadIdx.x] = B[(k * TILE_DIM + threadIdx.y) * BCols + Col];
        else
            Bs[threadIdx.y][threadIdx.x] = 0.0;

        __syncthreads();

        for (int n = 0; n < TILE_DIM; ++n)
            CValue += As[threadIdx.y][n] * Bs[n][threadIdx.x];

        __syncthreads();
    }

    // Only threads that map to a valid element of C write a result
    if (Row < CRows && Col < CCols)
        C[((blockIdx.y * blockDim.y + threadIdx.y) * CCols) +
          (blockIdx.x * blockDim.x) + threadIdx.x] = CValue;
}
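
For completeness, here is a minimal host-side launch sketch for this kernel (not part of the original answer; LaunchMatMul is a hypothetical helper, assuming TILE_DIM is defined, e.g. as 16, and that d_A, d_B and d_C are device buffers holding row-major matrices):

// Hypothetical launch helper: A is ARows x ACols, B is ACols x BCols,
// so C is ARows x BCols
void LaunchMatMul(float* d_A, float* d_B, float* d_C,
                  int ARows, int ACols, int BCols) {
    dim3 dimBlock(TILE_DIM, TILE_DIM);
    // Ceiling division so that partially covered tiles also get a block
    dim3 dimGrid((BCols + TILE_DIM - 1) / TILE_DIM,
                 (ARows + TILE_DIM - 1) / TILE_DIM);
    MatMul<<<dimGrid, dimBlock>>>(d_A, d_B, d_C,
                                  ARows, ACols,   // A dimensions
                                  ACols, BCols,   // B dimensions
                                  ARows, BCols);  // C dimensions
    cudaDeviceSynchronize();  // wait for the kernel before using C
}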

