Structure of Arrays vs Array of Structures in C++ CUDA
Disclaimer: this page is a translation of a popular StackOverflow question and answer thread, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverflow.
Original URL: http://stackoverflow.com/questions/17924705/
Structure of Arrays vs Array of Structures in CUDA
Asked by BugShotGG
From some comments that I have read here, it seems that for some reason a Structure of Arrays (SoA) is preferable to an Array of Structures (AoS) for parallel implementations like CUDA. If that is true, can anyone explain why?
Thanks in advance!
Answered by Paul R
Choice of AoS versus SoA for optimum performance usually depends on access pattern. This is not just limited to CUDA however - similar considerations apply for any architecture where performance can be significantly affected by memory access pattern, e.g. where you have caches or where performance is better with contiguous memory access (e.g. coalesced memory accesses in CUDA).
E.g. for RGB pixels versus separate RGB planes:
struct {
    uint8_t r, g, b;
} AoS[N];

struct {
    uint8_t r[N];
    uint8_t g[N];
    uint8_t b[N];
} SoA;
If you are going to be accessing the R/G/B components of each pixel concurrently then AoS usually makes sense, since the successive reads of R, G, B components will be contiguous and usually contained within the same cache line. For CUDA this also means memory read/write coalescing.
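For illustration, here is a minimal sketch (not from the original answer; the struct and kernel names are made up) of an AoS-style CUDA kernel in which each thread reads and writes all three components of its own pixel, so every thread touches three adjacent bytes:

#include <cstdint>

struct PixelAoS { uint8_t r, g, b; };   // hypothetical 3-byte AoS pixel

// Each thread brightens one pixel (no clamping, purely illustrative):
// r, g and b are adjacent in memory, so per-thread accesses stay contiguous.
__global__ void brightenAoS(PixelAoS *pixels, int n, uint8_t delta) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        pixels[i].r += delta;
        pixels[i].g += delta;
        pixels[i].b += delta;
    }
}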
However if you are going to process color planes separately then SoA might be preferred, e.g. if you want to scale all R values by some scale factor, then SoA means that all R components will be contiguous.
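By contrast, a minimal SoA-style sketch (again hypothetical, continuing the sketch above) of the plane-scaling case just mentioned; consecutive threads read consecutive elements of the r plane, and the g and b planes are never touched:

// Scale every value of the R plane by 'factor' (no clamping, purely illustrative).
__global__ void scaleRPlane(uint8_t *r, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        r[i] = (uint8_t)(r[i] * factor);
}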
One further consideration is padding/alignment. For the RGB example above each element in an AoS layout is aligned to a multiple of 3 bytes, which may not be convenient for CUDA, SIMD, et al - in some cases perhaps even requiring padding within the struct to make alignment more convenient (e.g. add a dummy uint8_t element to ensure 4 byte alignment). In the SoA case however the planes are byte aligned which can be more convenient for certain algorithms/architectures.
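A sketch of the padding idea (the names are assumptions): one dummy byte plus an alignment specifier makes each element exactly 4 bytes and 4-byte aligned, so an array of them can be read as 32-bit words:

struct alignas(4) PixelRGBx {
    uint8_t r, g, b;
    uint8_t pad;   // dummy element: sizeof(PixelRGBx) == 4 and each element is 4-byte aligned
};

In CUDA, the built-in uchar4 vector type plays essentially this role.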
For most image processing type applications the AoS scenario is much more common, but for other applications, or for specific image processing tasks this may not always be the case. When there is no obvious choice I would recommend AoS as the default choice.
See also this answer for a more general discussion of AoS vs SoA.
Answered by HymanOLantern
I just want to provide a simple example showing how a Struct of Arrays (SoA) performs better than an Array of Structs (AoS).
In the example, I'm considering three different versions of the same code:
- SoA (v1)
- Straight arrays (v2)
- AoS (v3)
In particular, version 2 considers the use of straight arrays. The timings of versions 2 and 3 are the same for this example and turn out to be better than those of version 1. I suspect that, in general, straight arrays could be preferable, although at the expense of readability, since, for example, loading through the uniform cache could be enabled via const __restrict__ in this case.
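As a side note on that const __restrict__ remark, here is a minimal sketch (hypothetical, not part of the benchmark below) of what a read-only-input variant of the straight-array kernel could look like: with separate input and output pointers, the inputs can be qualified const __restrict__, which makes them eligible for the read-only data cache on architectures that have one:

__global__ void AoSvsSoA_v2_readonly(const unsigned int * __restrict__ d_in1,
                                     const unsigned int * __restrict__ d_in2,
                                     unsigned int * __restrict__ d_out1,
                                     unsigned int * __restrict__ d_out2,
                                     const int N) {
    const int tid = threadIdx.x + blockIdx.x * blockDim.x;
    if (tid < N) {
        d_out1[tid] = d_in1[tid] + 10;
        d_out2[tid] = d_in2[tid] + 10;
    }
}

The full benchmark code from the original answer follows.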
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
#include <thrust\device_vector.h>
#include "Utilities.cuh"
#include "TimingGPU.cuh"
#define BLOCKSIZE 1024
/******************************************/
/* CELL STRUCT LEADING TO ARRAY OF STRUCT */
/******************************************/
struct cellAoS {
    unsigned int x1;
    unsigned int x2;
    unsigned int code;
    bool done;
};
/*******************************************/
/* CELL STRUCT LEADING TO STRUCT OF ARRAYS */
/*******************************************/
struct cellSoA {
    unsigned int *x1;
    unsigned int *x2;
    unsigned int *code;
    bool *done;
};
/*******************************************/
/* KERNEL MANIPULATING THE ARRAY OF STRUCT */
/*******************************************/
__global__ void AoSvsSoA_v1(cellAoS *d_cells, const int N) {
    const int tid = threadIdx.x + blockIdx.x * blockDim.x;
    if (tid < N) {
        // Load the whole struct, update two fields, store the whole struct back
        cellAoS tempCell = d_cells[tid];
        tempCell.x1 = tempCell.x1 + 10;
        tempCell.x2 = tempCell.x2 + 10;
        d_cells[tid] = tempCell;
    }
}
/******************************/
/* KERNEL MANIPULATING ARRAYS */
/******************************/
__global__ void AoSvsSoA_v2(unsigned int * __restrict__ d_x1, unsigned int * __restrict__ d_x2, const int N) {
    const int tid = threadIdx.x + blockIdx.x * blockDim.x;
    if (tid < N) {
        d_x1[tid] = d_x1[tid] + 10;
        d_x2[tid] = d_x2[tid] + 10;
    }
}
/********************************************/
/* KERNEL MANIPULATING THE STRUCT OF ARRAYS */
/********************************************/
__global__ void AoSvsSoA_v3(cellSoA cell, const int N) {
    const int tid = threadIdx.x + blockIdx.x * blockDim.x;
    if (tid < N) {
        cell.x1[tid] = cell.x1[tid] + 10;
        cell.x2[tid] = cell.x2[tid] + 10;
    }
}
/********/
/* MAIN */
/********/
int main() {

    const int N = 2048 * 2048 * 4;

    TimingGPU timerGPU;

    thrust::host_vector<cellAoS>        h_cells(N);
    thrust::device_vector<cellAoS>      d_cells(N);

    thrust::host_vector<unsigned int>   h_x1(N);
    thrust::host_vector<unsigned int>   h_x2(N);
    thrust::device_vector<unsigned int> d_x1(N);
    thrust::device_vector<unsigned int> d_x2(N);

    for (int k = 0; k < N; k++) {
        h_cells[k].x1 = k + 1;
        h_cells[k].x2 = k + 2;
        h_cells[k].code = k + 3;
        h_cells[k].done = true;
        h_x1[k] = k + 1;
        h_x2[k] = k + 2;
    }

    d_cells = h_cells;
    d_x1 = h_x1;
    d_x2 = h_x2;

    cellSoA cell;
    cell.x1 = thrust::raw_pointer_cast(d_x1.data());
    cell.x2 = thrust::raw_pointer_cast(d_x2.data());
    cell.code = NULL;
    cell.done = NULL;
    timerGPU.StartCounter();
    AoSvsSoA_v1<<<iDivUp(N, BLOCKSIZE), BLOCKSIZE>>>(thrust::raw_pointer_cast(d_cells.data()), N);
    gpuErrchk(cudaPeekAtLastError());
    gpuErrchk(cudaDeviceSynchronize());
    printf("Timing AoSvsSoA_v1 = %f\n", timerGPU.GetCounter());

    //timerGPU.StartCounter();
    //AoSvsSoA_v2<<<iDivUp(N, BLOCKSIZE), BLOCKSIZE>>>(thrust::raw_pointer_cast(d_x1.data()), thrust::raw_pointer_cast(d_x2.data()), N);
    //gpuErrchk(cudaPeekAtLastError());
    //gpuErrchk(cudaDeviceSynchronize());
    //printf("Timing AoSvsSoA_v2 = %f\n", timerGPU.GetCounter());

    timerGPU.StartCounter();
    AoSvsSoA_v3<<<iDivUp(N, BLOCKSIZE), BLOCKSIZE>>>(cell, N);
    gpuErrchk(cudaPeekAtLastError());
    gpuErrchk(cudaDeviceSynchronize());
    printf("Timing AoSvsSoA_v3 = %f\n", timerGPU.GetCounter());
    h_cells = d_cells;
    h_x1 = d_x1;
    h_x2 = d_x2;

    // --- Check results
    for (int k = 0; k < N; k++) {
        if (h_x1[k] != k + 11) {
            printf("h_x1[%i] = %i not equal to %i\n", k, h_x1[k], k + 11);
            break;
        }
        if (h_x2[k] != k + 12) {
            printf("h_x2[%i] = %i not equal to %i\n", k, h_x2[k], k + 12);
            break;
        }
        if (h_cells[k].x1 != k + 11) {
            printf("h_cells[%i].x1 = %i not equal to %i\n", k, h_cells[k].x1, k + 11);
            break;
        }
        if (h_cells[k].x2 != k + 12) {
            printf("h_cells[%i].x2 = %i not equal to %i\n", k, h_cells[k].x2, k + 12);
            break;
        }
    }
}
The following are the timings (runs performed on a GTX960):
Array of struct 9.1ms (v1 kernel)
Struct of arrays 3.3ms (v3 kernel)
Straight arrays 3.2ms (v2 kernel)
Answered by alexbuisson
SoA is effective for SIMD processing for several reasons, but basically because it's more efficient to load 4 consecutive floats into a register with something like:
alignas(16) float v[4] = {0};      // _mm_load_ps requires 16-byte aligned data
__m128 reg = _mm_load_ps(v);
than using:
struct vec { float x; float y; float z; float w; };
vec v = {0, 0, 0, 0};
and creating an __m128 value by accessing every member individually:
__m128 reg = _mm_set_ps(v.w, v.z, v.y, v.x);   // _mm_set_ps takes its arguments from the highest lane to the lowest
If your arrays are 16-byte aligned, data loads/stores are faster, and some operations can be performed directly on memory operands.
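As a small illustration of that last point (a sketch, not from the original answer; the array and function names are made up), with 16-byte aligned SoA data the compiler can use aligned loads/stores, and instructions such as addps can take the memory operand directly:

#include <immintrin.h>

alignas(16) float xs[1024];   // hypothetical 16-byte aligned x-coordinate plane

void translate_x(float dx) {
    __m128 vdx = _mm_set1_ps(dx);                    // broadcast dx into all four lanes
    for (int i = 0; i < 1024; i += 4) {
        __m128 vx = _mm_load_ps(&xs[i]);             // aligned 16-byte load
        _mm_store_ps(&xs[i], _mm_add_ps(vx, vdx));   // add and aligned store
    }
}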