C++ SSE instruction to add all elements of an array

Question

Asked by geeta

I am new to SSE2 instructions. I have found an instruction, _mm_add_epi8, which can add the elements of two arrays. But I want an SSE instruction which can add all the elements of a single array.

I was trying to develop this concept using this code:

#include <iostream>
#include <cstdio>      // printf
#include <conio.h>     // getch (Windows-specific)
#include <emmintrin.h> // SSE2 intrinsics

void sse(unsigned char* a,unsigned char* b);

int main()
{
    /*unsigned char *arr;
    arr=(unsigned char *)malloc(50);*/

    unsigned char arr[]={'a','b','c','d','e','f','i','j','k','l','m','n','o','p','q','r','a','b','c','d','e','f','i','j','k','l','m','n','o','p','q','r'};
    unsigned char *next_arr=arr+16;
    for(int i=0;i<16;i++)
          printf("%d,%c   ",next_arr[i],next_arr[i]);
    sse(arr,next_arr);

    getch();
}

void sse(unsigned char* a,unsigned char* b)
{
    __m128i* l = (__m128i*)a;
    __m128i* r = (__m128i*)b;
    __m128i result;

    result = _mm_add_epi8(*l, *r);

    unsigned char *p;
    p = (unsigned char *)&result;

    for (int i = 0; i < 16; i++)
        printf("%d ", p[i]);

    printf("\n");
    l = (__m128i*)p;
    r = (__m128i*)(p+8);
    result = _mm_add_epi8(*l, *r);
    p = (unsigned char *)&result;
    printf("%d ", p[0]);

    l = (__m128i*)p;
    r = (__m128i*)(p+4);
    result = _mm_add_epi8(*l, *r);
    p = (unsigned char *)&result;
    l = (__m128i*)p;
    r = (__m128i*)(p+2);
    result = _mm_add_epi8(*l, *r);
    p = (unsigned char *)&result;
    l = (__m128i*)p;
    r = (__m128i*)(p+1);
    result = _mm_add_epi8(*l, *r);
    p = (unsigned char *)&result;
    printf("result =%d ", p[0]);
}

So can anybody please tell me how to add all the elements of an array using SSE2 instructions?

Any help will be appreciated.

Answer 1

Answered by Paul R

If you just want to sum all the elements of an array then you need to load the data, unpack it to a wider element size, and then sum the unpacked elements. Note that you can maintain multiple partial sums until after the loop and then just do one final sum of these partial sums. For example:

#include <emmintrin.h>  // SSE2 intrinsics
#include <stdint.h>

uint32_t sum_array(const uint8_t a[], int n)
{
    const __m128i vk0 = _mm_set1_epi8(0);       // constant vector of all 0s for use with _mm_unpacklo_epi8/_mm_unpackhi_epi8
    const __m128i vk1 = _mm_set1_epi16(1);      // constant vector of all 1s for use with _mm_madd_epi16
    __m128i vsum = _mm_set1_epi32(0);           // initialise vector of four partial 32 bit sums
    uint32_t sum;
    int i;

    for (i = 0; i < n; i += 16)
    {
        __m128i v = _mm_load_si128((const __m128i *)&a[i]); // load vector of 8 bit values (must be 16 byte aligned)
        __m128i vl = _mm_unpacklo_epi8(v, vk0); // unpack to two vectors of 16 bit values
        __m128i vh = _mm_unpackhi_epi8(v, vk0);
        // widen the 16 bit values and accumulate them into the
        // 32 bit partial sum vector
        vsum = _mm_add_epi32(vsum, _mm_madd_epi16(vl, vk1));
        vsum = _mm_add_epi32(vsum, _mm_madd_epi16(vh, vk1));
    }
    // horizontal add of four 32 bit partial sums and return result
    vsum = _mm_add_epi32(vsum, _mm_srli_si128(vsum, 8));
    vsum = _mm_add_epi32(vsum, _mm_srli_si128(vsum, 4));
    sum = _mm_cvtsi128_si32(vsum);
    return sum;
}

Note that there is one non-obvious trick in the above code: rather than further unpacking each 16 bit vector to a pair of 32 bit vectors (requiring 4 unpack instructions) and then using four 32 bit adds (another 4 instructions), we use _mm_madd_epi16 (PMADDWD) with a multiplicand of 1, followed by _mm_add_epi32, which effectively gives us free unpacking, so we get the same result using 4 instructions instead of 8.
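To make the PMADDWD trick concrete, here is a minimal sketch of it in isolation. The helper names pairwise_widen_epi16 and hsum_epi32 are hypothetical and chosen only for illustration. It assumes the 16 bit lanes hold small non-negative values (as they do after zero-extending bytes), since _mm_madd_epi16 treats its inputs as signed.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstdint>

// Hypothetical helper (not part of the answer's code): multiplies each
// 16 bit lane by 1 and adds adjacent pairs, so eight 16 bit lanes become
// four 32 bit lanes holding v[0]+v[1], v[2]+v[3], v[4]+v[5], v[6]+v[7].
static inline __m128i pairwise_widen_epi16(__m128i v16)
{
    const __m128i ones = _mm_set1_epi16(1);
    return _mm_madd_epi16(v16, ones);   // PMADDWD with a multiplicand of 1
}

// Hypothetical helper: horizontal add of four 32 bit lanes, using the same
// shift-and-add steps as the end of sum_array.
static inline uint32_t hsum_epi32(__m128i v)
{
    v = _mm_add_epi32(v, _mm_srli_si128(v, 8));  // lanes 0,1 += lanes 2,3
    v = _mm_add_epi32(v, _mm_srli_si128(v, 4));  // lane 0 += lane 1
    return static_cast<uint32_t>(_mm_cvtsi128_si32(v));
}
```

For instance, passing _mm_setr_epi16(1, 2, 3, 4, 5, 6, 7, 8) through pairwise_widen_epi16 should give the 32 bit lanes 3, 7, 11 and 15, which hsum_epi32 then reduces to a single total.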

Note also that the input array, a[], needs to be 16 byte aligned, and n should be a multiple of 16.
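If those two restrictions are inconvenient, one possible way to relax them is sketched below. This is a variant under stated assumptions, not the answer's own code: the name sum_array_any is hypothetical, the aligned load is swapped for the unaligned _mm_loadu_si128, and a plain scalar loop handles the leftover bytes when n is not a multiple of 16. Like the original, it accumulates in 32 bits, so it assumes the total fits in a uint32_t.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstddef>
#include <cstdint>

// Hypothetical variant of sum_array: accepts any alignment and any length.
uint32_t sum_array_any(const uint8_t *a, std::size_t n)
{
    const __m128i zero = _mm_setzero_si128();   // for zero-extending bytes
    const __m128i ones = _mm_set1_epi16(1);     // multiplicand for PMADDWD
    __m128i vsum = _mm_setzero_si128();         // four 32 bit partial sums
    std::size_t i = 0;

    // Main loop: full 16 byte blocks, loaded without an alignment requirement.
    for (; i + 16 <= n; i += 16)
    {
        __m128i v  = _mm_loadu_si128(reinterpret_cast<const __m128i *>(a + i));
        __m128i vl = _mm_unpacklo_epi8(v, zero);    // low 8 bytes -> 16 bit
        __m128i vh = _mm_unpackhi_epi8(v, zero);    // high 8 bytes -> 16 bit
        vsum = _mm_add_epi32(vsum, _mm_madd_epi16(vl, ones));
        vsum = _mm_add_epi32(vsum, _mm_madd_epi16(vh, ones));
    }

    // Horizontal add of the four 32 bit partial sums.
    vsum = _mm_add_epi32(vsum, _mm_srli_si128(vsum, 8));
    vsum = _mm_add_epi32(vsum, _mm_srli_si128(vsum, 4));
    uint32_t sum = static_cast<uint32_t>(_mm_cvtsi128_si32(vsum));

    // Scalar tail: the remaining n % 16 bytes.
    for (; i < n; ++i)
        sum += a[i];

    return sum;
}
```

Unaligned loads may be slower than aligned ones on older CPUs, so when the caller controls the allocation, keeping the buffer 16 byte aligned and using the aligned load remains a reasonable choice.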
