C++: SSE instructions to add all elements of an array

Note: this page is derived from a popular Stack Overflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original Stack Overflow question: http://stackoverflow.com/questions/10930595/

Date: 2020-08-27 14:37:26  Source: igfitidea

SSE instructions to add all elements of an array

Tags: c++, arrays, sse, simd, sse2

Asked by geeta

I am new to SSE2 instructions. I have found an instruction, _mm_add_epi8, which can add two array elements. But I want an SSE instruction which can add all elements of an array.
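
For context, here is a minimal sketch of what _mm_add_epi8 actually computes. The helper name add_bytes_lanewise is hypothetical and not part of the original post. It adds two 16-byte blocks lane by lane, so each result byte depends only on the matching pair of input bytes, and each 8-bit sum wraps around modulo 256. That is why this instruction alone cannot produce the total of an array.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstdint>

// Hypothetical helper (not from the original post): adds two 16-byte blocks
// lane by lane and writes the 16 results to out.
void add_bytes_lanewise(const std::uint8_t* a, const std::uint8_t* b, std::uint8_t* out)
{
    // Unaligned loads, so the pointers need no particular alignment.
    __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a));
    __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b));

    // out[k] = (a[k] + b[k]) mod 256 for k = 0..15; lanes never mix.
    __m128i vr = _mm_add_epi8(va, vb);

    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), vr);
}
```

With a[k] = 200 and b[k] = 56, for instance, every output byte is 0 rather than 256, which illustrates why the values have to be widened before they are accumulated.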

I was trying to develop this concept using this code:

#include <iostream>
#include <conio.h>
#include <emmintrin.h>

void sse(unsigned char* a,unsigned char* b); 

void main()
{
    /*unsigned char *arr;
    arr=(unsigned char *)malloc(50);*/

    unsigned char arr[]={'a','b','c','d','e','f','i','j','k','l','m','n','o','p','q','r','a','b','c','d','e','f','i','j','k','l','m','n','o','p','q','r'};
    unsigned char *next_arr=arr+16;
    for(int i=0;i<16;i++)
          printf("%d,%c   ",next_arr[i],next_arr[i]);
    sse(arr,next_arr);

    getch();
}

void sse(unsigned char* a,unsigned char* b)
{
    __m128i* l = (__m128i*)a;
    __m128i* r = (__m128i*)b;
    __m128i result;

    result = _mm_add_epi8(*l, *r);

    unsigned char *p;
    p = (unsigned char *)&result;

    for(int i=0;i<16;i++)
        printf("%d ",p[i]);

    printf("\n");
    l = (__m128i*)p;
    r = (__m128i*)(p+8);
    result = _mm_add_epi8(*l, *r);
    p = (unsigned char *)&result;
    printf("%d ",p[0]);

    l = (__m128i*)p;
    r = (__m128i*)(p+4);
    result = _mm_add_epi8(*l, *r);
    p = (unsigned char *)&result;
    l = (__m128i*)p;
    r = (__m128i*)(p+2);
    result = _mm_add_epi8(*l, *r);
    p = (unsigned char *)&result;
    l = (__m128i*)p;
    r = (__m128i*)(p+1);
    result = _mm_add_epi8(*l, *r);
    p = (unsigned char *)&result;
    printf("result =%d ",p[0]);
}

So can anybody please tell me how to add all elements of an array using SSE2 instructions?

Any help will be appreciated.

Answered by Paul R

If you just want to sum all the elements of an array then you need to load the data, unpack it to a wider element size, and then sum the unpacked elements. Note that you can maintain multiple partial sums until after the loop and then just do one final sum of these partial sums. For example:

#include <emmintrin.h>  // SSE2 intrinsics
#include <stdint.h>

uint32_t sum_array(const uint8_t a[], int n)
{
    const __m128i vk0 = _mm_set1_epi8(0);       // constant vector of all 0s for use with _mm_unpacklo_epi8/_mm_unpackhi_epi8
    const __m128i vk1 = _mm_set1_epi16(1);      // constant vector of all 1s for use with _mm_madd_epi16
    __m128i vsum = _mm_set1_epi32(0);           // initialise vector of four partial 32 bit sums
    uint32_t sum;
    int i;

    for (i = 0; i < n; i += 16)
    {
        __m128i v = _mm_load_si128((const __m128i *)&a[i]); // load vector of 16 x 8 bit values (16 byte aligned)
        __m128i vl = _mm_unpacklo_epi8(v, vk0); // unpack to two vectors of 16 bit values
        __m128i vh = _mm_unpackhi_epi8(v, vk0);
        // widen the 16 bit values and accumulate them into the
        // 32 bit partial sum vector
        vsum = _mm_add_epi32(vsum, _mm_madd_epi16(vl, vk1));
        vsum = _mm_add_epi32(vsum, _mm_madd_epi16(vh, vk1));
    }
    // horizontal add of four 32 bit partial sums and return result
    vsum = _mm_add_epi32(vsum, _mm_srli_si128(vsum, 8));
    vsum = _mm_add_epi32(vsum, _mm_srli_si128(vsum, 4));
    sum = _mm_cvtsi128_si32(vsum);
    return sum;
}

Note that there is one non-obvious trick in the above code: rather than further unpacking each 16 bit vector to a pair of 32 bit vectors (requiring 4 unpack instructions) and then using four 32 bit adds (another 4 instructions), we use _mm_madd_epi16 (PMADDWD) with a multiplicand of 1, followed by _mm_add_epi32. This effectively gives us the unpacking for free, so we get the same result using 4 instructions instead of 8.
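
To make the trick concrete, here is a small sketch of the _mm_madd_epi16 step in isolation. The helper name pairwise_widen_sum is hypothetical and only for illustration. PMADDWD multiplies each signed 16 bit lane by the matching lane of the second operand and adds adjacent products into one 32 bit lane, so with a vector of 1s it reduces to "add neighbouring 16 bit values and widen the result to 32 bits".

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstdint>

// Hypothetical helper (not from the original answer): reads eight 16 bit
// values and writes four 32 bit sums, out[k] = in[2k] + in[2k+1].
void pairwise_widen_sum(const std::int16_t* in, std::int32_t* out)
{
    const __m128i ones = _mm_set1_epi16(1);
    __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(in));

    // Each signed 16 bit lane is multiplied by 1, then adjacent products
    // are added into a single 32 bit lane.
    __m128i sums = _mm_madd_epi16(v, ones);

    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), sums);
}
```

For the input {255, 255, 1, 2, 3, 4, 0, 255} this yields {510, 3, 7, 255}. The lanes are treated as signed 16 bit values, which is harmless in the answer's code because the zero-extended bytes are always in the range 0 to 255.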

Note also that the input array, a[], needs to be 16 byte aligned, and n should be a multiple of 16.
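
If those two restrictions are inconvenient, one possible way around them is sketched below. This is an assumption-laden variant, not part of the original answer: the name sum_array_any is hypothetical, it swaps the aligned load for the unaligned _mm_loadu_si128, and it finishes any leftover bytes with a plain scalar loop. The widening and accumulation scheme is otherwise the same as above.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstddef>
#include <cstdint>

// Hypothetical variant (not from the original answer): accepts any pointer
// alignment and any length n.
std::uint32_t sum_array_any(const std::uint8_t* a, std::size_t n)
{
    const __m128i vk0 = _mm_setzero_si128();   // zeros for the byte unpacks
    const __m128i vk1 = _mm_set1_epi16(1);     // ones for _mm_madd_epi16
    __m128i vsum = _mm_setzero_si128();        // four 32 bit partial sums
    std::size_t i = 0;

    // Process full 16 byte blocks with unaligned loads.
    for (; i + 16 <= n; i += 16)
    {
        __m128i v  = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a + i));
        __m128i vl = _mm_unpacklo_epi8(v, vk0);  // low 8 bytes as 16 bit values
        __m128i vh = _mm_unpackhi_epi8(v, vk0);  // high 8 bytes as 16 bit values
        vsum = _mm_add_epi32(vsum, _mm_madd_epi16(vl, vk1));
        vsum = _mm_add_epi32(vsum, _mm_madd_epi16(vh, vk1));
    }

    // Horizontal add of the four 32 bit partial sums.
    vsum = _mm_add_epi32(vsum, _mm_srli_si128(vsum, 8));
    vsum = _mm_add_epi32(vsum, _mm_srli_si128(vsum, 4));
    std::uint32_t sum = static_cast<std::uint32_t>(_mm_cvtsi128_si32(vsum));

    // Scalar tail for the remaining n % 16 bytes.
    for (; i < n; ++i)
        sum += a[i];

    return sum;
}
```

On some CPUs unaligned loads are slower than aligned ones, so the aligned version may still be preferable when the caller controls how the buffer is allocated.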
