Original question: http://stackoverflow.com/questions/3212649/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me):
StackOverFlow
How to fill memory fast with an `int32_t` value?
Asked by Kirill V. Lyadvinsky
Is there a function (SSEx intrinsics are OK) which will fill memory with a specified int32_t value? For instance, when this value is equal to 0xAABBCC00, the resulting memory should look like:
AABBCC00AABBCC00AABBCC00AABBCC00AABBCC00
AABBCC00AABBCC00AABBCC00AABBCC00AABBCC00
AABBCC00AABBCC00AABBCC00AABBCC00AABBCC00
AABBCC00AABBCC00AABBCC00AABBCC00AABBCC00
...
I could use std::fill or a simple for-loop, but neither is fast enough.
The vector is resized only once, at the beginning of the program, so that is not an issue. The bottleneck is filling the memory.
Simplified code:
struct X
{
  typedef std::vector<int32_t> int_vec_t;
  int_vec_t buffer;
  X() : buffer( 5000000 ) { /* some more action */ }
  ~X() { /* some code here */ }
  // the following function is called 25 times per second
  const int_vec_t& process( int32_t background, const SOME_DATA& data );
};
const X::int_vec_t& X::process( int32_t background, const SOME_DATA& data )
{
    // the following line takes 30% of the total time of X::process
    std::fill( buffer.begin(), buffer.end(), background );
    // some processing
    // ...
    return buffer;
}
Accepted answer by Kirill V. Lyadvinsky
Thanks to everyone for your answers. I've checked wj32's solution, but it takes about the same time as std::fill does. My current solution, which uses memcpy, works 4 times faster than std::fill (in Visual Studio 2008):
 // fill the first quarter by the usual way
 std::fill(buffer.begin(), buffer.begin() + buffer.size()/4, background);
 // copy the first quarter to the second (very fast)
 memcpy(&buffer[buffer.size()/4], &buffer[0], buffer.size()/4*sizeof(background));
 // copy the first half to the second (very fast)
 memcpy(&buffer[buffer.size()/2], &buffer[0], buffer.size()/2*sizeof(background));
In production code one needs to add a check that buffer.size() is divisible by 4, and add appropriate handling for the case where it is not.
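One way to handle sizes that are not divisible by 4 is to keep the same idea but let the copied region grow by doubling. The following is only a sketch under that assumption, not the code the author used: fill_by_doubling and the 1024-element seed are made-up choices, and whether it beats std::fill depends on the compiler and standard library, so it should be measured first.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper (not from the original answer): fills `buffer` with
// `value` by seeding a small prefix and then repeatedly doubling it.
void fill_by_doubling(std::vector<int32_t>& buffer, int32_t value)
{
    const std::size_t n = buffer.size();
    if (n == 0) return;

    // Seed a small prefix the usual way; 1024 elements is an arbitrary choice.
    std::size_t filled = std::min<std::size_t>(n, 1024);
    std::fill(buffer.begin(), buffer.begin() + filled, value);

    // Copy the filled prefix onto the region right after it. The source
    // [0, chunk) and the destination [filled, filled + chunk) never overlap
    // because chunk <= filled.
    while (filled < n) {
        const std::size_t chunk = std::min(filled, n - filled);
        std::memcpy(&buffer[filled], &buffer[0], chunk * sizeof(int32_t));
        filled += chunk;
    }
}
```

Because the last copy is clamped to the remaining length, this works for any buffer size, including sizes smaller than the seed.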
Answer by wj32
This is how I would do it (please excuse the Microsoft-ness of it):
VOID FillInt32(__out PLONG M, __in LONG Fill, __in ULONG Count)
{
    __m128i f;
    // Fix mis-alignment.
    if ((ULONG_PTR)M & 0xf)
    {
        switch ((ULONG_PTR)M & 0xf)
        {
            case 0x4: if (Count >= 1) { *M++ = Fill; Count--; }
            case 0x8: if (Count >= 1) { *M++ = Fill; Count--; }
            case 0xc: if (Count >= 1) { *M++ = Fill; Count--; }
        }
    }
    f.m128i_i32[0] = Fill;
    f.m128i_i32[1] = Fill;
    f.m128i_i32[2] = Fill;
    f.m128i_i32[3] = Fill;
    while (Count >= 4)
    {
        _mm_store_si128((__m128i *)M, f);
        M += 4;
        Count -= 4;
    }
    // Fill remaining LONGs.
    switch (Count & 0x3)
    {
        case 0x3: *M++ = Fill;
        case 0x2: *M++ = Fill;
        case 0x1: *M++ = Fill;
    }
}
Answer by Mark B
I have to ask: have you definitely profiled std::fill and shown it to be the performance bottleneck? I would guess it is implemented in a pretty efficient manner, such that the compiler can automatically generate the appropriate instructions (for example with -march on gcc).
If it is the bottleneck, you may still get more benefit from an algorithmic redesign (if possible) that avoids setting so much memory (apparently over and over), so that it no longer matters which fill mechanism you use.
Answer by wheaties
Have you considered using
vector<int32_t> myVector;
myVector.reserve( sizeIWant );
and then using std::fill? Or perhaps the constructor of std::vector that takes as arguments the number of items held and the value to initialize them to?
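As a small illustration of the constructor mentioned above, here is a sketch with made-up helper names (make_filled and refill are not from the answer). Note that reserve() only changes the capacity of a vector, not its size, so the count-and-value constructor or assign() is what actually produces filled elements.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: builds a vector holding `count` copies of `value`
// using the (count, value) constructor.
std::vector<int32_t> make_filled(std::size_t count, int32_t value)
{
    return std::vector<int32_t>(count, value);
}

// Hypothetical helper: overwrites every element of an existing vector with
// `value`, keeping its current size.
void refill(std::vector<int32_t>& buffer, int32_t value)
{
    buffer.assign(buffer.size(), value);
}
```

Whether assign() is any faster than std::fill is implementation-dependent, so this addresses convenience more than the speed question.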
Answer by UnixShadow
I'm not totally sure how to set 4 bytes in a row, but if you want to fill memory with just one byte over and over again, you can use memset.
void * memset ( void * ptr, int value, size_t num );
Fill block of memory
Sets the first num bytes of the block of memory pointed by ptr to the specified value (interpreted as an unsigned char).
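Since memset writes a single byte value, it can only produce an int32_t pattern whose four bytes are identical (for example 0, -1 or 0x7F7F7F7F); a value like 0xAABBCC00 is out of reach. A minimal sketch of that check follows, with made-up helper names (is_byte_pattern and try_memset_fill).

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: true when all four bytes of `value` are identical.
bool is_byte_pattern(int32_t value)
{
    const uint32_t u = static_cast<uint32_t>(value);
    const uint32_t low = u & 0xFFu;
    return u == low * 0x01010101u;
}

// Hypothetical helper: fills `count` elements with memset when the value is
// a repeated byte. Returns false and leaves dst untouched otherwise.
bool try_memset_fill(int32_t* dst, int32_t value, std::size_t count)
{
    if (!is_byte_pattern(value)) return false;
    const uint32_t low = static_cast<uint32_t>(value) & 0xFFu;
    std::memset(dst, static_cast<int>(low), count * sizeof(int32_t));
    return true;
}
```

A caller would fall back to std::fill or another method whenever try_memset_fill returns false.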
Answer by Viktor Sehr
Assuming your background parameter takes only a limited number of values (or even better, only one), maybe you should try allocating a static vector and simply using memcpy.
const int32_t sBackground = 1234;
// n is assumed to be the number of elements in X::buffer
static vector<int32_t> sInitalizedBuffer(n, sBackground);
const X::int_vec_t& X::process( const SOME_DATA& data )
{
    // copy the pre-filled static vector over the working buffer
    std::memcpy( &buffer[0], &sInitalizedBuffer[0], n * sizeof(sBackground) );
    // some processing
    // ...
    return buffer;
}
Answer by acui
VS2013 and VS2015 can optimize a plain for-loop into a rep stos instruction. It's the fastest way to fill a buffer. You can provide std::fill for your type like this:
namespace std {
    inline void fill(vector<int>::iterator first, vector<int>::iterator last, int value){
        for (size_t i = 0; i < last - first; i++)
            first[i] = value;
    }
}
BTW, for the compiler to do this optimization, the buffer must be accessed through the subscript operator.
This does not work with gcc and clang: both compile the code to a loop with a conditional jump, which runs as slowly as the original std::fill. And although wchar_t is 32-bit, wmemset does not have an assembly implementation like memset does. So you have to write assembly code to do the optimization.
Answer by 5ound
I just tested std::fill with g++ with full optimizations (SSE etc. enabled):
#include <algorithm>
#include <inttypes.h>
int32_t a[5000000];
int main(int argc,char *argv[])
{
    std::fill(a,a+5000000,0xAABBCC00);
    return a[3];
}
and the inner loop looked like:
L2:
    movdqa  %xmm0, -16(%eax)
    addl    $16, %eax
    cmpl    %edx, %eax
    jne L2
It looks like 0xAABBCC00 x 4 was loaded into xmm0 and is being stored 16 bytes at a time.
Answer by Jay
It might be a bit non-portable, but you could use an overlapping memory copy. Fill the first four bytes with the pattern you want and use memcpy().
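int32_t* p = (int32_t*) malloc( size );   // size is in bytes
*p = 1234;
// the destination starts one element (4 bytes) after the source, so the regions overlap;
// overlapping memcpy is undefined behavior in standard C and C++
memcpy( p + 1, p, size - 4 );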
I don't think you can get much faster than that.

