C语言 替换数组中元素的快速方法 - C

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/16231110/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-09-02 06:11:06  来源:igfitidea点击:

Fast way to replace elements in array - C

carraysperformance

提问by Axarydax

Let's say we have an array of ints like this:

假设我们有一个这样的整数数组:

const int size = 100000;
int array[size];
//set some items to 0 and other items to 1

I'd like to replace all items that have value of 1 with another value, for example 123456. This can be trivially implemented with:

我想用另一个值替换所有值为 1 的项目,例如 123456。这可以通过以下方式轻松实现:

for(int i = 0; i < size ; i++){
    if(array[i] != 0) 
        array[i] = 123456;
}

Out of curiosity, is there a faster way to do this, by some kind of x86 trickery, or is this the best code for the processor?

出于好奇,是否有更快的方法来做到这一点,通过某种 x86 技巧,或者这是处理器的最佳代码?

回答by Nicu Stiurca

For your specific case where you initially have 0 and 1, the following mightbe faster. You'll have to bench mark it. You probably can't do much better with plain C though; you may need to dive into assembly if you want to take advantage of "x86 trickery" that may exist.

对于您最初有 0 和 1 的特定情况,以下可能会更快。你必须对它进行基准测试。不过,使用普通的 C 可能不会做得更好;如果您想利用可能存在的“x86 技巧”,您可能需要深入研究汇编。

for(int i = 0; i < size ; i++){
  array[i] *= 123456;
}

EDIT:

编辑:

Benchmark code:

基准代码:

#include <time.h>
#include <stdlib.h>
#include <stdio.h>

size_t diff(struct timespec *start, struct timespec *end)
{
  return (end->tv_sec - start->tv_sec)*1000000000 + end->tv_nsec - start->tv_nsec;
}

int main(void)
{
  const size_t size = 1000000;
  int array[size];

  for(size_t i=0; i<size; ++i) {
    array[i] = rand() & 1;
  }

  struct timespec start, stop;

  clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start);
  for(size_t i=0; i<size; ++i) {
    array[i] *= 123456;
    //if(array[i]) array[i] = 123456;
  }
  clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &stop);

  printf("size: %zu\t nsec: %09zu\n", size, diff(&start, &stop));
}


my results:

我的结果:

Computer: quad core AMD Phenom @2.5GHz, Linux, GCC 4.7, compiled with

计算机:四核 AMD Phenom @2.5GHz,Linux,GCC 4.7,编译

$ gcc arr.c -std=gnu99 -lrt -O3 -march=native
  • ifversion: ~5-10ms
  • *=version: ~1.3ms
  • if版本:~5-10ms
  • *=版本:~1.3ms

回答by Some programmer dude

For a small array such as your it's no use trying to find another algorithm, and if the values are not in a specific pattern a simple loop is the only way to do it anyway.

对于像您这样的小数组,尝试找到另一种算法是没有用的,并且如果值不是特定模式,那么无论如何,简单的循环是唯一的方法。

However, if you have a very large array (we're talking several million entries), then you can split the work into threads. Each separate thread handles a smaller portion of the whole data set.

但是,如果您有一个非常大的数组(我们说的是几百万个条目),那么您可以将工作拆分为线程。每个单独的线程处理整个数据集的一小部分。

回答by gwiazdorrr

You might want to benchmark this as well:

您可能还想对此进行基准测试:

for(int i = 0; i < size ; i++){
  array[i] = (~(array[i]-1) & 123456);
}

I run it through same benchmark as SchighSchagh, with little or no difference on my set up. It may differ on yours, however.

我通过与 SchighSchagh 相同的基准运行它,在我的设置上几乎没有差异。但是,它可能因您而异。

EDIT: Stop the presses!

编辑:停止印刷机!

I just remembered that x86 can "unbranch" ternary operators if arguments between ":" are constants. Consider following code:

我只记得如果“:”之间的参数是常量,x86 可以“取消分支”三元运算符。考虑以下代码:

for(size_t i=0; i<size; ++i) {
    array[i] = array[i] ? 123456 : 0;
}

Looks almost like your original code, doesn't it? Well, disassembly shows that it has been compiled without any branches:

看起来很像你的原始代码,不是吗?嗯,反汇编显示已经编译完成,没有任何分支:

  for(size_t i=0; i<size; ++i) {
00E3104C  xor         eax,eax  
00E3104E  mov         edi,edi  
        array[i] = array[i] ? 123456 : 0;
00E31050  mov         edx,dword ptr [esi+eax*4]  
00E31053  neg         edx  
00E31055  sbb         edx,edx  
00E31057  and         edx,1E240h  
00E3105D  mov         dword ptr [esi+eax*4],edx  
00E31060  inc         eax  
00E31061  cmp         eax,5F5E100h  
00E31066  jb          wmain+50h (0E31050h)  
    }

Performance-wise it seems on par or little better than my original and SchighSchagh solution. It is more readable and more flexible, though. For instance, it can work with array[i] having values different than 0 and 1.

在性能方面,它似乎与我的原始和 SchighSchagh 解决方案相当或略胜一筹。不过,它更具可读性和灵活性。例如,它可以处理具有不同于 0 和 1 的值的 array[i]。

Bottom line, benchmark AND take a peek in the disassembly.

最重要的是,基准测试并查看反汇编。

回答by harold

The array is small enough that it fits in cache, so it should be worthwhile to use SIMD: (not tested)

该数组足够小,可以放入缓存中,因此使用 SIMD 应该是值得的:(未测试)

  mov ecx, size
  lea esi, [array + ecx * 4]
  neg ecx
  pxor xmm0, xmm0
  movdqa xmm1, [_vec4_123456]  ; value of { 123456, 123456, 123456, 123456 }
_replaceloop:
  movdqa xmm2, [esi + ecx * 4] ; assumes the array is 16 aligned, make that true
  add ecx, 4
  pcmpeqd xmm2, xmm0
  pandn xmm2, xmm1
  movdqa [esi + ecx * 4 - 16], xmm2
  jnz _replaceloop

Unrolling by 2 might help.

展开 2 可能会有所帮助。

If you have SSE4.1, you can use SchighSchagh's multiplication trick with pmulld.

如果您有 SSE4.1,则可以将 SchighSchagh 的乘法技巧与pmulld.

回答by Skizz

Here's some Win32 code to profile various versions of the algorithm (compiled using VS2010 Express using default release build):-

这里有一些 Win32 代码来分析算法的各种版本(使用 VS2010 Express 使用默认发布版本编译):-

#include <windows.h>
#include <stdlib.h>
#include <stdio.h>

const size_t
  size = 0x1D4C00;

_declspec(align(16)) int
  g_array [size];

_declspec(align(16)) int
  _vec4_123456 [] = { 123456, 123456, 123456, 123456 };

void Test (void (*fn) (size_t, int *), char *test)
{
  printf ("Executing test: %s\t", test);

  for(size_t i=0; i<size; ++i) {
    g_array[i] = rand() & 1;
  }

  LARGE_INTEGER
    start,
    end;

  QueryPerformanceCounter (&start);

  fn (size, g_array);

  QueryPerformanceCounter (&end);

  printf("size: %u\t count: %09u\n", size, (int) (end.QuadPart - start.QuadPart));
}

void Test1 (size_t size, int *array)
{
  for(size_t i=0; i<size; ++i) {
    array[i] *= 123456;
  }
}

void Test2 (size_t size, int *array)
{
  for(size_t i=0; i<size; ++i) {
    if(array[i]) array[i] = 123456;
  }
}

void Test3 (size_t array_size, int *array)
{
  __asm
  {
    mov edi,array
    mov ecx, array_size 
    lea esi, [edi + ecx * 4]
    neg ecx
    pxor xmm0, xmm0
    movdqa xmm1, [_vec4_123456]  ; value of { 123456, 123456, 123456, 123456 }
_replaceloop:
    movdqa xmm2, [esi + ecx * 4] ; assumes the array is 16 aligned, make that true
    add ecx, 4
    pcmpeqd xmm2, xmm0
    pandn xmm2, xmm1
    movdqa [esi + ecx * 4 - 16], xmm2
    jnz _replaceloop
  }
}

void Test4 (size_t array_size, int *array)
{
  array_size = array_size * 8 / 12;

  __asm
  {
        mov edi,array
        mov ecx,array_size
        lea esi,[edi+ecx*4]
                                      lea edi,[edi+ecx*4]
        neg ecx
                                      mov edx,[_vec4_123456]
        pxor xmm0,xmm0
        movdqa xmm1,[_vec4_123456]
replaceloop:
        movdqa xmm2,[esi+ecx*4]
                                      mov eax,[edi]
                                      mov ebx,[edi+4]
        movdqa xmm3,[esi+ecx*4+16]
                                      add edi,16
        add ecx,9
                                      imul eax,edx    
        pcmpeqd xmm2,xmm0
                                      imul ebx,edx
        pcmpeqd xmm3,xmm0
                                      mov [edi-16],eax
                                      mov [edi-12],ebx
        pandn xmm2,xmm1
                                      mov eax,[edi-8]
                                      mov ebx,[edi-4]
        pandn xmm3,xmm1
                                      imul eax,edx    
        movdqa [esi+ecx*4-36],xmm2
                                      imul ebx,edx
        movdqa [esi+ecx*4-20],xmm3
                                      mov [edi-8],eax
                                      mov [edi-4],ebx
        loop replaceloop
  }
}

int main()
{
    Test (Test1, "Test1 - mul");
    Test (Test2, "Test2 - branch");
    Test (Test3, "Test3 - simd");
    Test (Test4, "Test4 - simdv2");
}

It's got for tests: C using an if()..., C using a multiply, harold's simd version and my simd version.

它用于测试:C 使用if()...,C 使用乘法,harold 的 simd 版本和我的 simd 版本。

Running it many times (remember, when profiling you should average the results over several runs) there's little difference between all the versions except the branching one which is significantly slower.

多次运行它(记住,在分析时你应该对多次运行的结果进行平均)除了分支版本之外,所有版本之间几乎没有区别,它的速度要慢得多。

This is not very surprising as the algortihm is doing very little work for each memory item. What this means is that the real limiting factor is the bandwidth between the CPU and the memory, the CPU is constantly waiting for the memory to catch up, even with the cpu helping with prefetching the data (ia32's detect and prefetch data linearly).

这并不奇怪,因为算法为每个内存项做的工作很少。这意味着真正的限制因素是 CPU 和内存之间的带宽,CPU 一直在等待内存跟上,即使 CPU 帮助预取数据(ia32 的检测和预取数据是线性的)。

回答by Abhinav

This might prove faster.

这可能会证明更快。

for(int i = 0; i < size ; i++){
  array[i] = ((123456 << array[i]) - 123456);
}

EDIT: Changed bitwise operation to left shift.

编辑:将按位运算更改为左移。

回答by Sven

You could use another array or some other data structure to keep track of the indices of the elements you set to one and then only visit those elements. This will work best if there are only few elements that are set to one

您可以使用另一个数组或其他一些数据结构来跟踪您设置为 1 的元素的索引,然后只访问这些元素。如果只有少数元素设置为 1,这将最有效

回答by Mohanraj

one more way to speed up the array assignment you can use the c inline assembly. Like below,

可以使用 c 内联程序集加速数组分配的另一种方法。像下面,

#include<stdio.h>
#include<string.h>
#include<stdlib.h>

const int size = 100000; 
void main(void) {
  int array[size];
  int value = 1000;

  __asm__ __volatile__("cld\n\t"
          "rep\n\t"
          "stosl\n\t"
          :
          :"c"(size*4), "a"(value), "D"(array)
          :
         );

  printf("Array[0] : %d \n", array[0]);
}

This should be speed when we compared to plain c program to assign the array values. And also the stoslinstruction take 4 clock cycle.

当我们与普通的 c 程序相比分配数组值时,这应该是速度。并且stosl指令需要 4 个时钟周期。