C语言找出最少 3 个数字的最快方法？

Question

提问by Sudhanshu

In a program I wrote, 20% of the time is being spent on finding out the minimum of 3 numbers in an inner loop, in this routine:

在我编写的一个程序中，在这个例程中，20% 的时间用于找出内部循环中 3 个数字的最小值：

static inline unsigned int
min(unsigned int a, unsigned int b, unsigned int c)
{
    unsigned int m = a;
    if (m > b) m = b;
    if (m > c) m = c;
    return m;
}

Is there any way to speed this up? I am ok with assembly code too for x86/x86_64.

有没有办法加快这个速度？我也可以使用 x86/x86_64 的汇编代码。

Edit: In reply to some of the comments:
* Compiler being used is gcc 4.3.3
* As far as assembly is concerned, I am only a beginner there. I asked for assembly here, to learn how to do this. :)
* I have a quad-core Intel 64 running, so MMX/SSE etc. are supported.
* It's hard to post the loop here, but I can tell you it's a heavily optimized implementation of the levenshtein algorithm.

编辑：回复一些评论：
* 正在使用的编译器是 gcc 4.3.3
* 就汇编而言，我只是那里的初学者。我要求在这里组装，以了解如何做到这一点。:)
* 我运行的是四核 Intel 64，因此支持 MMX/SSE 等。
* 很难在此处发布循环，但我可以告诉您，这是对 levenshtein 算法进行了大量优化的实现。

This is what the compiler is giving me for the non-inlined version of min:

这是编译器为 min 的非内联版本提供的内容：

.globl min
    .type   min, @function
min:
    pushl   %ebp
    movl    %esp, %ebp
    movl    8(%ebp), %edx
    movl    12(%ebp), %eax
    movl    16(%ebp), %ecx
    cmpl    %edx, %eax
    jbe .L2
    movl    %edx, %eax
.L2:
    cmpl    %ecx, %eax
    jbe .L3
    movl    %ecx, %eax
.L3:
    popl    %ebp
    ret
    .size   min, .-min
    .ident  "GCC: (Ubuntu 4.3.3-5ubuntu4) 4.3.3"
    .section    .note.GNU-stack,"",@progbits

The inlined version is within -O2 optimized code (even my markers mrk = 0xfefefefe, before and after the call to min()) are getting optimized away by gcc, so I couldn't get hold of it.

内联版本在 -O2 优化代码内（甚至我的标记 mrk = 0xfefefefe，在调用 min() 之前和之后）都被 gcc 优化掉了，所以我无法掌握它。

Update:I tested the changes suggested by Nils, ephemient, however there's no perceptible performance boost I get by using the assembly versions of min(). However, I get a 12.5% boost by compiling the program with -march=i686, which I guess is because the whole program is getting the benefits of the new faster instructions that gcc is generating with this option. Thanks for your help guys.

更新：我测试了 Nils 建议的更改，ephemient，但是通过使用 min() 的程序集版本我没有获得明显的性能提升。但是，通过使用 -march=i686 编译程序，我得到了 12.5% 的提升，我想这是因为整个程序都获得了 gcc 使用此选项生成的新更快指令的好处。谢谢你们的帮助。

P.S. - I used the ruby profiler to measure performance (my C program is a shared library loaded by a ruby program), so I could get time spent only for the top-level C function called by the ruby program, which ends up calling min() down the stack. Please see this question.

PS - 我使用了 ruby profiler 来测量性能（我的 C 程序是一个由 ruby 程序加载的共享库），所以我只能花在 ruby 程序调用的顶级 C 函数上的时间，最终调用 min () 下堆栈。请看这个问题。

Answer 1

采纳答案by bdonlan

Make sure you are using an appropriate -marchsetting, first off. GCC defaults to not using any instructions that were not supported on the original i386 - allowing it to use newer instruction sets can make a BIG difference at times! On -march=core2 -O2I get:

首先确保您使用了适当的-march设置。GCC 默认不使用原始 i386 不支持的任何指令 - 允许它使用更新的指令集有时会产生很大的不同！在-march=core2 -O2我得到：

min:
    pushl   %ebp
    movl    %esp, %ebp
    movl    8(%ebp), %edx
    movl    12(%ebp), %ecx
    movl    16(%ebp), %eax
    cmpl    %edx, %ecx
    leave
    cmovbe  %ecx, %edx
    cmpl    %eax, %edx
    cmovbe  %edx, %eax
    ret

The use of cmov here may help you avoid branch delays - and you get it without any inline asm just by passing in -march. When inlined into a larger function this is likely to be even more efficient, possibly just four assembly operations. If you need something faster than this, see if you can get the SSE vector operations to work in the context of your overall algorithm.

在这里使用 cmov 可以帮助您避免分支延迟 - 并且您只需传入-march. 当内联到一个更大的函数中时，这可能会更有效率，可能只有四个组装操作。如果你需要比这更快的东西，看看你是否可以让 SSE 向量操作在你的整体算法的上下文中工作。

Answer 2

回答by Stephen Canon

Assuming your compiler isn't out to lunch, this should compile down to two compares and two conditional moves. It isn't possible to do much better than that.

假设你的编译器没有出去吃午饭，这应该编译成两个比较和两个条件移动。不可能做得比这更好。

If you post the assembly that your compiler is actually generating, we can see if there's anything unnecessary that's slowing it down.

如果您发布编译器实际生成的程序集，我们可以查看是否有任何不必要的东西会减慢它的速度。

The number one thing to check is that the routine is actually getting inlined. The compiler isn't obligated to do so, and if it's generating a function call, that will be hugely expensive for such a simple operation.

要检查的第一件事是例程实际上是内联的。编译器没有义务这样做，如果它正在生成一个函数调用，那么对于这样一个简单的操作来说，这将是非常昂贵的。

If the call really is getting inlined, then loop unrolling may be beneficial, as DigitalRoss said, or vectorization may be possible.

如果调用真的被内联，那么循环展开可能是有益的，正如 DigitalRoss 所说，或者矢量化可能是可能的。

Edit:If you want to vectorize the code, and are using a recent x86 processor, you will want to use the SSE4.1 pminudinstruction (intrinsic: _mm_min_epu32), which takes two vectors of four unsigned ints each, and produces a vector of four unsigned ints. Each element of the result is the minimum of the corresponding elements in the two inputs.

编辑：如果您想对代码进行矢量化，并使用最新的 x86 处理器，您将需要使用 SSE4.1pminud指令（内部：）_mm_min_epu32，它需要两个向量，每个向量有四个无符号整数，并产生一个四个无符号整数的向量整数。结果的每个元素都是两个输入中对应元素的最小值。

I also note that your compiler used branches instead of conditional moves; you should probably try a version that uses conditional moves first and see if that gets you any speedup before you go off to the races on a vector implementation.

我还注意到你的编译器使用了分支而不是条件移动；您可能应该先尝试使用条件移动的版本，看看在您开始进行矢量实现的竞赛之前是否可以提高速度。

Answer 3

回答by Nils Pipenbrinck

My take on an x86 assembler implementation, GCC syntax. Should be trivial to translate to another inline assembler syntax:

我对 x86 汇编器实现的看法，GCC 语法。转换为另一种内联汇编器语法应该很简单：

int inline least (int a, int b, int c)
{
  int result;
  __asm__ ("mov     %1, %0\n\t"
           "cmp     %0, %2\n\t" 
           "cmovle  %2, %0\n\t"
           "cmp     %0, %3\n\t"
           "cmovle  %3, %0\n\t" 
          : "=r"(result) : 
            "r"(a), "r"(b), "r"(c)
          );
  return result;
}

New and improved version:

新的和改进的版本：

int inline least (int a, int b, int c)
{
  __asm__ (
           "cmp     %0, %1\n\t" 
           "cmovle  %1, %0\n\t"
           "cmp     %0, %2\n\t"
           "cmovle  %2, %0\n\t" 
          : "+r"(a) : 
            "%r"(b), "r"(c)
          );
  return a;
}

NOTE: It may or may not be faster than C code.

注意：它可能比 C 代码快，也可能不快。

This depends on a lot of factors. Usually cmov wins if the branches are not predictable (on some x86 architectures) OTOH inline assembler is always a problem for the optimizer, so the optimization penalty for the surrounding code may outweight all gains..

这取决于很多因素。如果分支不可预测（在某些 x86 架构上），通常 cmov 会获胜

Btw Sudhanshu, it would be interesting to hear how this code performs with your testdata.

顺便说一句，Sudhanshu，听到这段代码如何处理您的测试数据会很有趣。

Answer 4

回答by ephemient

This drop-in replacement clocks in about 1.5% faster on my AMD Phenom:

在我的 AMD Phenom 上，这种直接替换的时钟快了大约 1.5%：

static inline unsigned int
min(unsigned int a, unsigned int b, unsigned int c)
{
    asm("cmp   %1,%0\n"
        "cmova %1,%0\n"
        "cmp   %2,%0\n"
        "cmova %2,%0\n"
        : "+r" (a) : "r" (b), "r" (c));
    return a;
}

Results may vary; some x86 processors don't handle CMOV very well.

结果可能会有所不同；一些 x86 处理器不能很好地处理 CMOV。

Answer 5

回答by Mark Ransom

The SSE2 instruction extensions contain an integer mininstruction that can choose 8 minimums at a time. See _mm_mulhi_epu16in http://www.intel.com/software/products/compilers/clin/docs/ug_cpp/comm1046.htm

SSE2 指令扩展包含一个整数min指令，可以一次选择 8 个最小值。见_mm_mulhi_epu16在http://www.intel.com/software/products/compilers/clin/docs/ug_cpp/comm1046.htm

Answer 6

回答by Ken

First, look at the disassembly. That'll tell you a lot. For example, as written, there are 2 if-statements (which means there are 2 possible branch mispredictions), but my guess is that a decent modern C compiler will have some clever optimization that can do it without branching. I'd be curious to find out.

首先看拆解。那会告诉你很多。例如，正如所写的那样，有 2 个 if 语句（这意味着有 2 个可能的分支错误预测），但我的猜测是，一个体面的现代 C 编译器将有一些聪明的优化，可以在没有分支的情况下完成它。我很想知道。

Second, if your libc has special built-in min/max functions, use them. GNU libc has fmin/fmax for floating-point, for example, and they claim that "On some processors these functions can use special machine instructions to perform these operations faster than the equivalent C code". Maybe there's something similar for uints.

其次，如果您的 libc 具有特殊的内置最小/最大函数，请使用它们。例如，GNU libc 具有用于浮点数的 fmin/fmax，并且他们声称“在某些处理器上，这些函数可以使用特殊的机器指令比等效的 C 代码更快地执行这些操作”。也许 uint 也有类似的东西。

Finally, if you're doing this to a bunch of numbers in parallel, there are probably vector instructions to do this, which could provide significant speedup. But I've even seen non-vector code be faster when using vector units. Something like "load one uint into a vector register, call vector min function, get result out" looks dumb but might actually be faster.

最后，如果您对一堆数字并行执行此操作，则可能有向量指令可以执行此操作，这可以显着提高速度。但我什至看到使用向量单位时非向量代码更快。诸如“将一个 uint 加载到向量寄存器中，调用向量最小函数，得到结果”之类的东西看起来很愚蠢，但实际上可能更快。

Answer 7

回答by DigitalRoss

If you are only doing one comparison you might want to unroll the loop manually.

如果您只进行一项比较，您可能需要手动展开循环。

First, see if you can get the compiler to unroll the loop for you, and if you can't, do it yourself. This will at least reduce the loop control overhead...

首先，看看是否可以让编译器为您展开循环，如果不能，请自己动手。这至少会减少循环控制开销......

Answer 8

回答by Paul Sasik

You could try something like this to save on declaration and unnecessary comparisons:

你可以尝试这样的事情来节省声明和不必要的比较：

static inline unsigned int
min(unsigned int a, unsigned int b, unsigned int c)
{ 
    if (a < b)
    {
        if (a < c) 
             return a; 
        else 
             return c;
    }

    if (b < c)
        return b;
    else return c;
}

Answer 9

回答by Hamish Grubijan

Yes, post assembly, but my naive optimization is:

是的，组装后，但我天真的优化是：

static inline unsigned int
min(unsigned int a, unsigned int b, unsigned int c)
{
    unsigned int m = a;
    if (m > b) m = b;
    if (m > c) return c;
    return m;
}

Answer 10

回答by Mike Dunlavey

These are all good answers. At the risk of being accused of not answering the question, I would also look at the other 80% of the time. Stackshotsare my favorite way to find code worth optimizing, especially if it is function calls that you find out you don't absolutely need.

这些都是很好的答案。冒着被指责不回答问题的风险，我也会看其他 80% 的时间。Stackshots是我最喜欢的查找值得优化代码的方法，尤其是当您发现并不是绝对需要的函数调用时。

C语言找出最少 3 个数字的最快方法？

提问by Sudhanshu

采纳答案by bdonlan

回答by Stephen Canon

回答by Nils Pipenbrinck

回答by ephemient

回答by Mark Ransom

回答by Ken

回答by DigitalRoss

回答by Paul Sasik

回答by Hamish Grubijan

回答by Mike Dunlavey

相关推荐

最近更新

标签

C语言 找出最少 3 个数字的最快方法？

提问by Sudhanshu

采纳答案by bdonlan

回答by Stephen Canon

回答by Nils Pipenbrinck

回答by ephemient

回答by Mark Ransom

回答by Ken

回答by DigitalRoss

回答by Paul Sasik

回答by Hamish Grubijan

回答by Mike Dunlavey

相关推荐

C语言 如何解决致命错误 LNK1000：IncrBuildImage 期间的内部错误？

C语言 #define 的数组格式（C 预处理器）

C语言 C 中的 fgets() 函数

C语言 为什么需要左值作为增量操作数错误？

相关推荐

最近更新

标签

C语言找出最少 3 个数字的最快方法？

C语言如何解决致命错误 LNK1000：IncrBuildImage 期间的内部错误？

C语言为什么需要左值作为增量操作数错误？