Fastest way to get a positive modulo in C/C++

Question

Asked by Nathaniel

Often in my inner loops I need to index an array in a "wrap-around" way, so that (for example) if the array size is 100 and my code asks for element -2, it should be given element 98. In many high level languages such as Python, one can do this simply with my_array[index % array_size], but for some reason C's integer arithmetic (usually) rounds toward zero instead of consistently rounding down, and consequently its modulo operator returns a negative result when given a negative first argument.
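
To make the rounding behaviour concrete, here is a minimal sketch of what the built-in operators compute. The names DivResult and truncating_divide are hypothetical and only serve the illustration; the truncation rule itself is mandated since C99 and C++11 (older C left it implementation-defined).

```cpp
// Quotient and remainder exactly as the built-in operators produce them.
struct DivResult {
    int quotient;
    int remainder;
};

inline DivResult truncating_divide(int a, int b) {
    // C99 and C++11 require truncation toward zero, and guarantee
    // (a / b) * b + a % b == a whenever a / b is representable.
    // The remainder therefore takes the sign of the dividend a.
    return DivResult{a / b, a % b};
}
```

For -2 and 100 this gives quotient 0 and remainder -2, whereas a flooring language such as Python gives quotient -1 and remainder 98.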

Often I know that index will not be less than -array_size, and in these cases I just do my_array[(index + array_size) % array_size]. However, sometimes this can't be guaranteed, and for those cases I would like to know the fastest way to implement an always-positive modulo function. There are several "clever" ways to do it without branching, such as

inline int positive_modulo(int i, int n) {
    return (n + (i % n)) % n;
}

or

inline int positive_modulo(int i, int n) {
    return (i % n) + (n * (i < 0));
}

Of course I can profile these to find out which is the fastest on my system, but I can't help worrying that I might have missed a better one, or that what's fast on my machine might be slow on a different one.

So is there a standard way to do this, or some clever trick that I've missed that's likely to be the fastest possible way?

Also, I know it's probably wishful thinking, but if there's a way of doing this that can be auto-vectorised, that would be amazing.

Answer 1

Accepted answer by Jorge Bellon

Most of the time, compilers are very good at optimizing your code, so it is usually best to keep your code readable (for both compilers and other developers to know what you are doing).

Since your array size is always positive, I suggest you define the divisor as unsigned. The compiler will optimize small if/else blocks into conditional instructions, which have no branches:

unsigned modulo( int value, unsigned m) {
    int mod = value % (int)m;
    if (value < 0) {
        mod += m;
    }
    return mod;
}

This creates a very small function without branches:

modulo(int, unsigned int):
        mov     eax, edi
        cdq
        idiv    esi
        add     esi, edx
        mov     eax, edx
        test    edi, edi
        cmovs   eax, esi
        ret

For example, modulo(-5, 7) returns 2.
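
One edge case is worth checking by hand. When value is a negative multiple of m, the remainder is already 0, yet the test on value still triggers the addition, so the function as written returns m: modulo(-7, 7) yields 7 rather than 0, which is out of range as an array index. A hedged sketch that tests the remainder instead follows; modulo_checked is a hypothetical name, and m is assumed to fit in an int.

```cpp
// Variant that corrects only when the remainder itself is negative.
// Assumption: 0 < m <= INT_MAX, so the casts below do not change the value.
inline unsigned modulo_checked(int value, unsigned m) {
    int mod = value % static_cast<int>(m);
    if (mod < 0) {                      // a zero remainder is left untouched
        mod += static_cast<int>(m);
    }
    return static_cast<unsigned>(mod);
}
```

With this variant modulo_checked(-5, 7) is still 2, and modulo_checked(-7, 7) is 0.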

Unfortunately, since the divisor is not known at compile time, the compiler must emit an integer division, which is slow compared to other integer operations. If you know the sizes of your arrays are powers of two, I recommend keeping these function definitions in a header, so that the compiler can optimize them into a more efficient function. Here is the function unsigned modulo256(int v) { return modulo(v,256); }:

modulo256(int):                          # @modulo256(int)
        mov     edx, edi
        sar     edx, 31
        shr     edx, 24
        lea     eax, [rdi+rdx]
        movzx   eax, al
        sub     eax, edx
        test    edi, edi
        lea     edx, [rax+256]
        cmovs   eax, edx
        ret

See assembly: https://gcc.godbolt.org/z/DG7jMw

See comparison with most voted answer: http://quick-bench.com/oJbVwLr9G5HJb0oRaYpQOCec4E4

Edit: it turns out Clang is able to generate a function without any conditional move instructions (which cost more than regular arithmetic operations). This difference is completely negligible in the general case because the integer division takes around 70% of the total time.

Basically, Clang shifts value right to extend its sign bit across the whole width of m (that is, 0xffffffff when value is negative and 0 otherwise), and the result is used to mask the second operand in mod + m.

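#include <limits>

unsigned modulo (int value, unsigned m) {
    int mod = value % (int)m;
    // digits is the number of value bits in int (31 for a 32-bit int), so the
    // shift yields all ones when value is negative and 0 otherwise
    m &= value >> std::numeric_limits<int>::digits;
    return mod + m;
}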
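
The masking step can be spelled out as a small sketch. The helper names sign_mask and masked_addend are hypothetical, and right-shifting a negative int is implementation-defined before C++20, so the arithmetic shift is an assumption here rather than a guarantee.

```cpp
#include <limits>

// Broadcasts the sign bit: -1 (all bits set) for negative x, 0 otherwise.
// Assumes the compiler implements >> on negative ints as an arithmetic shift.
inline int sign_mask(int x) {
    return x >> std::numeric_limits<int>::digits;
}

// The addend used in "mod + m": m when value is negative, 0 otherwise.
inline unsigned masked_addend(int value, unsigned m) {
    return m & static_cast<unsigned>(sign_mask(value));
}
```

Deriving the mask from the remainder instead of from value would keep negative multiples of m at 0.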

Answer 2

Answered by Martin B

The standard way I learned is

inline int positive_modulo(int i, int n) {
    return (i % n + n) % n;
}

This function is essentially your first variant without the abs (which, in fact, makes it return the wrong result). I wouldn't be surprised if an optimizing compiler could recognize this pattern and compile it to machine code that computes an "unsigned modulo".

Edit:

Moving on to your second variant: first of all, it contains a bug, too: the n < 0 should be i < 0.

This variant may not look as if it branches, but on a lot of architectures, the i < 0 will compile into a conditional jump. In any case, it will be at least as fast to replace (n * (i < 0)) with i < 0 ? n : 0, which avoids the multiplication; in addition, it's "cleaner" because it avoids reinterpreting the bool as an int.

As to which of these two variants is faster, that probably depends on the compiler and processor architecture -- time the two variants and see. I don't think there's a faster way than either of these two variants, though.

Answer 3

Answered by nneonneo

Modulo a power of two, the following works (assuming two's complement representation):

return i & (n-1);
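
As a minimal sketch of how that one-liner might be wrapped for array indexing: the names is_power_of_two and wrap_pow2 are hypothetical, and the assertion is only there to document the precondition on n.

```cpp
#include <cassert>

// True when n is a positive power of two (exactly one bit set).
inline bool is_power_of_two(int n) {
    return n > 0 && (n & (n - 1)) == 0;
}

// Wraps i into [0, n) with a single AND. Only valid when n is a power of
// two and int uses two's complement (guaranteed from C++20 onward).
inline int wrap_pow2(int i, int n) {
    assert(is_power_of_two(n));
    return i & (n - 1);
}
```

For example, wrap_pow2(-2, 128) gives 126, and wrap_pow2(130, 128) gives 2.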

Answer 4

Answered by jthill

An old-school way to get the optional addend using two's-complement sign-bit propagation:

#include <limits.h>  /* CHAR_BIT */

int positive_mod(int i, int n)
{
    /* constexpr */ int shift = CHAR_BIT*sizeof i - 1;
    int m = i%n;
    return m+ (m>>shift & n);  /* m>>shift is all ones when m < 0, else 0 */
}

Answer 5

Answered by Kyle Butt

If you want to avoid all conditional paths, including the conditional move generated above (for example, if you need this code to vectorize or to run in constant time), you can use the sign bit as a mask:

unsigned modulo(int value, unsigned m) {
  int shift_width = sizeof(int) * 8 - 1;
  int mod = (value % (int) m);
  mod += ((value >> shift_width) & m);
  return mod;
}

You can check on godbolt that this generates the same number of instructions, but they are interleaved (better pipelining).

Here are the quickbench results. You can see that on gcc it's equal or better in every case. For clang it's the same speed in the generic case, because clang generates the branch-free code in the generic case. The technique is useful regardless, because the compiler can't always be relied on to produce this particular optimization, and you may have to roll it by hand for vector code.

Answer 6

Answered by SkYWAGz

Your second example is better than the first. A multiplication is a more complex operation than an if/else operation, so use this:

inline int positive_modulo(int i, int n) {
    int tmp = i % n;
    return tmp ? i >= 0 ? tmp : tmp + n : 0;
}

Answer 7

Answered by user15006

If you can afford to promote to a larger type (and do your modulo on the larger type), this code does a single modulo and no if:

int32_t positive_modulo(int32_t number, int32_t modulo) {
    return (number + ((int64_t)modulo << 32)) % modulo;
}
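
Why this works for modulo > 0: modulo * 2^32 is a multiple of modulo, so adding it does not change the residue, and it is larger than the magnitude of any 32-bit number, so the sum is never negative. The sketch below restates the same idea with the offset named explicitly and written as a multiplication rather than a shift, purely for readability; positive_modulo_wide is a hypothetical name.

```cpp
#include <cstdint>

// Single-remainder wrap using a 64-bit intermediate.
// Assumption: modulo > 0 (a negative modulo is not handled here).
inline std::int32_t positive_modulo_wide(std::int32_t number, std::int32_t modulo) {
    // A multiple of modulo that exceeds |number| for every 32-bit number,
    // and still fits in 64 bits because modulo < 2^31.
    const std::int64_t offset =
        static_cast<std::int64_t>(modulo) * (INT64_C(1) << 32);
    return static_cast<std::int32_t>((number + offset) % modulo);
}
```

For example, positive_modulo_wide(-2, 100) gives 98.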

Answer 8

Answered by chux - Reinstate Monica

Fastest way to get a positive modulo in C/C++

Is the following fast? Maybe not as fast as others, yet it is simple and functionally correct for all¹ a, b, unlike the others.

int modulo_Euclidean(int a, int b) {
  int m = a % b;
  if (m < 0) {
    // m += (b < 0) ? -b : b; // avoid this form: it is UB when b == INT_MIN
    m = (b < 0) ? m - b : m + b;
  }
  return m;
}
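
One way to sanity-check a candidate against the Euclidean definition is to verify the two defining properties directly: 0 <= r < |b|, and a - r divisible by b. The sketch below repeats the function so that it compiles on its own; is_euclidean_remainder is a hypothetical helper, and the arithmetic is widened to long long so the check itself cannot overflow.

```cpp
#include <cstdlib>

int modulo_Euclidean(int a, int b) {
    int m = a % b;
    if (m < 0) {
        m = (b < 0) ? m - b : m + b;
    }
    return m;
}

// True when r satisfies 0 <= r < |b| and b divides (a - r).
bool is_euclidean_remainder(int a, int b, int r) {
    const long long diff = static_cast<long long>(a) - r;
    const long long abs_b = std::llabs(static_cast<long long>(b));
    return r >= 0 && r < abs_b && diff % b == 0;
}
```

For instance, modulo_Euclidean(-2, -3) gives 1, which passes the check, while a result of -5 for the same inputs does not.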

Various other answers have mod(a,b) weaknesses, especially when b < 0.

See Euclidean division for ideas about b < 0.

inline int positive_modulo(int i, int n) {
    return (i % n + n) % n;
}

Fails when i % n + n overflows (think large i, n): undefined behavior.

return i & (n-1);

Relies on n being a power of two. (Fair: the answer does mention this.)

int positive_mod(int i, int n)
{
    /* constexpr */ int shift = CHAR_BIT*sizeof i - 1;
    int m = i%n;
    return m+ (m>>shift & n);
}

Often fails when n < 0, e.g. positive_mod(-2,-3) --> -5

int32_t positive_modulo(int32_t number, int32_t modulo) {
    return (number + ((int64_t)modulo << 32)) % modulo;
}

Requires using two integer widths. (Fair: the answer does mention this.)
Fails with modulo < 0: positive_modulo(2, -3) --> -1.

inline int positive_modulo(int i, int n) {
    int tmp = i % n;
    return tmp ? i >= 0 ? tmp : tmp + n : 0;
}

Often fails when n < 0, e.g. positive_modulo(-2,-3) --> -5

¹Exceptions: In C, a%b is not defined when a/b overflows, as in a/0 or INT_MIN/-1.

Answer 9

Answered by Aki Suihkonen

You can also do array[(i+array_size*N) % array_size], where N is an integer large enough to guarantee a non-negative argument but small enough not to overflow.
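
A minimal sketch of that idea, assuming the caller can bound how far below zero the index may go. The constant kMaxWraps and the function name wrap_with_offset are hypothetical, and the value 16 is chosen only for illustration.

```cpp
// Assumed bound: i >= -kMaxWraps * array_size, and
// i + kMaxWraps * array_size must still fit in an int.
constexpr int kMaxWraps = 16;

inline int wrap_with_offset(int i, int array_size) {
    // The offset is a multiple of array_size, so it does not change the
    // residue; it only shifts the dividend into the non-negative range.
    return (i + array_size * kMaxWraps) % array_size;
}
```

With array_size 100, indices down to -1600 wrap correctly; anything lower would need a larger kMaxWraps.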

When array_size is constant, there are techniques to calculate the modulus without division. Besides the power-of-two approach, one can calculate a weighted sum of bit groups, each multiplied by 2^i % n, where i is the position of the least significant bit in that group:

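For example, for a 32-bit integer, 0xaabbccdd % 100 is congruent modulo 100 to dd + cc*[2]56 + bb*[655]36 + aa*[167772]16, where the bracketed digits are the part of each weight (256, 65536, 16777216) that vanishes modulo 100. The sum is at most (1+56+36+16)*255 = 27795. With repeated application and different subdivisions, one can reduce the operation to a few conditional subtractions.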
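
The first reduction step can be written out as a sketch. It performs only the single fold described above and deliberately stops there, so the result is congruent to x modulo 100 but not yet reduced below 100; fold_bytes_mod100 is a hypothetical name.

```cpp
#include <cstdint>

// One folding step for n = 100: each byte is weighted by 256^k % 100,
// i.e. 1, 56, 36 and 16. The result is congruent to x modulo 100 and
// never exceeds (1 + 56 + 36 + 16) * 255 = 27795.
inline std::uint32_t fold_bytes_mod100(std::uint32_t x) {
    const std::uint32_t dd = x & 0xffu;          // bits 0..7
    const std::uint32_t cc = (x >> 8) & 0xffu;   // bits 8..15
    const std::uint32_t bb = (x >> 16) & 0xffu;  // bits 16..23
    const std::uint32_t aa = x >> 24;            // bits 24..31
    return dd + cc * 56u + bb * 36u + aa * 16u;
}
```

Further folds or conditional subtractions would then bring the value into the range 0 to 99.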

Common practice also includes approximating the division by multiplying with a scaled reciprocal such as 2^32 / n, which can usually handle a reasonably large range of arguments.

 i - ((i * 655)>>16)*100; // (gives 100*n % 100 == 100 requiring adjusting...)
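
A sketch of that expression with the adjustment filled in, restricted to 16-bit inputs so the bounds are easy to verify. Here 655 is floor(2^16 / 100), which slightly underestimates 1/100, so the approximate quotient is either exact or one too small and a single conditional subtraction suffices. The name mod100_u16 and the 16-bit restriction are assumptions made for this illustration.

```cpp
#include <cstdint>

// i % 100 for 0 <= i <= 65535 using a multiply and shift instead of a division.
inline std::uint32_t mod100_u16(std::uint16_t i) {
    const std::uint32_t q = (static_cast<std::uint32_t>(i) * 655u) >> 16;  // ~ i / 100
    std::uint32_t r = i - q * 100u;  // in [0, 200) because q is at most one too small
    if (r >= 100u) {
        r -= 100u;                   // the adjustment mentioned in the comment above
    }
    return r;
}
```

For i = 100 the approximate quotient is 0, so the raw remainder is 100 and the adjustment brings it to 0.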
