C++: position of the least significant bit that is set

Question

Asked by peterchen

I am looking for an efficient way to determine the position of the least significant bit that is set in an integer, e.g. for 0x0FF0 it would be 4.

A trivial implementation is this:

unsigned GetLowestBitPos(unsigned value)
{
   assert(value != 0); // handled separately

   unsigned pos = 0;
   while (!(value & 1))
   {
      value >>= 1;
      ++pos;
   }
   return pos;
}

Any ideas how to squeeze some cycles out of it?

(Note: this question is for people that enjoy such things, not for people to tell me xyz optimization is evil.)

[edit] Thanks everyone for the ideas! I've learnt a few other things, too. Cool!

Answer 1

Answered by Anton Tykhyy

Bit Twiddling Hacks offers an excellent collection of, er, bit twiddling hacks, with performance/optimisation discussion attached. My favourite solution for your problem (from that site) is "multiply and lookup":

unsigned int v;  // find the number of trailing zeros in 32-bit v 
int r;           // result goes here
static const int MultiplyDeBruijnBitPosition[32] = 
{
  0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8, 
  31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9
};
r = MultiplyDeBruijnBitPosition[((uint32_t)((v & -v) * 0x077CB531U)) >> 27];
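
For anyone who wants to drop the snippet above into a program, here is a minimal sketch that wraps it in a function. The function name lowest_bit_pos_debruijn is made up for illustration; the table and the multiplier are the ones quoted above, and the sketch assumes a 32-bit unsigned int.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical wrapper name; the table and constant are the ones quoted above.
unsigned lowest_bit_pos_debruijn(std::uint32_t v)
{
    static const int MultiplyDeBruijnBitPosition[32] = {
        0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8,
        31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9
    };
    assert(v != 0); // v == 0 would hit index 0 and wrongly report position 0

    // v & (0 - v) keeps only the lowest set bit, i.e. the value 2^k.
    std::uint32_t isolated = v & (0u - v);
    // Multiplying by 2^k shifts the de Bruijn constant left by k, so the top
    // five bits of the 32-bit product are different for every k in 0..31.
    std::uint32_t index = static_cast<std::uint32_t>(isolated * 0x077CB531U) >> 27;
    return static_cast<unsigned>(MultiplyDeBruijnBitPosition[index]);
}
```

With the question's example, lowest_bit_pos_debruijn(0x0FF0) gives 4. Zero has to be handled by the caller, as in the original question.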

Helpful references:

"Using de Bruijn Sequences to Index a 1 in a Computer Word" - Explanation about why the above code works.
"Board Representation > Bitboards > BitScan" - Detailed analysis of this problem, with a particular focus on chess programming

Answer 2

Answered by ephemient

Why not use the built-in ffs? (I grabbed a man page from Linux, but it's more widely available than that.)

ffs(3) - Linux man page
Name
ffs - find first bit set in a word
Synopsis
#include <strings.h>
int ffs(int i);
#define _GNU_SOURCE
#include <string.h>
int ffsl(long int i);
int ffsll(long long int i);
Description
The ffs() function returns the position of the first (least significant) bit set in the word i. The least significant bit is position 1 and the most significant position e.g. 32 or 64. The functions ffsll() and ffsl() do the same but take arguments of possibly different size.
Return Value
These functions return the position of the first bit set, or 0 if no bits are set in i.
Conforming to
4.3BSD, POSIX.1-2001.
Notes
BSD systems have a prototype in <string.h>.
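
Since ffs() counts bit positions from 1 and the question counts from 0, a thin adapter is needed. A minimal sketch, assuming a POSIX system where <strings.h> declares ffs(); the helper name lowest_bit_pos_ffs is hypothetical.

```cpp
#include <cassert>
#include <strings.h> // POSIX header that declares ffs(); assumed to be available

// Hypothetical helper: 0-based position of the lowest set bit via ffs().
unsigned lowest_bit_pos_ffs(unsigned value)
{
    assert(value != 0); // ffs() returns 0 when no bit is set
    // ffs() takes an int; the cast keeps the bit pattern on the usual
    // two's-complement platforms. ffs() numbers bits from 1, so subtract 1.
    return static_cast<unsigned>(ffs(static_cast<int>(value))) - 1u;
}
```

On the question's example this returns 4 for 0x0FF0, where ffs() itself reports 5.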

Answer 3

Answered by Mehrdad Afshari

There is an x86 assembly instruction (bsf) that will do it. :)

More optimized?!

Side Note:

Optimization at this level is inherently architecture dependent. Today's processors are so complex (in terms of branch prediction, cache misses, pipelining) that it is hard to predict which code executes faster on which architecture. Decreasing the number of operations from 32 to 9, or things like that, might even decrease performance on some architectures. Code optimized for one architecture might be worse on another. I think you should either optimize this for a specific CPU or leave it as it is and let the compiler choose what it thinks is better.

Answer 4

Answered by moonshadow

Most modern architectures will have some instruction for finding the position of the lowest set bit, or the highest set bit, or counting the number of leading zeroes etc.

If you have any one instruction of this class you can cheaply emulate the others.

Take a moment to work through it on paper and realise that x & (x-1) will clear the lowest set bit in x, and ( x & ~(x-1) ) will return just the lowest set bit, irrespective of architecture, word length etc. Knowing this, it is trivial to use hardware count-leading-zeroes / highest-set-bit to find the lowest set bit if there is no explicit instruction to do so.
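
The two identities, and the emulation built on them, can be sketched as follows. This is only an illustration: the function names are made up, __builtin_clz stands in for "hardware count-leading-zeroes" and is a GCC/Clang builtin, and a 32-bit unsigned int is assumed.

```cpp
#include <cassert>
#include <climits>

static_assert(sizeof(unsigned) * CHAR_BIT == 32, "sketch assumes 32-bit unsigned");

// Clears the lowest set bit: 0x0FF0 -> 0x0FE0.
unsigned clear_lowest_bit(unsigned x) { return x & (x - 1u); }

// Keeps only the lowest set bit: 0x0FF0 -> 0x0010.
unsigned isolate_lowest_bit(unsigned x) { return x & ~(x - 1u); }

// Emulates "position of lowest set bit" with a count-leading-zeroes primitive.
unsigned lowest_bit_pos_via_clz(unsigned x)
{
    assert(x != 0); // __builtin_clz(0) is undefined
    // After isolation exactly one bit is set, so its position is
    // 31 minus the number of zero bits above it.
    return 31u - static_cast<unsigned>(__builtin_clz(isolate_lowest_bit(x)));
}
```

For 0x0FF0 the isolated bit is 0x0010, which has 27 leading zeroes, giving position 4.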

If there is no relevant hardware support at all, the multiply-and-lookup implementation of count-leading-zeroes given here, or one of the ones on the Bit Twiddling Hacks page, can trivially be converted to give the lowest set bit using the above identities, and has the advantage of being branchless.

Answer 5

Answered by Andrew Bainbridge

Weee, loads of solutions and not a benchmark in sight. You people should be ashamed of yourselves ;-)

My machine is an Intel i530 (2.9 GHz), running Windows 7 64-bit. I compiled with a 32-bit version of MinGW.

$ gcc --version
gcc.exe (GCC) 4.7.2

$ gcc bench.c -o bench.exe -std=c99 -Wall -O2
$ bench
Naive loop.         Time = 2.91  (Original questioner)
De Bruijn multiply. Time = 1.16  (Tykhyy)
Lookup table.       Time = 0.36  (Andrew Grant)
FFS instruction.    Time = 0.90  (ephemient)
Branch free mask.   Time = 3.48  (Dan / Jim Balter)
Double hack.        Time = 3.41  (DocMax)

$ gcc bench.c -o bench.exe -std=c99 -Wall -O2 -march=native
$ bench
Naive loop.         Time = 2.92
De Bruijn multiply. Time = 0.47
Lookup table.       Time = 0.35
FFS instruction.    Time = 0.68
Branch free mask.   Time = 3.49
Double hack.        Time = 0.92

My code:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>


#define ARRAY_SIZE 65536
#define NUM_ITERS 5000  // Number of times to process array


int find_first_bits_naive_loop(unsigned nums[ARRAY_SIZE])
{
    int total = 0; // Prevent compiler from optimizing out the code
    for (int j = 0; j < NUM_ITERS; j++) {
        for (int i = 0; i < ARRAY_SIZE; i++) {
            unsigned value = nums[i];
            if (value == 0)
                continue;
            unsigned pos = 0;
            while (!(value & 1))
            {
                value >>= 1;
                ++pos;
            }
            total += pos + 1;
        }
    }

    return total;
}


int find_first_bits_de_bruijn(unsigned nums[ARRAY_SIZE])
{
    static const int MultiplyDeBruijnBitPosition[32] = 
    {
       1, 2, 29, 3, 30, 15, 25, 4, 31, 23, 21, 16, 26, 18, 5, 9, 
       32, 28, 14, 24, 22, 20, 17, 8, 27, 13, 19, 7, 12, 6, 11, 10
    };

    int total = 0; // Prevent compiler from optimizing out the code
    for (int j = 0; j < NUM_ITERS; j++) {
        for (int i = 0; i < ARRAY_SIZE; i++) {
            unsigned int c = nums[i];
            total += MultiplyDeBruijnBitPosition[((unsigned)((c & -c) * 0x077CB531U)) >> 27];
        }
    }

    return total;
}


unsigned char lowestBitTable[256];
int get_lowest_set_bit(unsigned num) {
    unsigned mask = 1;
    for (int cnt = 1; cnt <= 32; cnt++, mask <<= 1) {
        if (num & mask) {
            return cnt;
        }
    }

    return 0;
}
int find_first_bits_lookup_table(unsigned nums[ARRAY_SIZE])
{
    int total = 0; // Prevent compiler from optimizing out the code
    for (int j = 0; j < NUM_ITERS; j++) {
        for (int i = 0; i < ARRAY_SIZE; i++) {
            unsigned int value = nums[i];
            // note that order to check indices will depend whether you are on a big 
            // or little endian machine. This is for little-endian
            unsigned char *bytes = (unsigned char *)&value;
            if (bytes[0])
                total += lowestBitTable[bytes[0]];
            else if (bytes[1])
              total += lowestBitTable[bytes[1]] + 8;
            else if (bytes[2])
              total += lowestBitTable[bytes[2]] + 16;
            else
              total += lowestBitTable[bytes[3]] + 24;
        }
    }

    return total;
}


int find_first_bits_ffs_instruction(unsigned nums[ARRAY_SIZE])
{
    int total = 0; // Prevent compiler from optimizing out the code
    for (int j = 0; j < NUM_ITERS; j++) {
        for (int i = 0; i < ARRAY_SIZE; i++) {
            total +=  __builtin_ffs(nums[i]);
        }
    }

    return total;
}


int find_first_bits_branch_free_mask(unsigned nums[ARRAY_SIZE])
{
    int total = 0; // Prevent compiler from optimizing out the code
    for (int j = 0; j < NUM_ITERS; j++) {
        for (int i = 0; i < ARRAY_SIZE; i++) {
            unsigned value = nums[i];
            int i16 = !(value & 0xffff) << 4;
            value >>= i16;

            int i8 = !(value & 0xff) << 3;
            value >>= i8;

            int i4 = !(value & 0xf) << 2;
            value >>= i4;

            int i2 = !(value & 0x3) << 1;
            value >>= i2;

            int i1 = !(value & 0x1);

            int i0 = (value >> i1) & 1? 0 : -32;

            total += i16 + i8 + i4 + i2 + i1 + i0 + 1;
        }
    }

    return total;
}


int find_first_bits_double_hack(unsigned nums[ARRAY_SIZE])
{
    int total = 0; // Prevent compiler from optimizing out the code
    for (int j = 0; j < NUM_ITERS; j++) {
        for (int i = 0; i < ARRAY_SIZE; i++) {
            unsigned value = nums[i];
            double d = value ^ (value - !!value); 
            total += (((int*)&d)[1]>>20)-1022; 
        }
    }

    return total;
}


int main() {
    unsigned nums[ARRAY_SIZE];
    for (int i = 0; i < ARRAY_SIZE; i++) {
        nums[i] = rand() + (rand() << 15);
    }

    for (int i = 0; i < 256; i++) {
        lowestBitTable[i] = get_lowest_set_bit(i);
    }


    clock_t start_time, end_time;
    int result;

    start_time = clock();
    result = find_first_bits_naive_loop(nums);
    end_time = clock();
    printf("Naive loop.         Time = %.2f, result = %d\n", 
        (end_time - start_time) / (double)(CLOCKS_PER_SEC), result);

    start_time = clock();
    result = find_first_bits_de_bruijn(nums);
    end_time = clock();
    printf("De Bruijn multiply. Time = %.2f, result = %d\n", 
        (end_time - start_time) / (double)(CLOCKS_PER_SEC), result);

    start_time = clock();
    result = find_first_bits_lookup_table(nums);
    end_time = clock();
    printf("Lookup table.       Time = %.2f, result = %d\n", 
        (end_time - start_time) / (double)(CLOCKS_PER_SEC), result);

    start_time = clock();
    result = find_first_bits_ffs_instruction(nums);
    end_time = clock();
    printf("FFS instruction.    Time = %.2f, result = %d\n", 
        (end_time - start_time) / (double)(CLOCKS_PER_SEC), result);

    start_time = clock();
    result = find_first_bits_branch_free_mask(nums);
    end_time = clock();
    printf("Branch free mask.   Time = %.2f, result = %d\n", 
        (end_time - start_time) / (double)(CLOCKS_PER_SEC), result);

    start_time = clock();
    result = find_first_bits_double_hack(nums);
    end_time = clock();
    printf("Double hack.        Time = %.2f, result = %d\n", 
        (end_time - start_time) / (double)(CLOCKS_PER_SEC), result);
}

Answer 6

Answered by Andrew Grant

The fastest (non-intrinsic/non-assembler) solution to this is to find the lowest non-zero byte and then use that byte in a 256-entry lookup table. This gives you a worst-case performance of four conditional instructions and a best case of 1. Not only is this the fewest instructions, but also the fewest branches, which is super-important on modern hardware.

Your table (256 8-bit entries) should contain the index of the LSB for each number in the range 0-255. You check each byte of your value and find the lowest non-zero byte, then use this value to look up the real index.

This does require 256 bytes of memory, but if the speed of this function is so important then those 256 bytes are well worth it.

E.g.

typedef unsigned char byte;

byte lowestBitTable[256] = {
.... // left as an exercise for the reader to generate
};

unsigned GetLowestBitPos(unsigned value)
{
  // note that the order in which to check the indices depends on whether you
  // are on a big or little endian machine. This is for little-endian
  byte* bytes = (byte*)&value; // address of value, not the value itself
  if (bytes[0])
    return lowestBitTable[bytes[0]];
  else if (bytes[1])
      return lowestBitTable[bytes[1]] + 8;
  else if (bytes[2])
      return lowestBitTable[bytes[2]] + 16;
  else
      return lowestBitTable[bytes[3]] + 24;  
}
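
The table generation left as an exercise above could look something like this. It is a sketch, not the answerer's code: the names LowestBitTable and lowest_bit_pos_table are invented, the table is filled at run time on first use, and the bytes are picked out with shifts instead of a pointer cast so the byte order of the machine does not matter.

```cpp
#include <cassert>

// Hypothetical helper type: pos[b] is the 0-based position of the lowest
// set bit of byte b. pos[0] is never consulted and is left at 0.
struct LowestBitTable {
    unsigned char pos[256];
    LowestBitTable()
    {
        pos[0] = 0;
        for (unsigned b = 1; b < 256; ++b) {
            unsigned char p = 0;
            while (!((b >> p) & 1u)) ++p;
            pos[b] = p;
        }
    }
};

unsigned lowest_bit_pos_table(unsigned value)
{
    static const LowestBitTable table; // built once, on the first call
    assert(value != 0);
    // Shifting selects bytes by significance, so no endianness assumption.
    if (value & 0xFFu)         return table.pos[value & 0xFFu];
    if ((value >> 8) & 0xFFu)  return table.pos[(value >> 8) & 0xFFu] + 8u;
    if ((value >> 16) & 0xFFu) return table.pos[(value >> 16) & 0xFFu] + 16u;
    return table.pos[(value >> 24) & 0xFFu] + 24u;
}
```

The shifts cost a little compared with the pointer cast, so treat this as a correctness sketch rather than a tuned version.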

Answer 7

Answered by Dan

OMG, has this just spiraled.

What most of these examples lack is a little understanding of how hardware works.

Anytime you have a branch, the CPU has to guess which branch will be taken. The instruction pipe is loaded with the instructions that lead down the guessed path. If the CPU has guessed wrong, the instruction pipe gets flushed and the other branch must be loaded.

Consider the simple while loop at the top. The guess will be to stay within the loop. It will be wrong at least once, when it leaves the loop. This WILL flush the instruction pipe. This behavior is slightly better than guessing that it will leave the loop, in which case it would flush the instruction pipe on every iteration.

The number of CPU cycles lost varies greatly from one type of processor to the next, but you can expect between 20 and 150 lost CPU cycles.

The next worse group is where you think you're going to save a few iterations by splitting the value into smaller pieces and adding several more branches. Each of these branches adds another opportunity to flush the instruction pipe and costs another 20 to 150 clock cycles.

Let's consider what happens when you look up a value in a table. Chances are the value is not currently in cache, at least not the first time your function is called. This means that the CPU stalls while the value is loaded into the cache. Again, this varies from one machine to the next. The new Intel chips actually use this as an opportunity to swap threads while the current thread is waiting for the cache load to complete. This could easily be more expensive than an instruction pipe flush; however, if you are performing this operation a number of times, it is likely to occur only once.

Clearly the fastest constant-time solution is one which involves deterministic math. A pure and elegant solution.

My apologies if this was already covered.

Every compiler I use, except Xcode AFAIK, has compiler intrinsics for both the forward bitscan and the reverse bitscan. These will compile to a single assembly instruction on most hardware, with no cache miss, no branch misprediction, and no other programmer-generated stumbling blocks.

For Microsoft compilers use _BitScanForward & _BitScanReverse.
For GCC use __builtin_ffs, __builtin_clz, __builtin_ctz.
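
A small wrapper over the two intrinsic families named above might look like this. It is a sketch under assumptions: the wrapper name lowest_bit_pos_intrinsic is made up, _MSC_VER is used to detect Microsoft-compatible compilers, and everything else is assumed to be GCC or Clang.

```cpp
#include <cassert>
#if defined(_MSC_VER)
#include <intrin.h> // declares _BitScanForward
#endif

// Hypothetical wrapper name; returns the 0-based position of the lowest set bit.
unsigned lowest_bit_pos_intrinsic(unsigned value)
{
    assert(value != 0); // neither intrinsic gives a usable result for 0
#if defined(_MSC_VER)
    unsigned long index;
    _BitScanForward(&index, value); // writes the bit position into index
    return static_cast<unsigned>(index);
#else
    return static_cast<unsigned>(__builtin_ctz(value)); // count trailing zeros
#endif
}
```

Which instruction this actually compiles to depends on the compiler and target flags, so it is worth checking the generated assembly before relying on it.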

Additionally, please refrain from posting an answer and potentially misleading newcomers if you are not adequately knowledgeable about the subject being discussed.

Sorry, I totally forgot to provide a solution. This is the code I use on the iPad, which has no assembly-level instruction for the task:

unsigned BitScanLow_BranchFree(unsigned value)
{
    bool bwl = (value & 0x0000ffff) == 0;
    unsigned I1 = (bwl * 15);
    value = (value >> I1) & 0x0000ffff;

    bool bbl = (value & 0x00ff00ff) == 0;
    unsigned I2 = (bbl * 7);
    value = (value >> I2) & 0x00ff00ff;

    bool bnl = (value & 0x0f0f0f0f) == 0;
    unsigned I3 = (bnl * 3);
    value = (value >> I3) & 0x0f0f0f0f;

    bool bsl = (value & 0x33333333) == 0;
    unsigned I4 = (bsl * 1);
    value = (value >> I4) & 0x33333333;

    unsigned result = value + I1 + I2 + I3 + I4 - 1;

    return result;
}

The thing to understand here is that it is not the compare that is expensive, but the branch that occurs after the compare. The comparison in this case is forced to a value of 0 or 1 with the .. == 0, and the result is used to combine the math that would have occurred on either side of the branch.

Edit:

The code above is totally broken. This code works and is still branch-free (if optimized):

int BitScanLow_BranchFree(unsigned value) // the original used "ui", presumably a typedef for unsigned int
{
    int i16 = !(value & 0xffff) << 4;
    value >>= i16;

    int i8 = !(value & 0xff) << 3;
    value >>= i8;

    int i4 = !(value & 0xf) << 2;
    value >>= i4;

    int i2 = !(value & 0x3) << 1;
    value >>= i2;

    int i1 = !(value & 0x1);

    int i0 = (value >> i1) & 1? 0 : -32;

    return i16 + i8 + i4 + i2 + i1 + i0;
}

This returns -1 if given 0. If you don't care about 0 or are happy to get 31 for 0, remove the i0 calculation, saving a chunk of time.

Answer 8

Answered by DocMax

Inspired by this similar post that involves searching for a set bit, I offer the following:

unsigned GetLowestBitPos(unsigned value)
{
   double d = value ^ (value - !!value); 
   return (((int*)&d)[1]>>20)-1023; 
}

Pros:

no loops
no branching
runs in constant time
handles value=0 by returning an otherwise-out-of-bounds result
only two lines of code

Cons:

assumes little endianness as coded (can be fixed by changing the constants)
assumes that double is a real*8 IEEE float (IEEE 754)

Update: As pointed out in the comments, a union is a cleaner implementation (for C, at least) and would look like:

unsigned GetLowestBitPos(unsigned value)
{
    union {
        int i[2];
        double d;
    } temp = { .d = value ^ (value - !!value) };
    return (temp.i[1] >> 20) - 1023;
}

This assumes 32-bit ints with little-endian storage for everything (think x86 processors).
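
For C++, where reading the inactive member of a union is not allowed, one possible variant copies the bits out with memcpy. This is a sketch rather than the code from the answer: the name lowest_bit_pos_double is made up, and it assumes an IEEE 754 64-bit double stored with the same byte order as 64-bit integers.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical name; same idea as above, with the exponent read via memcpy.
unsigned lowest_bit_pos_double(unsigned value)
{
    assert(value != 0);
    // value ^ (value - 1) sets every bit up to and including the lowest set
    // bit, giving 2^(k+1) - 1, whose floating-point exponent is k.
    double d = static_cast<double>(value ^ (value - 1u));
    std::uint64_t bits;
    static_assert(sizeof bits == sizeof d, "sketch assumes a 64-bit double");
    std::memcpy(&bits, &d, sizeof bits);
    // Bits 52..62 hold the exponent biased by 1023; the sign bit is 0 here.
    return static_cast<unsigned>(bits >> 52) - 1023u;
}
```

Reading the whole 64-bit pattern and shifting by 52 avoids having to know which 32-bit half holds the exponent.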

Answer 9

Answered by Brian R. Bondy

It can be done with a worst case of fewer than 32 operations:

Principle: Checking for 2 or more bits is just as efficient as checking for 1 bit.

So for example there's nothing stopping you from first checking which group it is in, then checking each bit from smallest to biggest in that group.

So...
if you check 2 bits at a time you have in the worst case (Nbits/2) + 1 checks total.
if you check 3 bits at a time you have in the worst case (Nbits/3) + 2 checks total.
...

Optimal would be to check in groups of 4, which would require 11 operations in the worst case (32/4 = 8 group checks plus 3 checks inside the group) instead of your 32.

The best case goes from your algorithm's 1 check to 2 checks if you use this grouping idea. But that extra check in the best case is worth it for the worst-case savings.

Note: I write it out in full instead of using a loop because it's more efficient that way.

int getLowestBitPos(unsigned int value)
{
    //Group 1: Bits 0-3
    if(value&0xf)
    {
        if(value&0x1)
            return 0;
        else if(value&0x2)
            return 1;
        else if(value&0x4)
            return 2;
        else
            return 3;
    }

    //Group 2: Bits 4-7
    if(value&0xf0)
    {
        if(value&0x10)
            return 4;
        else if(value&0x20)
            return 5;
        else if(value&0x40)
            return 6;
        else
            return 7;
    }

    //Group 3: Bits 8-11
    if(value&0xf00)
    {
        if(value&0x100)
            return 8;
        else if(value&0x200)
            return 9;
        else if(value&0x400)
            return 10;
        else
            return 11;
    }

    //Group 4: Bits 12-15
    if(value&0xf000)
    {
        if(value&0x1000)
            return 12;
        else if(value&0x2000)
            return 13;
        else if(value&0x4000)
            return 14;
        else
            return 15;
    }

    //Group 5: Bits 16-19
    if(value&0xf0000)
    {
        if(value&0x10000)
            return 16;
        else if(value&0x20000)
            return 17;
        else if(value&0x40000)
            return 18;
        else
            return 19;
    }

    //Group 6: Bits 20-23
    if(value&0xf00000)
    {
        if(value&0x100000)
            return 20;
        else if(value&0x200000)
            return 21;
        else if(value&0x400000)
            return 22;
        else
            return 23;
    }

    //Group 7: Bits 24-27
    if(value&0xf000000)
    {
        if(value&0x1000000)
            return 24;
        else if(value&0x2000000)
            return 25;
        else if(value&0x4000000)
            return 26;
        else
            return 27;
    }

    //Group 8: Bits 28-31
    if(value&0xf0000000)
    {
        if(value&0x10000000)
            return 28;
        else if(value&0x20000000)
            return 29;
        else if(value&0x40000000)
            return 30;
        else
            return 31;
    }

    return -1;
}

Answer 10

Answered by soulmerge

Why not use binary search? This will always complete after 5 operations (assuming int size of 4 bytes):

if (0x0000FFFF & value) {
    if (0x000000FF & value) {
        if (0x0000000F & value) {
            if (0x00000003 & value) {
                if (0x00000001 & value) {
                    return 1;
                } else {
                    return 2;
                }
            } else {
                if (0x0000004 & value) {
                    return 3;
                } else {
                    return 4;
                }
            }
        } else { ...
    } else { ...
} else { ...
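
The elided branches follow the same pattern. As an alternative way to write the whole search compactly, here is a sketch that halves the range with shifts instead of nesting; it is a flattened variant, not the answer's exact code. The name lowest_bit_pos_bsearch is made up, a 32-bit unsigned int is assumed, and the result is 1-based to match the snippet above.

```cpp
#include <cassert>

// Hypothetical name; returns the 1-based position like the snippet above.
unsigned lowest_bit_pos_bsearch(unsigned value)
{
    assert(value != 0);
    unsigned pos = 1;
    // Each step tests the low half of the remaining range; if it is empty,
    // the answer lies in the high half, so shift it down and add the offset.
    if (!(value & 0x0000FFFFu)) { value >>= 16; pos += 16; }
    if (!(value & 0x000000FFu)) { value >>= 8;  pos += 8; }
    if (!(value & 0x0000000Fu)) { value >>= 4;  pos += 4; }
    if (!(value & 0x00000003u)) { value >>= 2;  pos += 2; }
    if (!(value & 0x00000001u)) { pos += 1; }
    return pos;
}
```

Five tests are always executed, which matches the "5 operations" count; subtract 1 from the result to get the 0-based position the question asks for.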
