C语言 查找位数组中设置的最高有效位(最左侧)
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/2589096/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
Find most significant bit (left-most) that is set in a bit array
提问by Claudiu
I have a bit array implementation where the 0th index is the MSB of the first byte in an array, the 8th index is the MSB of the second byte, etc...
我有一个位数组实现,其中第 0 个索引是数组中第一个字节的 MSB,第 8 个索引是第二个字节的 MSB,等等......
What's a fast way to find the first bit that is set in this bit array? All the related solutions I have looked up find the first least significant bit, but I need the first most significant one. So, given 0x00A1, I want 8 (since it's the 9th bit from the left).
找到在此位数组中设置的第一位的快速方法是什么?我查找的所有相关解决方案都找到了第一个最不重要的位,但我需要第一个最重要的位。所以,给定 0x00A1,我想要 8(因为它是左边的第 9 位)。
回答by Andras Vass
GCC has __builtin_clzthat translates to BSR on x86/x64, CLZ on ARM, etc. and emulates the instruction if the hardware does not implement it.
Visual C++ 2005 and up has _BitScanReverse.
GCC__builtin_clz可以在 x86/x64 上转换为 BSR,在 ARM 上转换为 CLZ 等,如果硬件没有实现它,则模拟指令。
Visual C++ 2005 及更高版本具有_BitScanReverse.
回答by johnwbyrd
tl:dr; For 32 bits, use de Bruijn multiplication.
tl:博士;对于 32 位,使用de Bruijn 乘法。
It's the "fastest"portable algorithm. It is substantially faster and more correct than all the other portable 32-bit MSB algorithms in this thread.
这是“最快”的便携式算法。它比该线程中的所有其他可移植 32 位 MSB 算法更快、更正确。
The de Bruijn algorithm also returns a correct result when the input is zero. The __builtin_clz and _BitScanReverse instructions return incorrect resultswhen the input is zero.
当输入为零时,de Bruijn 算法也会返回正确的结果。 __builtin_clz 和 _BitScanReverse 指令在输入为零时返回不正确的结果。
On Windows x86-64, de Bruijn multiplication runs at a speed comparable to the equivalent (flawed) Windows function, with a performance difference of only around 3%.
在 Windows x86-64 上,de Bruijn 乘法的运行速度与等效(有缺陷的)Windows 函数相当,性能差异仅为 3% 左右。
Here's the code.
这是代码。
u32 msbDeBruijn32( u32 v )
{
static const int MultiplyDeBruijnBitPosition[32] =
{
0, 9, 1, 10, 13, 21, 2, 29, 11, 14, 16, 18, 22, 25, 3, 30,
8, 12, 20, 28, 15, 17, 24, 7, 19, 27, 23, 6, 26, 5, 4, 31
};
v |= v >> 1; // first round down to one less than a power of 2
v |= v >> 2;
v |= v >> 4;
v |= v >> 8;
v |= v >> 16;
return MultiplyDeBruijnBitPosition[( u32 )( v * 0x07C4ACDDU ) >> 27];
}
All the other answers in this thread either run much more poorly than their authors suggest, or don't calculate the result correctly, or both. Let's benchmark them all, and let's verify that they do what they claim to do.
该线程中的所有其他答案要么运行得比作者建议的要差得多,要么没有正确计算结果,或者两者兼而有之。让我们对它们进行基准测试,并验证它们是否做到了他们声称要做的事情。
Here's a simple C++11 harness to test all these implementations. It compiles clean on Visual Studio but should work on all modern compilers. It allows you to run the benchmark in performance mode (bVerifyResults = false) and in checking mode (bVerifyResults = true).
这是一个简单的 C++11 工具来测试所有这些实现。它在 Visual Studio 上编译干净,但应该适用于所有现代编译器。它允许您在性能模式 (bVerifyResults = false) 和检查模式 (bVerifyResults = true) 下运行基准测试。
Here are the results in verification mode:
以下是验证模式下的结果:
Verification failed for msbNative64: input was 0; output was 818af060; expected 0
Verification failed for msbFfs: input was 22df; output was 0; expected d
Verification failed for msbPerformanceJunkie32: input was 0; output was ffffffff; expected 0
Verification failed for msbNative32: input was 0; output was 9ab07060; expected 0
The "performance junkie" and the Microsoft native implementations do different things when the input is zero. msbPerformanceJunkie32 produces -1, and Microsoft's _BitScanReverse produces a random number, consistent with the underlying hardware instruction. Also the msbPerformanceJunkie32 implementation produces a result that is off by one from all the other answers.
当输入为零时,“性能瘾君子”和 Microsoft 本地实现会做不同的事情。msbPerformanceJunkie32产生-1,微软的_BitScanReverse产生随机数,与底层硬件指令一致。此外, msbPerformanceJunkie32 实现产生的结果与所有其他答案相差甚远。
Here are the results in performance mode, running on my i7-4600 laptop, compiled in release mode:
以下是在我的 i7-4600 笔记本电脑上运行的性能模式下的结果,在发布模式下编译:
msbLoop64 took 2.56751 seconds
msbNative64 took 0.222197 seconds
msbLoop32 took 1.43456 seconds
msbFfs took 0.525097 seconds
msbPerformanceJunkie32 took 1.07939 seconds
msbDeBruijn32 took 0.224947 seconds
msbNative32 took 0.218275 seconds
The de Bruijn version beats the other implementations soundlybecause it is branchless, and therefore it runs well against inputs that produce an evenly distributed set of outputs. All the other versions are slower against arbitrary inputs because of the penalties of branch misprediction on modern CPUs. The smbFfs function produces incorrect results so it can be ignored.
de Bruijn 版本很好地击败了其他实现,因为它是无分支的,因此它在生成一组均匀分布的输出的输入上运行良好。由于现代 CPU 上分支错误预测的惩罚,所有其他版本对任意输入都较慢。smbFfs 函数产生不正确的结果,因此可以忽略。
Some of the implementations work on 32 bit inputs, and some work on 64 bit inputs. A template will help us compare apples to apples, regardless of the input size.
有些实现适用于 32 位输入,有些适用于 64 位输入。无论输入大小如何,模板都将帮助我们将苹果与苹果进行比较。
Here's the code. Download and run the benchmarks yourself if you like.
这是代码。如果您愿意,可以自行下载并运行基准测试。
#include <iostream>
#include <chrono>
#include <random>
#include <cassert>
#include <string>
#include <limits>
#ifdef _MSC_VER
#define MICROSOFT_COMPILER 1
#include <intrin.h>
#endif // _MSC_VER
const int iterations = 100000000;
bool bVerifyResults = false;
std::random_device rd;
std::default_random_engine re(rd());
typedef unsigned int u32;
typedef unsigned long long u64;
class Timer
{
public:
Timer() : beg_(clock_::now()) {}
void reset() {
beg_ = clock_::now();
}
double elapsed() const {
return std::chrono::duration_cast<second_>
(clock_::now() - beg_).count();
}
private:
typedef std::chrono::high_resolution_clock clock_;
typedef std::chrono::duration<double, std::ratio<1> > second_;
std::chrono::time_point<clock_> beg_;
};
unsigned int msbPerformanceJunkie32(u32 x)
{
static const unsigned int bval[] =
{ 0,1,2,2,3,3,3,3,4,4,4,4,4,4,4,4 };
unsigned int r = 0;
if (x & 0xFFFF0000) {
r += 16 / 1;
x >>= 16 / 1;
}
if (x & 0x0000FF00) {
r += 16 / 2;
x >>= 16 / 2;
}
if (x & 0x000000F0) {
r += 16 / 4;
x >>= 16 / 4;
}
return r + bval[x];
}
#define FFS(t) \
{ \
register int n = 0; \
if (!(0xffff & t)) \
n += 16; \
if (!((0xff << n) & t)) \
n += 8; \
if (!((0xf << n) & t)) \
n += 4; \
if (!((0x3 << n) & t)) \
n += 2; \
if (!((0x1 << n) & t)) \
n += 1; \
return n; \
}
unsigned int msbFfs32(u32 x)
{
FFS(x);
}
unsigned int msbLoop32(u32 x)
{
int r = 0;
if (x < 1) return 0;
while (x >>= 1) r++;
return r;
}
unsigned int msbLoop64(u64 x)
{
int r = 0;
if (x < 1) return 0;
while (x >>= 1) r++;
return r;
}
u32 msbDeBruijn32(u32 v)
{
static const int MultiplyDeBruijnBitPosition[32] =
{
0, 9, 1, 10, 13, 21, 2, 29, 11, 14, 16, 18, 22, 25, 3, 30,
8, 12, 20, 28, 15, 17, 24, 7, 19, 27, 23, 6, 26, 5, 4, 31
};
v |= v >> 1; // first round down to one less than a power of 2
v |= v >> 2;
v |= v >> 4;
v |= v >> 8;
v |= v >> 16;
return MultiplyDeBruijnBitPosition[(u32)(v * 0x07C4ACDDU) >> 27];
}
#ifdef MICROSOFT_COMPILER
u32 msbNative32(u32 val)
{
unsigned long result;
_BitScanReverse(&result, val);
return result;
}
u32 msbNative64(u64 val)
{
unsigned long result;
_BitScanReverse64(&result, val);
return result;
}
#endif // MICROSOFT_COMPILER
template <typename InputType>
void test(unsigned int msbFunc(InputType),
const std::string &name,
const std::vector< InputType > &inputs,
std::vector< unsigned int > &results,
bool bIsReference = false
)
{
if (bIsReference)
{
int i = 0;
for (int i = 0; i < iterations; i++)
results[i] = msbFunc(inputs[i]);
}
InputType result;
if (bVerifyResults)
{
bool bNotified = false;
for (int i = 0; i < iterations; i++)
{
result = msbFunc(inputs[i]);
if ((result != results[i]) && !bNotified)
{
std::cout << "Verification failed for " << name << ": "
<< "input was " << std::hex << inputs[i]
<< "; output was " << result
<< "; expected " << results[i]
<< std::endl;
bNotified = true;
}
}
}
else
{
Timer t;
for (int i = 0; i < iterations; i++)
{
result = msbFunc(inputs[i]);
}
double elapsed = t.elapsed();
if ( !bIsReference )
std::cout << name << " took " << elapsed << " seconds" << std::endl;
if (result == -1.0f)
std::cout << "this comparison only exists to keep the compiler from " <<
"optimizing out the benchmark; this branch will never be called";
}
}
void main()
{
std::uniform_int_distribution <u64> dist64(0,
std::numeric_limits< u64 >::max());
std::uniform_int_distribution <u32> shift64(0, 63);
std::vector< u64 > inputs64;
for (int i = 0; i < iterations; i++)
{
inputs64.push_back(dist64(re) >> shift64(re));
}
std::vector< u32 > results64;
results64.resize(iterations);
test< u64 >(msbLoop64, "msbLoop64", inputs64, results64, true);
test< u64 >(msbLoop64, "msbLoop64", inputs64, results64, false);
#ifdef MICROSOFT_COMPILER
test< u64 >(msbNative64, "msbNative64", inputs64, results64, false);
#endif // MICROSOFT_COMPILER
std::cout << std::endl;
std::uniform_int_distribution <u32> dist32(0,
std::numeric_limits< u32 >::max());
std::uniform_int_distribution <u32> shift32(0, 31);
std::vector< u32 > inputs32;
for (int i = 0; i < iterations; i++)
inputs32.push_back(dist32(re) >> shift32(re));
std::vector< u32 > results32;
results32.resize(iterations);
test< u32 >(msbLoop32, "msbLoop32", inputs32, results32, true);
test< u32 >(msbLoop32, "msbLoop32", inputs32, results32, false);
test< u32 >(msbFfs32, "msbFfs", inputs32, results32, false);
test< u32 >(msbPerformanceJunkie32, "msbPerformanceJunkie32",
inputs32, results32, false);
test< u32 >(msbDeBruijn32, "msbDeBruijn32", inputs32, results32, false);
#ifdef MICROSOFT_COMPILER
test< u32 >(msbNative32, "msbNative32", inputs32, results32, false);
#endif // MICROSOFT_COMPILER
}
回答by Sir Slick
As a performance junkie I have tried a ton of variations for MSB set, the following is the fastest I have come across,
作为一个性能迷,我尝试了大量 MSB 集的变化,以下是我遇到的最快的,
unsigned int msb32(unsigned int x)
{
static const unsigned int bval[] =
{0,1,2,2,3,3,3,3,4,4,4,4,4,4,4,4};
unsigned int r = 0;
if (x & 0xFFFF0000) { r += 16/1; x >>= 16/1; }
if (x & 0x0000FF00) { r += 16/2; x >>= 16/2; }
if (x & 0x000000F0) { r += 16/4; x >>= 16/4; }
return r + bval[x];
}
回答by Arkku
There are multiple ways to do this, and the relative performance of the different implementations is somewhat machine-dependent (I happen to have benchmarked this to some extent for a similar purpose). On some machines there's even a built-in instruction for this (use one if available and portability can be dealt with).
有多种方法可以做到这一点,不同实现的相对性能在某种程度上取决于机器(出于类似目的,我碰巧在某种程度上对此进行了基准测试)。在某些机器上,甚至有一个内置的指令(如果可用并且可以处理便携性,请使用一个)。
Check out some implementations here(under “integer log base 2”). If you are using GCC, check out the functions __builtin_clzand __builtin_clzl(which do this for non-zero unsigned ints and unsigned longs, respectively). The “clz” stands for “count leading zeros”, which is yet another way to describe the same problem.
在此处查看一些实现(在“整数日志基数 2”下)。如果您使用 GCC,请查看函数__builtin_clz和__builtin_clzl(分别针对非零无符号整数和无符号长整数执行此操作)。“clz”代表“计数前导零”,这是描述同一问题的另一种方式。
Of course, if your bit array does not fit into a suitable machine word, you need to iterate over words in the array to find the first non-zero word and then perform this calculation only on that word.
当然,如果您的位数组不适合合适的机器字,您需要遍历数组中的字以找到第一个非零字,然后仅对该字执行此计算。
回答by ggiroux
Look up the BSR (Bit scan reverse) x86 asm instruction for the fastest way to do this. From Intel's doc:
Searches the source operand (second operand) for the most significant set bit (1 bit).
If a most significant 1 bit is found, its bit index is stored in the destination operand
(first operand).
查找 BSR(位扫描反向)x86 asm 指令以获得最快的方法。来自英特尔的文档:
Searches the source operand (second operand) for the most significant set bit (1 bit).
If a most significant 1 bit is found, its bit index is stored in the destination operand
(first operand).
回答by Martin Beckett
回答by Mischa
If you're using x86, you can beat practically any byte-by-byte or word-by-word solution using the SSE2 operations, combined with the find-first-bit instructions, which (in the gcc world) are pronounced "ffs" for the lowest bit and "fls" for the highest bit. Pardon me for having trouble (!@#$%^) formatting "C" code in an answer; check out: http://mischasan.wordpress.com/2011/11/03/sse2-bit-trick-ffsfls-for-xmm-registers/
如果您使用的是 x86,您几乎可以使用 SSE2 操作击败任何逐字节或逐字的解决方案,并结合 find-first-bit 指令,这些指令(在 gcc 世界中)发音为“ffs " 代表最低位, "fls" 代表最高位。请原谅我在答案中格式化“C”代码时遇到麻烦(!@#$%^);退房:http: //mischasan.wordpress.com/2011/11/03/sse2-bit-trick-ffsfls-for-xmm-registers/
回答by David C. Rankin
I have worked with a number of functions to get the most significant bit, but problems generally arise moving between 32 and 64 bit numbers or moving between x86_64 and x86 boxes. The functions __builtin_clz, __builtin_clzland __builtin_clzllwork well for 32/64 bit numbers and across x86_64 and x86 machines. However, three functions are required. I have found a simple MSB that relies on right-shift that will handle all cases for positive numbers. At least for the use I make of it, it has succeeded where others have failed:
我使用了许多函数来获取最重要的位,但是在 32 和 64 位数字之间移动或在 x86_64 和 x86 框之间移动时通常会出现问题。函数__builtin_clz, __builtin_clzland__builtin_clzll适用于 32/64 位数字以及跨 x86_64 和 x86 机器。但是,需要三个功能。我发现了一个简单的 MSB,它依赖于右移,可以处理正数的所有情况。至少在我使用它时,它在其他人失败的地方取得了成功:
int
getmsb (unsigned long long x)
{
int r = 0;
if (x < 1) return 0;
while (x >>= 1) r++;
return r;
}
By designating input as unsigned long longit can handle all number classes from unsigned charto unsigned long longand given the standard definition, it is compatible across x86_64 and x86 builds. The case for 0is defined to return 0, but can be changed as required. A simple test and output are:
通过指定输入,因为unsigned long long它可以处理从unsigned chartounsigned long long和给定标准定义的所有数字类,它在 x86_64 和 x86 构建之间兼容。的情况0定义为 return 0,但可以根据需要进行更改。一个简单的测试和输出是:
int
main (int argc, char *argv[]) {
unsigned char c0 = 0;
unsigned char c = 216;
unsigned short s = 1021;
unsigned int ui = 32768;
unsigned long ul = 3297381253;
unsigned long long ull = 323543844043;
int i = 32767;
printf (" %16u MSB : %d\n", c0, getmsb (c0));
printf (" %16u MSB : %d\n", c, getmsb (c));
printf (" %16u MSB : %d\n", s, getmsb (s));
printf (" %16u MSB : %d\n", i, getmsb (i));
printf (" %16u MSB : %d\n", ui, getmsb (ui));
printf (" %16lu MSB : %d\n", ul, getmsb (ul));
printf (" %16llu MSB : %d\n", ull, getmsb (ull));
return 0;
}
Output:
输出:
0 MSB : 0
216 MSB : 7
1021 MSB : 9
32767 MSB : 14
32768 MSB : 15
3297381253 MSB : 31
323543844043 MSB : 38
NOTE:for speed considerations, using a single function to accomplish the same thing centered around __builtin_clzllis still faster by a factor of about 6.
注意:出于速度方面的考虑,使用单个函数完成以中心为中心的相同操作的__builtin_clzll速度仍然要快 6 倍左右。
回答by Richard wicks
I'll add one!
我加一个!
typedef unsigned long long u64;
typedef unsigned int u32;
typedef unsigned char u8;
u8 findMostSignificantBit (u64 u64Val)
{
u8 u8Shift;
u8 u8Bit = 0;
assert (u64Val != 0ULL);
for (u8Shift = 32 ; u8Shift != 0 ; u8Shift >>= 1)
{
u64 u64Temp = u64Val >> u8Shift;
if (u64Temp)
{
u8Bit |= u8Shift; // notice not using +=
u64Val = u64Temp;
}
}
return u8Bit;
}
Of course, this is working on a 64 bit number (unsigned long long), and not an array. Also, plenty of people have pointed to inbuilt g++ functions I was not aware of. How interesting.
当然,这适用于 64 位数字(unsigned long long),而不是数组。此外,很多人都指出了我不知道的内置 g++ 函数。很有意思。
Anyhow, this finds the most significant bit in 6 iterations and gives an assert if you passed 0 to the function. Not the best function to use if you have access to an instruction of the chipset.
无论如何,这会在 6 次迭代中找到最高有效位,并在您将 0 传递给函数时给出断言。如果您可以访问芯片组的说明,这不是最好的功能。
I also am also using |= instead of += because these are always powers of two, and OR is (classically) faster than addition. Since I'm only adding unique powers of 2 together, I never have roll over.
我也使用 |= 而不是 += 因为它们总是 2 的幂,并且 OR(通常)比加法快。由于我只是将 2 的唯一幂相加,因此我从未翻转过。
This is a binary search which means it always finds the result in 6 iterations.
这是一个二分搜索,这意味着它总是在 6 次迭代中找到结果。
Again, this is better:
同样,这更好:
u8 findMostSignificantBit2 (u64 u64Val)
{
assert (u64Val != 0ULL);
return (u8) (__builtin_ctzll(u64Val));
}
回答by Peter Cordes
x86 has a BSR instruction that returns a bit-index (rather than the count of leading zeros aboveit).
x86 有一个 BSR 指令,它返回一个位索引(而不是它上面的前导零的计数)。
But unfortunately there's no portable intrinsic that efficientlyexposes it for all compilers. GNU C provides __builtin_clz, but unsigned bitidx = 31 - __builtin_clz(x);doesn't optimize back to just BSR with current GCC and ICC. (It does with clang, which proves that the expression is equivalent so it could).
但不幸的是,没有可移植的内在函数可以有效地为所有编译器公开它。GNU C 提供了__builtin_clz,但unsigned bitidx = 31 - __builtin_clz(x);没有优化回仅使用当前 GCC 和 ICC 的 BSR。(它与 clang 一样,这证明表达式是等价的,所以它可以)。
The following defines BSR32()and BSR64()macros or functions that compile efficiently to justa bsrinstruction on x86. (Producing a garbage result if the input was zero. There's no way with intrinsics to take advantage of the asm instruction's behaviour of leaving the destination unmodified for input=0.)
以下定义BSR32()和BSR64()宏或功能有效编译为只是一个bsr在x86指令。(如果输入为零,则产生垃圾结果。内在函数无法利用 asm 指令的行为,即在 input=0 时不修改目标。)
Portability to non-x86 would take some additional #ifdefe.g. to fall back to 31-__builtin_clz. Most non-x86 ISAs, if they have a leading-zero bitscan at all, count leading zeros instead of giving you the bit-index. That's why GNU C defines __builtin_clzas the portable builtin. (If there's no HW support on the target system, the builtin will compile to software emulation, usually calling a libgcc helper function.)
对非 x86 的可移植性需要一些额外的时间,#ifdef例如回退到31-__builtin_clz. 大多数非 x86 ISA,如果它们根本具有前导零位扫描,则计算前导零而不是为您提供位索引。这就是 GNU C 定义__builtin_clz为可移植内置函数的原因。(如果目标系统上没有硬件支持,内置程序将编译为软件仿真,通常调用 libgcc 辅助函数。)
#include <stdint.h>
// define BSR32() and BSR64()
#if defined(_MSC_VER) || defined(__INTEL_COMPILER)
#ifdef __INTEL_COMPILER
typedef unsigned int bsr_idx_t;
#else
#include <intrin.h> // MSVC
typedef unsigned long bsr_idx_t;
#endif
static inline
unsigned BSR32(unsigned long x){
bsr_idx_t idx;
_BitScanReverse(&idx, x); // ignore bool retval
return idx;
}
static inline
unsigned BSR64(uint64_t x) {
bsr_idx_t idx;
_BitScanReverse64(&idx, x); // ignore bool retval
return idx;
}
#elif defined(__GNUC__)
#ifdef __clang__
static inline unsigned BSR64(uint64_t x) {
return 63-__builtin_clzll(x);
// gcc/ICC can't optimize this back to just BSR, but clang can and doesn't provide alternate intrinsics
}
#else
#define BSR64 __builtin_ia32_bsrdi
#endif
#include <x86intrin.h>
#define BSR32(x) _bit_scan_reverse(x)
#endif
bsfprobably doesn't need as much help for compilers, because the builtin matches the asm instruction's behaviour of returning the bit-index of the LSB, i.e. the count of trailing zeros.
bsf编译器可能不需要那么多帮助,因为内置函数匹配 asm 指令返回 LSB 位索引的行为,即尾随零的计数。
A test caller unsigned test32(unsigned x) { return BSR32(x); }inlines it to 1 instruction on all the major x86 compilers, on the Godbolt compiler explorer. BSR64 inlines the same way, to a 64-bit operand-size version. See also Is there an x86/x86_64 instruction which zeros all bits below the Most Significant Bit?for example use-cases.
在 Godbolt 编译器资源管理unsigned test32(unsigned x) { return BSR32(x); }器上,测试调用者将其内联到所有主要 x86 编译器上的1 条指令。BSR64 以相同的方式内联到 64 位操作数大小版本。另请参阅是否有 x86/x86_64 指令将最高有效位以下的所有位归零?例如用例。
;; x64 MSVC 19.16 -O2
unsigned int test32(unsigned int) PROC ; test32, COMDAT
bsr eax, ecx
ret 0
unsigned int test32(unsigned int) ENDP ; test32
# clang -O3 -march=haswell is too "smart?" for its own good:
test32(unsigned int):
lzcnt eax, edi
xor eax, 31
ret
# gcc8.2 -O3 -march=haswell
test32(unsigned int):
bsr eax, edi
ret
# ICC19 -O3 -march=haswell
test32(unsigned int):
bsr eax, edi #15.9
ret #41.12
The point of this is to avoid slow code from the portable (to non-MSVC) version:
这样做的目的是为了避免可移植(到非 MSVC)版本的代码缓慢:
#ifdef __GNUC__
unsigned badgcc(uint64_t x) {
return 63 - __builtin_clzll(x);
}
#endif
Without -march=haswellwe get just BSR from clang, but:
没有-march=haswell我们只从 clang 得到 BSR,但是:
# gcc8.2 -O3
badgcc(unsigned long):
bsr rdi, rdi
mov eax, 63
xor rdi, 63
sub eax, edi
ret
# ICC19.0.1 -O3
badgcc(unsigned long):
mov rax, -1 #46.17
bsr rdx, rdi #46.17
cmove rdx, rax #46.17
neg rdx #46.17
add rdx, 63 #46.17
neg edx #46.17
add edx, 63 #46.17
mov eax, edx #46.17
ret #46.17
That's just nasty. (Interesting to see that ICC is doing a CMOV to produce -1if the input is zero. BSR sets ZF according to its input, unlike most instructions which set flags according to the result.)
那太恶心了。(有趣的是,-1如果输入为零,ICC 正在执行 CMOV 以生成。BSR 根据其输入设置 ZF ,这与大多数根据结果设置标志的指令不同。)
With -march=haswell(or otherwise enabling use of BMI1 instructions), it's not as bad, but still not as good as just BSR. Modulo output dependencies, which compilers mostly work to avoid for lzcnt but strangely not for BSR. (Where the output dependency is a truedependency, because of the input=0 behaviour.) Why does breaking the "output dependency" of LZCNT matter?
使用-march=haswell(或以其他方式启用使用 BMI1 指令),它没有那么糟糕,但仍然不如 BSR 好。模输出依赖性,编译器主要为 lzcnt 避免这种依赖,但奇怪的是,对于 BSR 则不然。(因为 input=0 的行为,输出依赖是真正的依赖。) 为什么打破 LZCNT 的“输出依赖”很重要?

