Original question: http://stackoverflow.com/questions/4930307/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverFlow
Fastest way to get the integer part of sqrt(n)?
Asked by Nawaz
As we know, if n is not a perfect square, then sqrt(n) would not be an integer. Since I need only the integer part, I feel that calling sqrt(n) wouldn't be that fast, as it takes time to calculate the fractional part also.
So my question is:
Can we get only the integer part of sqrt(n) without calculating the actual value of sqrt(n)? The algorithm should be faster than sqrt(n) (defined in <math.h> or <cmath>).
If possible, you can also write the code in an asm block.
Answer by Matthieu M.
I would try the Fast Inverse Square Root trick.
It's a way to get a very good approximation of 1/sqrt(n) without any branch, based on some bit-twiddling, so it is not portable (notably between 32-bit and 64-bit platforms).
Once you have it, you just need to invert the result and take the integer part.
There might be faster tricks, of course, since this one is a bit roundabout.
EDIT: let's do it!
First a little helper:
// benchmark.h
#include <cstddef>
#include <sys/time.h>

template <typename Func>
double benchmark(Func f, std::size_t iterations)
{
    f(); // warm-up call

    timeval a, b;
    gettimeofday(&a, 0);
    for (; iterations --> 0;)
    {
        f();
    }
    gettimeofday(&b, 0);

    // elapsed time in microseconds
    return (b.tv_sec * (unsigned int)1e6 + b.tv_usec) -
           (a.tv_sec * (unsigned int)1e6 + a.tv_usec);
}
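For readers without <sys/time.h>, the same helper can be sketched with the standard <chrono> header. This is only an illustration, assuming C++11; the name benchmark_chrono is hypothetical and not part of the original answer. As with the helper above, an optimizing compiler may remove calls whose results are unused, so the numbers should be read with care.

```cpp
#include <chrono>
#include <cstddef>

// Hypothetical portable variant of the benchmark helper.
// Returns the elapsed time in microseconds for `iterations` calls of f.
template <typename Func>
double benchmark_chrono(Func f, std::size_t iterations)
{
    f();  // warm-up call, mirroring the original helper

    const auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < iterations; ++i) {
        f();
    }
    const auto stop = std::chrono::steady_clock::now();

    return std::chrono::duration<double, std::micro>(stop - start).count();
}
```

steady_clock is used here because it is monotonic, so the difference cannot go negative if the wall clock is adjusted during the run.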
Then the main body:
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstdlib>
#include <cstring>
#include <iostream>
#include "benchmark.h"

class Sqrt
{
public:
    Sqrt(int n): _number(n) {}

    int operator()() const
    {
        double d = _number;
        return static_cast<int>(std::sqrt(d) + 0.5); // rounds to nearest
    }

private:
    int _number;
};

// http://www.codecodex.com/wiki/Calculate_an_integer_square_root
class IntSqrt
{
public:
    IntSqrt(int n): _number(n) {}

    int operator()() const
    {
        int remainder = _number;
        if (remainder < 0) { return 0; }

        int place = 1 << (sizeof(int)*8 - 2);
        while (place > remainder) { place /= 4; }

        int root = 0;
        while (place)
        {
            if (remainder >= root + place)
            {
                remainder -= root + place;
                root += place*2; // place*2 overflows int for inputs >= 2^30
            }
            root /= 2;
            place /= 4;
        }
        return root;
    }

private:
    int _number;
};

// http://en.wikipedia.org/wiki/Fast_inverse_square_root
class FastSqrt
{
public:
    FastSqrt(int n): _number(n) {}

    int operator()() const
    {
        float number = _number;
        float x2 = number * 0.5F;
        float y = number;

        // Reinterpret the float's bits as a 32-bit integer. The original
        // cast, *(long*)&y, reads 8 bytes where long is 64 bits wide.
        std::int32_t i;
        std::memcpy(&i, &y, sizeof i);
        //i = (long)0x5fe6ec85e7de30da - (i >> 1);
        i = 0x5f3759df - (i >> 1);
        std::memcpy(&y, &i, sizeof y);

        y = y * (1.5F - (x2*y*y));
        y = y * (1.5F - (x2*y*y)); // let's be precise
        return static_cast<int>(1/y + 0.5f); // rounds to nearest
    }

private:
    int _number;
};

int main(int argc, char* argv[])
{
    if (argc != 3) {
        std::cerr << "Usage: %prog integer iterations\n";
        return 1;
    }

    int n = atoi(argv[1]);
    int it = atoi(argv[2]);

    assert(Sqrt(n)() == IntSqrt(n)() &&
           Sqrt(n)() == FastSqrt(n)() && "Different Roots!");
    std::cout << "sqrt(" << n << ") = " << Sqrt(n)() << "\n";

    double time = benchmark(Sqrt(n), it);
    double intTime = benchmark(IntSqrt(n), it);
    double fastTime = benchmark(FastSqrt(n), it);

    std::cout << "Number iterations: " << it << "\n"
                 "Sqrt computation : " << time << "\n"
                 "Int computation  : " << intTime << "\n"
                 "Fast computation : " << fastTime << "\n";
    return 0;
}
And the results:
sqrt(82) = 9
Number iterations: 4096
Sqrt computation : 56
Int computation : 217
Fast computation : 119
// Note had to tweak the program here as Int here returns -1 :/
sqrt(2147483647) = 46341 // real answer sqrt(2 147 483 647) = 46 340.95
Number iterations: 4096
Sqrt computation : 57
Int computation : 313
Fast computation : 119
Where, as expected, the Fast computation performs much better than the Int computation.
Oh, and by the way, sqrt is faster :)
Answer by orlp
Edit: this answer is foolish - use (int) sqrt(i)
After profiling with proper settings (-march=native -m64 -O3) the above was a lot faster.
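As a rough sketch of the (int) sqrt(i) recommendation, the cast can be wrapped in a small function. The name isqrt_via_double is hypothetical, and going through double rather than float is my own choice here: double represents every 32-bit integer exactly, so the truncated result should already be the integer part, and the two correction loops are only a safeguard.

```cpp
#include <cmath>
#include <cstdint>

// Hypothetical wrapper around the library sqrt for 32-bit inputs.
inline std::uint32_t isqrt_via_double(std::uint32_t n)
{
    // Truncate the floating-point root toward zero.
    std::uint64_t r = static_cast<std::uint64_t>(std::sqrt(static_cast<double>(n)));

    // Defensive adjustment: guarantees r*r <= n < (r+1)*(r+1).
    // 64-bit arithmetic keeps (r+1)*(r+1) from overflowing.
    while (r * r > n) --r;
    while ((r + 1) * (r + 1) <= n) ++r;

    return static_cast<std::uint32_t>(r);
}
```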
Alright, this is a somewhat old question, but the "fastest" answer has not been given yet. The fastest (I think) is the Binary Square Root algorithm, explained fully in this Embedded.com article.
It basically comes down to this:
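#include <stdint.h>

/* Fixed-width types: the shifts below assume a is exactly 32 bits wide
   (unsigned long is 64 bits on many platforms). */
uint16_t isqrt(uint32_t a) {
    uint32_t rem = 0;
    uint32_t root = 0;
    int i;

    for (i = 0; i < 16; i++) {
        root <<= 1;
        rem <<= 2;
        rem += a >> 30;  /* bring down the top two bits of a */
        a <<= 2;

        if (root < rem) {
            root++;
            rem -= root;
            root++;
        }
    }

    return (uint16_t) (root >> 1);
}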
On my machine (Q6600, Ubuntu 10.10) I profiled by taking the square root of the numbers 1-100000000. Using isqrt(i) took 2750 ms. Using (unsigned short) sqrt((float) i) took 3600 ms. This was done using g++ -O3. Using the -ffast-math compile option the times were 2100 ms and 3100 ms respectively. Note this is without using even a single line of assembler, so it could probably still be much faster.
The above code works for both C and C++ and with minor syntax changes also for Java.
What works even better for a limited range is a binary search. On my machine this blows the version above out of the water by a factor of 4. Sadly it's very limited in range:
#include <stdint.h>
const uint16_t squares[] = {
0, 1, 4, 9,
16, 25, 36, 49,
64, 81, 100, 121,
144, 169, 196, 225,
256, 289, 324, 361,
400, 441, 484, 529,
576, 625, 676, 729,
784, 841, 900, 961,
1024, 1089, 1156, 1225,
1296, 1369, 1444, 1521,
1600, 1681, 1764, 1849,
1936, 2025, 2116, 2209,
2304, 2401, 2500, 2601,
2704, 2809, 2916, 3025,
3136, 3249, 3364, 3481,
3600, 3721, 3844, 3969,
4096, 4225, 4356, 4489,
4624, 4761, 4900, 5041,
5184, 5329, 5476, 5625,
5776, 5929, 6084, 6241,
6400, 6561, 6724, 6889,
7056, 7225, 7396, 7569,
7744, 7921, 8100, 8281,
8464, 8649, 8836, 9025,
9216, 9409, 9604, 9801,
10000, 10201, 10404, 10609,
10816, 11025, 11236, 11449,
11664, 11881, 12100, 12321,
12544, 12769, 12996, 13225,
13456, 13689, 13924, 14161,
14400, 14641, 14884, 15129,
15376, 15625, 15876, 16129,
16384, 16641, 16900, 17161,
17424, 17689, 17956, 18225,
18496, 18769, 19044, 19321,
19600, 19881, 20164, 20449,
20736, 21025, 21316, 21609,
21904, 22201, 22500, 22801,
23104, 23409, 23716, 24025,
24336, 24649, 24964, 25281,
25600, 25921, 26244, 26569,
26896, 27225, 27556, 27889,
28224, 28561, 28900, 29241,
29584, 29929, 30276, 30625,
30976, 31329, 31684, 32041,
32400, 32761, 33124, 33489,
33856, 34225, 34596, 34969,
35344, 35721, 36100, 36481,
36864, 37249, 37636, 38025,
38416, 38809, 39204, 39601,
40000, 40401, 40804, 41209,
41616, 42025, 42436, 42849,
43264, 43681, 44100, 44521,
44944, 45369, 45796, 46225,
46656, 47089, 47524, 47961,
48400, 48841, 49284, 49729,
50176, 50625, 51076, 51529,
51984, 52441, 52900, 53361,
53824, 54289, 54756, 55225,
55696, 56169, 56644, 57121,
57600, 58081, 58564, 59049,
59536, 60025, 60516, 61009,
61504, 62001, 62500, 63001,
63504, 64009, 64516, 65025
};
inline int isqrt(uint16_t x) {
    const uint16_t *p = squares;

    if (p[128] <= x) p += 128;
    if (p[ 64] <= x) p +=  64;
    if (p[ 32] <= x) p +=  32;
    if (p[ 16] <= x) p +=  16;
    if (p[  8] <= x) p +=   8;
    if (p[  4] <= x) p +=   4;
    if (p[  2] <= x) p +=   2;
    if (p[  1] <= x) p +=   1;

    return p - squares;
}
A 32-bit version can be downloaded here: https://gist.github.com/3481770
Answer by R.. GitHub STOP HELPING ICE
While I suspect you can find plenty of options by searching for "fast integer square root", here are some potentially new ideas that might work well (each independent, or maybe you can combine them):
- Make a static const array of all the perfect squares in the domain you want to support, and perform a fast branchless binary search on it. The resulting index in the array is the square root.
- Convert the number to floating point and break it into mantissa and exponent. Halve the exponent and multiply the mantissa by some magic factor (your job to find it). This should be able to give you a very close approximation. Include a final step to adjust it if it's not exact (or use it as a starting point for the binary search above).
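The first idea can be sketched as follows for a 16-bit domain. This is a minimal illustration, not the answerer's code: the names kSquares and isqrt16_branchless are hypothetical, the table is filled at startup rather than written out by hand, and whether the compiler really emits branch-free machine code depends on the target and optimization level.

```cpp
#include <array>
#include <cstdint>

// Perfect squares 0^2 .. 255^2, which cover every 16-bit input.
static const std::array<std::uint16_t, 256> kSquares = [] {
    std::array<std::uint16_t, 256> t{};
    for (std::uint32_t i = 0; i < 256; ++i) {
        t[i] = static_cast<std::uint16_t>(i * i);
    }
    return t;
}();

// Binary search with no if statements in the loop body: each comparison
// yields 0 or 1, which is multiplied by the step and added to the index.
inline unsigned isqrt16_branchless(std::uint16_t x)
{
    unsigned idx = 0;
    for (unsigned step = 128; step != 0; step >>= 1) {
        idx += static_cast<unsigned>(kSquares[idx + step] <= x) * step;
    }
    return idx;  // largest index whose square is <= x
}
```

The largest index ever probed is 128 + 64 + ... + 1 = 255, so the search stays inside the table.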
Answer by Shmo
If you don't mind an approximation, how about this integer sqrt function I cobbled together?
int sqrti(int x)
{
    union { float f; int x; } v;

    // convert to float
    v.f = (float)x;

    // fast approx sqrt
    //  assumes float is in IEEE 754 single precision format
    //  assumes int is 32 bits
    //  b = exponent bias
    //  m = number of mantissa bits
    v.x  -= 1 << 23; // subtract 2^m
    v.x >>= 1;       // divide by 2
    v.x  += 1 << 29; // add ((b + 1) / 2) * 2^m

    // convert to int
    return (int)v.f;
}
It uses the algorithm described in this Wikipedia article. On my machine it's almost twice as fast as sqrt :)
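If an exact integer part is needed, the bit trick above can serve as a starting guess that is then refined. The following is a sketch under the same assumptions (IEEE 754 single precision, 32-bit integers); the name isqrt_from_guess, the use of memcpy instead of the union, and the choice of two Newton steps are mine, not the answerer's.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical exact integer square root seeded by the exponent-halving trick.
inline std::uint32_t isqrt_from_guess(std::uint32_t n)
{
    if (n == 0) return 0;

    // Approximate sqrt(n) by manipulating the float's bit pattern.
    float f = static_cast<float>(n);
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);
    bits = ((bits - (1u << 23)) >> 1) + (1u << 29);
    std::memcpy(&f, &bits, sizeof f);

    std::uint64_t r = static_cast<std::uint64_t>(f);  // r >= 1 for n >= 1

    // Two Newton steps shrink the error of the initial guess.
    r = (r + n / r) / 2;
    r = (r + n / r) / 2;

    // Final adjustment so that r*r <= n < (r+1)*(r+1).
    while (r * r > n) --r;
    while ((r + 1) * (r + 1) <= n) ++r;

    return static_cast<std::uint32_t>(r);
}
```

The closing loops make the result exact regardless of how good the guess was; the Newton steps only keep those loops short.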
Answer by Andrew Tomazos
To do integer sqrt you can use this specialization of Newton's method:
def isqrt(N):
    a = 1
    b = N
    while |a-b| > 1:
        b = N / a
        a = (a + b) / 2
    return a
Basically, for any x the sqrt lies in the range (x ... N/x), so we just bisect that interval at every loop for the new guess. Sort of like binary search, but it converges much faster.
This converges in O(loglog(N)), which is very fast. It also doesn't use floating point at all, and it will also work well for arbitrary precision integers.
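A runnable C++ sketch of the same idea is shown below. It is not a literal transcription: with integer division the pseudocode above can stop one above the integer part (N = 8 yields 3), so this version uses the common formulation that starts from an upper bound and stops as soon as the guess no longer decreases. The name isqrt_newton is hypothetical.

```cpp
#include <cstdint>

// Integer Newton iteration returning floor(sqrt(n)) without floating point.
inline std::uint32_t isqrt_newton(std::uint32_t n)
{
    // 64-bit intermediates avoid overflow in x + n / x.
    std::uint64_t x = n;            // initial guess, never below sqrt(n)
    std::uint64_t y = (x + 1) / 2;  // first refinement

    while (y < x) {                 // guesses decrease until the floor is reached
        x = y;
        y = (x + n / x) / 2;
    }
    return static_cast<std::uint32_t>(x);
}
```

For n = 0 the loop is never entered, so there is no division by zero.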
Answer by flybot1
Why does nobody suggest the quickest method?
If:
- the range of numbers is limited
- memory consumption is not crucial
- application launch time is not critical
then create an int[MAX_X] array filled (on launch) with sqrt(x) (you don't need to use the function sqrt() for it).
All these conditions fit my program quite well. In particular, an int[10000000] array is going to consume 40MB.
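One way to fill such a table without calling sqrt() is to walk x upward and bump the root whenever the next perfect square is reached. This is a sketch only; the name build_isqrt_table and the use of std::vector instead of a raw int array are assumptions for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Builds a table where table[x] == floor(sqrt(x)) for 0 <= x < max_x.
std::vector<std::uint32_t> build_isqrt_table(std::size_t max_x)
{
    std::vector<std::uint32_t> table(max_x);
    std::uint64_t root = 0;

    for (std::size_t x = 0; x < max_x; ++x) {
        // Consecutive perfect squares are at least 1 apart, so the root
        // can grow by at most one per step.
        if ((root + 1) * (root + 1) <= x) ++root;
        table[x] = static_cast<std::uint32_t>(root);
    }
    return table;
}
```

A lookup is then a single indexed read, for example table[x], at the cost of the memory and startup time mentioned above.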
What are your thoughts on this?
Answer by MCCCS
This is so short that it 99% inlines:
static inline int sqrtn(int num) {
    int i;
    __asm__ (
        "pxor %%xmm0, %%xmm0\n\t"   // clean xmm0 for cvtsi2ss
        "cvtsi2ss %1, %%xmm0\n\t"   // convert num to float, put it to xmm0
        "sqrtss %%xmm0, %%xmm0\n\t" // square root xmm0
        "cvttss2si %%xmm0, %0"      // float to int
        :"=r"(i):"r"(num):"%xmm0"); // i: result, num: input, xmm0: scratch register
    return i;
}
Why clean xmm0? From the documentation of cvtsi2ss:
The destination operand is an XMM register. The result is stored in the low doubleword of the destination operand, and the upper three doublewords are left unchanged.
GCC Intrinsic version (runs only on GCC):
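#include <xmmintrin.h>

int sqrtn2(int num) {
    __v4sf xmm0 = {0, 0, 0, 0};                // zeroed vector (GCC vector type)
    xmm0 = __builtin_ia32_cvtsi2ss(xmm0, num); // int to float in the low lane
    xmm0 = __builtin_ia32_sqrtss(xmm0);        // square root of the low lane
    return __builtin_ia32_cvttss2si(xmm0);     // truncate to int
}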
Intel Intrinsic version (tested on GCC, Clang, ICC):
#include <xmmintrin.h>

int sqrtn2(int num) {
    __m128 xmm0 = _mm_setzero_ps();  // zeroed vector
    xmm0 = _mm_cvt_si2ss(xmm0, num); // int to float in the low lane
    xmm0 = _mm_sqrt_ss(xmm0);        // square root of the low lane
    return _mm_cvtt_ss2si(xmm0);     // truncate to int
}
All of the above require only SSE 1 (not even SSE 2).
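One caveat with the single-precision versions: float has a 24-bit significand, so inputs above 2^24 may be rounded before the square root is taken. For example, 24999999 converts to 25000000.0f, which gives 5000 instead of 4999. A double-precision variant avoids this for any int input. The sketch below assumes SSE2 is available (it is part of the x86-64 baseline); the name sqrtn_sd is hypothetical.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics

// Hypothetical double-precision counterpart of the SSE versions above.
inline int sqrtn_sd(int num)
{
    __m128d x = _mm_setzero_pd();  // zeroed vector
    x = _mm_cvtsi32_sd(x, num);    // int to double in the low lane (exact)
    x = _mm_sqrt_sd(x, x);         // square root of the low lane
    return _mm_cvttsd_si32(x);     // truncate to int
}
```

Negative inputs are not handled here, just as in the single-precision versions.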
Answer by Andrey
In many cases, an exact integer sqrt value is not even needed; a good approximation of it is enough. (For example, this often happens in DSP optimization, when a 32-bit signal should be compressed to 16 bits, or a 16-bit signal to 8 bits, without losing much precision around zero.)
I've found this useful equation:
k = ceil(MSB(n)/2); - MSB(n) is the position of the most significant bit of "n"
sqrt(n) ~= 2^(k-2) + (2^(k-1))*n/(2^(2*k)); - all multiplications and divisions here are very DSP-friendly, as they are only by powers of 2.
This equation generates a smooth curve (n, sqrt(n)); its values do not differ much from the real sqrt(n), so it can be useful when approximate accuracy is enough.
Answer by DanielW
On my computer with gcc, with -ffast-math, converting a 32-bit integer to float and using sqrtf takes 1.2 s per 10^9 ops (without -ffast-math it takes 3.54 s).
The following algorithm uses 0.87 s per 10^9 at the expense of some accuracy: errors can be as much as -7 or +1 although the RMS error is only 0.79:
#include <stdint.h>

uint16_t SQRTTAB[65536]; // filled by maketable()

inline uint16_t approxsqrt(uint32_t x) {
    const uint32_t m1 = 0xff000000;
    const uint32_t m2 = 0x00ff0000;
    if (x&m1) {
        return SQRTTAB[x>>16];
    } else if (x&m2) {
        return SQRTTAB[x>>8]>>4;
    } else {
        return SQRTTAB[x]>>8;
    }
}
The table is constructed using:
#include <cmath>

void maketable() {
    for (int x=0; x<65536; x++) {
        double v = x/65535.0;
        v = sqrt(v);
        int y = int(v*65535.0+0.999);
        SQRTTAB[x] = y;
    }
}
I found that refining the bisection using further if statements does improve accuracy, but it also slows things down to the point that sqrtf is faster, at least with -ffast-math.
Answer by Benoit Thiery
If you need performance when computing square roots, I guess you will compute a lot of them. Then why not cache the answers? I don't know the range of N in your case, nor whether you will compute the square root of the same integer many times, but if so, you can cache the result each time your method is called (an array would be the most efficient, if not too large).
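A minimal sketch of such a cache, assuming inputs are 32-bit unsigned integers and that only values below some chosen limit are worth storing. The class name CachedIsqrt, the sentinel value, and the fallback to std::sqrt are all illustrative choices, not part of the answer.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Marks cache slots that have not been computed yet. No 32-bit input has
// a root this large, so it cannot collide with a real result.
const std::uint32_t kUnset = 0xFFFFFFFFu;

class CachedIsqrt
{
public:
    explicit CachedIsqrt(std::size_t limit) : cache_(limit, kUnset) {}

    std::uint32_t operator()(std::uint32_t n)
    {
        if (n >= cache_.size()) return compute(n);        // outside cached range
        if (cache_[n] == kUnset) cache_[n] = compute(n);  // first request
        return cache_[n];
    }

private:
    static std::uint32_t compute(std::uint32_t n)
    {
        return static_cast<std::uint32_t>(std::sqrt(static_cast<double>(n)));
    }

    std::vector<std::uint32_t> cache_;
};
```

Whether this pays off depends on how often the same inputs repeat: a cache miss costs a memory access on top of the computation, which may be slower than just calling sqrt.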