C++: How to speed up floating-point to integer number conversion?

Note: the content below is taken from a StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use or share it, but you must attribute it to the original authors (not this site). Original question: http://stackoverflow.com/questions/429632/

How to speed up floating-point to integer number conversion?

c++, c, performance, optimization, floating-point

Asked by Serge

We're doing a great deal of floating-point to integer conversion in our project. Basically, something like this:

for(int i = 0; i < HUGE_NUMBER; i++)
     int_array[i] = float_array[i];

The default C function which performs the conversion turns out to be quite time consuming.

Is there any workaround (maybe a hand-tuned function) which can speed up the process a little bit? We don't care much about precision.

Accepted answer by Larry Gritz

Most of the other answers here just try to eliminate loop overhead.

Only deft_code's answer gets to the heart of what is likely the real problem -- that converting floating point to integers is shockingly expensive on an x86 processor. deft_code's solution is correct, though he gives no citation or explanation.

Here is the source of the trick, with some explanation and also versions specific to whether you want to round up, down, or toward zero: Know your FPU

Sorry to provide a link, but really anything written here, short of reproducing that excellent article, is not going to make things clear.

Answer by deft_code

inline int float2int( double d )
{
   union Cast
   {
      double d;
      long l;
   };
   volatile Cast c;
   // 6755399441055744.0 is 2^52 + 2^51: adding it forces the integer part of d
   // into the low bits of the double's mantissa, so the low 32 bits of the bit
   // pattern hold the rounded result (little-endian x86, default
   // round-to-nearest mode assumed).
   c.d = d + 6755399441055744.0;
   return c.l;
}

// this is the same thing but it's
// not always optimizer safe
inline int float2int( double d )
{
   d += 6755399441055744.0;
   return reinterpret_cast<int&>(d);
}

for(int i = 0; i < HUGE_NUMBER; i++)
     int_array[i] = float2int(float_array[i]);

The double parameter is not a mistake! There is a way to do this trick with floats directly, but it gets ugly trying to cover all the corner cases. In its current form this function rounds the float to the nearest whole number; if you want truncation instead, use 6755399441055743.5 (0.5 less).

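As a quick usage check (not part of the original answer; it assumes one of the float2int definitions above, a little-endian x86 target, and the default round-to-nearest mode):

#include <cstdio>

int main()
{
   std::printf("%d\n", float2int(1.25));    // prints 1
   std::printf("%d\n", float2int(2.75));    // prints 3 (rounds to nearest)
   std::printf("%d\n", float2int(-2.75));   // prints -3
   return 0;
}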

Answer by Crashworks

I ran some tests on different ways of doing float-to-int conversion. The short answer is to assume your customer has SSE2-capable CPUs and set the /arch:SSE2 compiler flag. This will allow the compiler to use the SSE scalar instructions, which are twice as fast as even the magic-number technique.

Otherwise, if you have long strings of floats to grind, use the SSE2 packed ops.

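For reference, the scalar SSE conversion that /arch:SSE2 enables can also be invoked explicitly through an intrinsic; a minimal sketch (the wrapper name is just an assumption, not from the answer):

#include <xmmintrin.h>   // SSE intrinsics

inline int float_to_int_sse(float f)
{
   // cvttss2si: truncating scalar float-to-int conversion
   return _mm_cvtt_ss2si(_mm_set_ss(f));
}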

Answer by Matt Schmidt

There's an FISTTP instruction in the SSE3 instruction set which does what you want, but as to whether or not it could be utilized and produce faster results than libc, I have no idea.

Answer by Martin York

Is the time large enough that it outweighs the cost of starting a couple of threads?

Assuming you have a multi-core processor or multiple processors on your box that you could take advantage of, this would be a trivial task to parallelize across multiple threads.

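As a rough illustration (not from the original answer), a minimal C++11 sketch that splits the conversion loop across hardware threads; the function name and chunking scheme are assumptions:

#include <thread>
#include <vector>

void convert_parallel(const float* float_array, int* int_array, int n)
{
   unsigned num_threads = std::thread::hardware_concurrency();
   if (num_threads == 0) num_threads = 2;   // fall back if the count is unknown

   std::vector<std::thread> workers;
   int chunk = n / static_cast<int>(num_threads);
   for (unsigned t = 0; t < num_threads; ++t)
   {
      int begin = static_cast<int>(t) * chunk;
      int end = (t + 1 == num_threads) ? n : begin + chunk;
      workers.emplace_back([=]() {
         // each thread converts its own contiguous slice of the array
         for (int i = begin; i < end; ++i)
            int_array[i] = static_cast<int>(float_array[i]);
      });
   }
   for (std::thread& w : workers)
      w.join();
}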

Answer by Crashworks

The key is to avoid the _ftol() function, which is needlessly slow. Your best bet for long lists of data like this is to use the SSE2 packed conversion instructions. cvtpd2dq converts two packed doubles to two packed int32s (left in the low half of the destination register); do this twice, across two SSE registers, and you can shuffle the results together to get four int32s in one register. If your source data are single-precision floats, cvtps2dq is even simpler: it converts four packed floats to four packed int32s in a single instruction. You don't need assembly to do this; MSVC exposes compiler intrinsics for the relevant instructions -- _mm_cvtpd_epi32() and _mm_cvtps_epi32(), if my memory serves me correctly.

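As a rough sketch of that two-at-a-time-and-shuffle idea (the helper name is illustrative, not from the original answer; it uses the truncating variant and assumes 16-byte aligned input):

#include <emmintrin.h>   // SSE2 intrinsics

// Convert four doubles to four int32s: two at a time, then combine.
inline __m128i convert4_doubles(const double* src)
{
   __m128d lo = _mm_load_pd(src);         // doubles 0,1 (16-byte aligned)
   __m128d hi = _mm_load_pd(src + 2);     // doubles 2,3
   __m128i ilo = _mm_cvttpd_epi32(lo);    // int32s in lanes 0,1; lanes 2,3 zeroed
   __m128i ihi = _mm_cvttpd_epi32(hi);
   return _mm_unpacklo_epi64(ilo, ihi);   // shuffle together: {i0, i1, i2, i3}
}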

If you do this it is very important that your float and int arrays be 16-byte aligned so that the SSE2 load/store intrinsics can work at maximum efficiency. Also, I recommend you software-pipeline a little and process sixteen floats at once in each loop iteration, e.g. (using the load, convert, and store compiler intrinsics directly):

#include <emmintrin.h>   // SSE/SSE2 intrinsics

for(int i = 0; i < HUGE_NUMBER; i += 16)
{
   // int_array[i] = float_array[i]; -- done four floats at a time below
   __m128 a = _mm_load_ps(float_array + i + 0);   // loads require 16-byte alignment
   __m128 b = _mm_load_ps(float_array + i + 4);
   __m128 c = _mm_load_ps(float_array + i + 8);
   __m128 d = _mm_load_ps(float_array + i + 12);
   __m128i ia = _mm_cvttps_epi32(a);   // cvttps2dq: truncating convert, four ints per instruction
   __m128i ib = _mm_cvttps_epi32(b);
   __m128i ic = _mm_cvttps_epi32(c);
   __m128i id = _mm_cvttps_epi32(d);
   _mm_store_si128((__m128i*)(int_array + i + 0),  ia);
   _mm_store_si128((__m128i*)(int_array + i + 4),  ib);
   _mm_store_si128((__m128i*)(int_array + i + 8),  ic);
   _mm_store_si128((__m128i*)(int_array + i + 12), id);
}

The reason for this is that the SSE instructions have a long latency, so if you follow a load into xmm0 immediately with a dependent operation on xmm0 then you will have a stall. Having multiple registers "in flight" at once hides the latency a little. (Theoretically a magic all-knowing compiler could alias its way around this problem but in practice it doesn't.)

Failing this SSE juju, you can supply the /QIfist option to MSVC, which will cause it to issue the single opcode fist instead of a call to _ftol; this means it will simply use whichever rounding mode happens to be set in the CPU without making sure it is ANSI C's specific truncate op. The Microsoft docs say /QIfist is deprecated because their floating-point code is fast now, but a disassembler will show you that this is unjustifiably optimistic. Even /fp:fast simply results in a call to _ftol_sse2, which, though faster than the egregious _ftol, is still a function call followed by a latent SSE op, and thus unnecessarily slow.

I'm assuming you're on x86 arch, by the way -- if you're on PPC there are equivalent VMX operations, or you can use the magic-number-multiply trick mentioned above followed by a vsel (to mask out the non-mantissa bits) and an aligned store.

Answer by Mark Ransom

See this Intel article for speeding up integer conversions:

http://software.intel.com/en-us/articles/latency-of-floating-point-to-integer-conversions/

According to Microsoft, the /QIfist compiler option is deprecated in VS 2005 because integer conversion has been sped up. They neglect to say how it has been sped up, but looking at the disassembly listing might give a clue.

http://msdn.microsoft.com/en-us/library/z8dh4h17(vs.80).aspx

Answer by starmole

Most C compilers generate a call to _ftol or something similar for every float-to-int conversion. Using a reduced floating-point conformance switch (like /fp:fast) might help -- IF you understand AND accept the other effects of this switch. Other than that, put the conversion in a tight assembly or SSE intrinsic loop, IF you are OK with AND understand the different rounding behavior. For large loops like your example, you should write a function that sets up the floating-point control word once, does the bulk rounding with only fistp instructions, and then resets the control word -- IF you are OK with an x86-only code path, but at least you will not change the rounding. Read up on the fld and fistp FPU instructions and the FPU control word.

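A minimal C++11 sketch of that set-the-rounding-mode-once idea, using the portable <cfenv>/<cmath> equivalents rather than raw fld/fistp (the helper name and the use of std::lrint are assumptions, not from the original answer):

#include <cfenv>   // fegetround, fesetround
#include <cmath>   // std::lrint rounds using the current rounding mode

void bulk_truncate(const float* in, int* out, int n)
{
   const int old_mode = std::fegetround();
   std::fesetround(FE_TOWARDZERO);   // set the rounding mode once, not per element
   for (int i = 0; i < n; ++i)
      out[i] = static_cast<int>(std::lrint(in[i]));   // typically a single fistp / cvtss2si
   std::fesetround(old_mode);        // restore the previous mode
}

(Strictly speaking, the standard wants FENV_ACCESS enabled when you change rounding modes, but compiler support for that pragma is spotty, so treat this as a sketch.)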

Answer by FryGuy

You might be able to load all of the floats into the SSE unit of your processor using some magic assembly code, then do the equivalent code to convert the values to ints, then read them back out. I'm not sure this would be any faster, though. I'm not an SSE guru, so I don't know how to do this. Maybe someone else can chime in.

Answer by FryGuy

In Visual C++ 2008, the compiler generates SSE2 instructions by itself if you do a release build with maxed-out optimization options; look at the disassembly to confirm (though some conditions have to be met, so play around with your code).
