Disclaimer: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must follow the same CC BY-SA terms, link to the original question, and attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/7269946/

Date: 2020-08-28 16:43:14  Source: igfitidea

Cortex A9 NEON vs VFP usage confusion

Tags: c++, c, floating-point, arm, neon

Asked by celavek

I'm trying to build a library for a Cortex A9 ARM processor (an OMAP4, to be more specific) and I'm a little confused about which/when to use NEON vs. VFP in the context of floating-point operations and SIMD. To be noted that I know the difference between the two hardware coprocessor units (as also outlined here on SO); I just have some misunderstanding regarding their proper usage.

Related to this I'm using the following compilation flags:

GCC
-O3 -mcpu=cortex-a9 -mfpu=neon -mfloat-abi=softfp
-O3 -mcpu=cortex-a9 -mfpu=vfpv3 -mfloat-abi=softfp
ARMCC
--cpu=Cortex-A9 --apcs=/softfp
--cpu=Cortex-A9 --fpu=VFPv3 --apcs=/softfp

I've read through the ARM documentation, a lot of wikis (like this one), forum and blog posts, and everybody seems to agree that using NEON is better than using VFP, or at least that mixing NEON (e.g. using the intrinsics to implement some algorithms in SIMD) and VFP is not such a good idea; I'm not 100% sure yet whether this applies in the context of the entire application/library or just to specific places (functions) in code.

So I'm using NEON as the FPU for my application, as I also want to use the intrinsics. As a result I'm in a little bit of trouble, and my confusion about how to best use these features (NEON vs. VFP) on the Cortex A9 just deepens further instead of clearing up. I have some code that does benchmarking for my app and uses some custom-made timer classes in which calculations are based on double-precision floating point. Using NEON as the FPU gives completely inappropriate results (trying to print those values results in printing mostly inf and NaN; the same code works without a hitch when built for x86). So I changed my calculations to use single-precision floating point, as it is documented that NEON does not handle double-precision floating point. My benchmarks still don't give the proper results (and what's worse is that now they don't work anymore on x86; I think it's because of the loss in precision, but I'm not sure). So I'm almost completely lost: on one hand I want to use NEON for the SIMD capabilities, but using it as the FPU does not provide the proper results; on the other hand, mixing it with the VFP does not seem like a very good idea. Any advice in this area will be greatly appreciated!


I found in the article in the above-mentioned wiki a summary of what should be done for floating-point optimization in the context of NEON:

"

  • Only use single precision floating point
  • Use NEON intrinsics / ASM whenever you find a bottlenecking FP function. You can do better than the compiler.
  • Minimize Conditional Branches
  • Enable RunFast mode

For softfp:

  • Inline floating point code (unless it's very large)
  • Pass FP arguments via pointers instead of by value and do integer work in between function calls.

"

I cannot use hard for the float ABI as I cannot link with the libraries I have available. Most of the recommendations make sense to me (except for "RunFast mode", where I don't understand exactly what it's supposed to do, and the claim that at this moment in time I could do better than the compiler), but I keep getting inconsistent results and I'm not sure of anything right now.


Could anyone shed some light on how to properly use floating point and NEON on the Cortex A9/A8, and which compilation flags I should use?

Answer by unixsmurf

I think this question should be split up into several, adding some code examples and detailing target platform and versions of toolchains used.

But to cover one part of the confusion: the recommendation to "use NEON as the FPU" sounds like a misunderstanding. NEON is a SIMD engine; the VFP is an FPU. You can use NEON for single-precision floating-point operations on up to 4 single-precision values in parallel, which (when possible) is good for performance.

-mfpu=neon can be seen as shorthand for -mfpu=neon-vfpv3.

See http://gcc.gnu.org/onlinedocs/gcc/ARM-Options.html for more information.

Answer by jww

... forum and blog posts and everybody seems to agree that using NEON is better than using VFP or at least mixing NEON (e.g. using the intrinsics to implement some algorithms in SIMD) and VFP is not such a good idea

I'm not sure this is correct. According to ARM in the Introducing NEON Development Article | NEON registers section:

The NEON register bank consists of 32 64-bit registers. If both Advanced SIMD and VFPv3 are implemented, they share this register bank. In this case, VFPv3 is implemented in the VFPv3-D32 form that supports 32 double-precision floating-point registers. This integration simplifies implementing context switching support, because the same routines that save and restore VFP context also save and restore NEON context.

The NEON unit can view the same register bank as:

  • sixteen 128-bit quadword registers, Q0-Q15
  • thirty-two 64-bit doubleword registers, D0-D31.

The NEON D0-D31 registers are the same as the VFPv3 D0-D31 registers and each of the Q0-Q15 registers map onto a pair of D registers. Figure 1.3 shows the different views of the shared NEON and VFP register bank. All of these views are accessible at any time. Software does not have to explicitly switch between them, because the instruction used determines the appropriate view.

The registers don't compete; rather, they co-exist as views of the register bank. There's no way to disgorge the NEON and FPU gear.




Related to this I'm using the following compilation flags:

-O3 -mcpu=cortex-a9 -mfpu=neon -mfloat-abi=softfp
-O3 -mcpu=cortex-a9 -mfpu=vfpv3 -mfloat-abi=softfp

Here's what I do; your mileage may vary. It's derived from a mashup of information gathered from the platform and compiler.

gnueabihf tells me the platform uses hard floats, which can speed up procedure calls. If in doubt, use softfp, because it's compatible with hard floats.

BeagleBone Black:

$ gcc -v 2>&1 | grep Target          
Target: arm-linux-gnueabihf

$ cat /proc/cpuinfo
model name  : ARMv7 Processor rev 2 (v7l)
Features    : half thumb fastmult vfp edsp thumbee neon vfpv3 tls vfpd32 
...

So the BeagleBone uses:

-march=armv7-a -mtune=cortex-a8 -mfpu=neon -mfloat-abi=hard

CubieTruck v5:

$ gcc -v 2>&1 | grep Target 
Target: arm-linux-gnueabihf

$ cat /proc/cpuinfo
Processor   : ARMv7 Processor rev 5 (v7l)
Features    : swp half thumb fastmult vfp edsp thumbee neon vfpv3 tls vfpv4 

So the CubieTruck uses:

-march=armv7-a -mtune=cortex-a7 -mfpu=neon-vfpv4 -mfloat-abi=hard

Banana Pi Pro:

$ gcc -v 2>&1 | grep Target 
Target: arm-linux-gnueabihf

$ cat /proc/cpuinfo
Processor   : ARMv7 Processor rev 4 (v7l)
Features    : swp half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt

So the Banana Pi uses:

-march=armv7-a -mtune=cortex-a7 -mfpu=neon-vfpv4 -mfloat-abi=hard

Raspberry Pi 3:

The RPI3 is unique in that it's ARMv8, but it's running a 32-bit OS. That means it's effectively 32-bit ARM, or Aarch32. There's a little more to 32-bit ARM vs. Aarch32, but this will show you the Aarch32 flags.

Also, the RPI3 uses a Broadcom A53 SoC, and it has NEON and the optional CRC32 instructions, but lacks the optional Crypto extensions.

$ gcc -v 2>&1 | grep Target 
Target: arm-linux-gnueabihf

$ cat /proc/cpuinfo 
model name  : ARMv7 Processor rev 4 (v7l)
Features    : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32 lpae evtstrm crc32
...

So the Raspberry Pi can use:

-march=armv8-a+crc -mtune=cortex-a53 -mfpu=neon-fp-armv8 -mfloat-abi=hard

Or it can use (I don't know what to use for -mtune):

-march=armv7-a -mfpu=neon-vfpv4 -mfloat-abi=hard 

ODROID C2:

The ODROID C2 uses an Amlogic A53 SoC, but it runs a 64-bit OS. It has NEON and the optional CRC32 instructions, but lacks the optional Crypto extensions (a similar configuration to the RPI3).

$ gcc -v 2>&1 | grep Target 
Target: aarch64-linux-gnu

$ cat /proc/cpuinfo 
Features    : fp asimd evtstrm crc32

So the ODROID uses:

-march=armv8-a+crc -mtune=cortex-a53


In the above recipes, I identified the ARM processor (like Cortex-A9 or A53) by inspecting data sheets. According to this answer on Unix and Linux Stack Exchange, which deciphers the output of /proc/cpuinfo:

CPU part: Part number. 0xd03 indicates Cortex-A53 processor.

So we may be able to look up the value from a database. I don't know if one exists or where it's located.


Answer by Jake 'Alquimista' LEE

I'd stay away from VFP. It's just like Thumb mode: it's meant for compilers. There's no point in optimizing for it.

It might sound rude, but I really don't see any point in NEON intrinsics either. They're more trouble than help, if any.

Just invest two or three days in basic ARM assembly: you only need to learn a few instructions for loop control/termination.

Then you can start writing native NEON code without worrying about the compiler doing something astral and spitting out tons of errors/warnings.

Learning the NEON instructions is less demanding than all those intrinsic macros. And above all, the results are so much better.

Fully optimized native NEON code usually runs more than twice as fast as its well-written intrinsics counterpart.

Just compare the OP's version with mine in the link below, and you'll know what I mean.

Optimizing RGBA8888 to RGB565 conversion with NEON


regards
