Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license, cite the original address and author information, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/1922249/

Time: 2020-08-27 21:36:01. Source: igfitidea

C++ cache aware programming

Tags: c++, optimization, caching, cpu-cache

Asked by Mat

Is there a way in C++ to determine the CPU's cache size? I have an algorithm that processes a lot of data, and I'd like to break this data down into chunks that fit into the cache. Is this possible? Can you give me any other hints on programming with cache size in mind (especially with regard to multithreaded/multicore data processing)?

Thanks!

Answer by Robert S. Barnes

According to "What Every Programmer Should Know About Memory" by Ulrich Drepper, you can do the following on Linux:

Once we have a formula for the memory requirement we can compare it with the cache size. As mentioned before, the cache might be shared with multiple other cores. Currently {There definitely will sometime soon be a better way!} the only way to get correct information without hardcoding knowledge is through the /sys filesystem. In Table 5.2 we have seen the what the kernel publishes about the hardware. A program has to find the directory:

/sys/devices/system/cpu/cpu*/cache

This is listed in Section 6: What Programmers Can Do.
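
As a minimal sketch of reading that directory, the code below walks `/sys/devices/system/cpu/cpu0/cache/index*/` and collects the `level`, `type`, `size`, and `coherency_line_size` files the kernel publishes there. The names `SysfsCache`, `read_token`, `parse_size`, and `read_sysfs_caches` are hypothetical helpers, not part of any library, and the sketch assumes that looking at cpu0 alone is enough for your purposes.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

struct SysfsCache {
    int level = 0;               // 1, 2, 3, ...
    std::string type;            // "Data", "Instruction" or "Unified"
    std::size_t size_bytes = 0;  // 0 if the kernel did not report a size
    std::size_t line_bytes = 0;  // 0 if not reported
};

// Reads the first whitespace-delimited token of a file; empty on failure.
inline std::string read_token(const std::string& path) {
    std::ifstream in(path);
    std::string token;
    in >> token;
    return token;
}

// Converts strings such as "64", "32K" or "8M" to a number ("" gives 0).
inline std::size_t parse_size(const std::string& text) {
    if (text.empty() || text[0] < '0' || text[0] > '9') return 0;
    std::size_t pos = 0;
    std::size_t value = std::stoul(text, &pos);
    if (pos < text.size()) {
        if (text[pos] == 'K') value *= 1024;
        else if (text[pos] == 'M') value *= 1024 * 1024;
    }
    return value;
}

// Enumerates the caches of one CPU. Returns an empty vector when the
// sysfs cache directory is missing (non-Linux systems, some containers).
inline std::vector<SysfsCache> read_sysfs_caches(int cpu = 0) {
    std::vector<SysfsCache> caches;
    const std::string base =
        "/sys/devices/system/cpu/cpu" + std::to_string(cpu) + "/cache/index";
    for (int i = 0;; ++i) {
        const std::string dir = base + std::to_string(i) + "/";
        const std::string level = read_token(dir + "level");
        if (level.empty()) break;  // no further index directories
        SysfsCache c;
        c.level = static_cast<int>(parse_size(level));
        c.type = read_token(dir + "type");
        c.size_bytes = parse_size(read_token(dir + "size"));
        c.line_bytes = parse_size(read_token(dir + "coherency_line_size"));
        caches.push_back(c);
    }
    return caches;
}
```

Calling `read_sysfs_caches()` and picking the entry with `level == 1` and `type == "Data"` gives the L1D size to compare against the memory requirement.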

He also describes a short test right under Figure 6.5 which can be used to determine L1D cache size if you can't get it from the OS.

There is one more thing I ran across in his paper: sysconf(_SC_LEVEL2_CACHE_SIZE) is a library call on Linux that is supposed to return the L2 cache size, although it doesn't seem to be well documented.
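
A minimal sketch of that call follows. The `_SC_LEVEL*` names are glibc extensions, so each use is guarded with `#ifdef`; the wrapper names (`query_cache`, `l1d_cache_size`, and so on) are hypothetical. The call may report 0 or -1 on systems where the value is unknown, so the wrappers normalize both to -1.

```cpp
#include <cassert>
#include <unistd.h>

// Returns the queried value in bytes, or -1 when the system does not know it.
inline long query_cache(int name) {
    const long value = sysconf(name);
    return value > 0 ? value : -1;
}

inline long l1d_cache_size() {
#ifdef _SC_LEVEL1_DCACHE_SIZE
    return query_cache(_SC_LEVEL1_DCACHE_SIZE);
#else
    return -1;  // constant not provided by this C library
#endif
}

inline long l1d_line_size() {
#ifdef _SC_LEVEL1_DCACHE_LINESIZE
    return query_cache(_SC_LEVEL1_DCACHE_LINESIZE);
#else
    return -1;
#endif
}

inline long l2_cache_size() {
#ifdef _SC_LEVEL2_CACHE_SIZE
    return query_cache(_SC_LEVEL2_CACHE_SIZE);
#else
    return -1;
#endif
}
```

Treat -1 as "fall back to a conservative default" rather than as an error.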

Answer by kusma

C++ itself doesn't "care" about CPU caches, so the language has no built-in support for querying cache sizes. If you are developing for Windows, the GetLogicalProcessorInformation() function can be used to query information about the CPU caches.
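
A minimal sketch of that query is shown below. The struct `CacheLevel` and the function `query_windows_caches` are hypothetical names; the Windows-specific part is guarded with `_WIN32`, so on other platforms the function simply returns an empty vector.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>
#ifdef _WIN32
#include <windows.h>
#endif

struct CacheLevel {
    int level;               // 1, 2, 3
    std::size_t size_bytes;  // total cache size
    std::size_t line_bytes;  // cache line size
};

// Lists data and unified caches. The same cache can appear several times,
// once per group of logical processors that shares it.
inline std::vector<CacheLevel> query_windows_caches() {
    std::vector<CacheLevel> out;
#ifdef _WIN32
    DWORD bytes = 0;
    GetLogicalProcessorInformation(nullptr, &bytes);  // reports required size
    if (bytes == 0) return out;
    std::vector<SYSTEM_LOGICAL_PROCESSOR_INFORMATION> buffer(
        bytes / sizeof(SYSTEM_LOGICAL_PROCESSOR_INFORMATION));
    if (!GetLogicalProcessorInformation(buffer.data(), &bytes)) return out;
    for (const auto& entry : buffer) {
        if (entry.Relationship != RelationCache) continue;
        if (entry.Cache.Type == CacheInstruction) continue;
        out.push_back({static_cast<int>(entry.Cache.Level),
                       static_cast<std::size_t>(entry.Cache.Size),
                       static_cast<std::size_t>(entry.Cache.LineSize)});
    }
#endif
    return out;
}
```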

Answer by ben

Preallocate a large array. Then access each element sequentially and record the time for each access. Ideally, there will be a jump in access time when a cache miss occurs, and from the position of that jump you can calculate the size of your L1 cache. It might not work, but it is worth trying.
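
The idea can be sketched as follows, with one deliberate change: instead of timing a purely sequential scan, the sketch chases pointers through a random cycle, because hardware prefetchers tend to hide misses on sequential access. The function name `ns_per_access` and the default step count are assumptions for illustration, and the results are noisy, so treat any detected jump as a hint only.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

// Average nanoseconds per dependent load over a working set of `bytes` bytes.
inline double ns_per_access(std::size_t bytes, std::size_t steps = 1u << 22) {
    const std::size_t n = bytes / sizeof(std::size_t);
    assert(n >= 2 && steps > 0);

    // Build one random cycle through all slots (Sattolo's algorithm), so
    // every load depends on the previous one and the order is unpredictable.
    std::vector<std::size_t> next(n);
    std::iota(next.begin(), next.end(), std::size_t{0});
    std::mt19937_64 rng(42);
    for (std::size_t i = n - 1; i > 0; --i) {
        std::uniform_int_distribution<std::size_t> pick(0, i - 1);
        std::swap(next[i], next[pick(rng)]);
    }

    std::size_t index = 0;
    const auto start = std::chrono::steady_clock::now();
    for (std::size_t s = 0; s < steps; ++s) index = next[index];
    const auto stop = std::chrono::steady_clock::now();

    volatile std::size_t sink = index;  // keeps the loop from being removed
    (void)sink;
    return std::chrono::duration<double, std::nano>(stop - start).count() /
           static_cast<double>(steps);
}
```

Calling it for working sets of 4 KiB, 8 KiB, 16 KiB and so on, and looking for the first size where the time per access rises sharply, gives an estimate of the L1 data cache size.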

Answer by Clark Gaebel

Interestingly enough, I wrote a program to do this a while ago (in C, but I'm sure it will be easy to incorporate into C++ code).

http://github.com/wowus/CacheLineDetection/blob/master/Cache%20Line%20Detection/cache.c

The get_cache_line function is the interesting one: it returns the location right before the biggest spike in the timing data of array accesses. It guessed correctly on my machine! If nothing else, it can help you make your own.

It's based on this article, which originally piqued my interest: http://igoro.com/archive/gallery-of-processor-cache-effects/

Answer by Daniel Munoz

You can see this thread: http://software.intel.com/en-us/forums/topic/296674

The short answer is in this other thread:

On modern IA-32 hardware, the cache line size is 64. The value 128 is a legacy of the Intel Netburst Microarchitecture (e.g. Intel Pentium D) where 64-byte lines are paired into 128-byte sectors. When a line in a sector is fetched, the hardware automatically fetches the other line in the sector too. So from a false sharing perspective, the effective line size is 128 bytes on the Netburst processors. (http://software.intel.com/en-us/forums/topic/292721)
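
To illustrate what the line size means for false sharing, here is a minimal sketch that pads per-thread counters so each one occupies its own line. The constant `kLineSize`, the type `PaddedCounter`, and the helper `on_different_lines` are hypothetical names; 64 is an assumption that, per the quote above, would need to be 128 on NetBurst-style processors.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kLineSize = 64;  // assumption; use 128 for paired sectors

// alignas rounds both the alignment and the size up to a full line, so
// neighbouring array elements never share a cache line.
struct alignas(kLineSize) PaddedCounter {
    std::atomic<std::uint64_t> value{0};
};
static_assert(sizeof(PaddedCounter) == kLineSize, "one counter per line");
static_assert(alignof(PaddedCounter) == kLineSize, "starts on a line boundary");

// True when the two addresses fall into different kLineSize-sized lines.
inline bool on_different_lines(const void* a, const void* b) {
    return reinterpret_cast<std::uintptr_t>(a) / kLineSize !=
           reinterpret_cast<std::uintptr_t>(b) / kLineSize;
}

// One slot per worker thread; each thread would update only its own slot.
inline PaddedCounter per_thread_counters[4];
```

Without the `alignas`, four 8-byte counters would sit in a single line, and every update from one core would invalidate that line in the other cores' caches.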

Answer by Stéphane Bonniez

Depending on what you're trying to do, you might also leave it to some library. Since you mention multicore processing, you might want to have a look at Intel Threading Building Blocks.

TBB includes cache-aware memory allocators. More specifically, check cache_aligned_allocator (in the reference documentation; I couldn't find a direct link).
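
If pulling in TBB is not an option, the general idea of a cache-aligned allocator can be sketched with standard C++17 only. This is a stand-in under stated assumptions, not TBB's implementation: `LineAlignedAllocator` is a hypothetical name, and the default alignment of 64 bytes is an assumed line size.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>
#include <vector>

// Minimal allocator whose allocations start on an Align-byte boundary.
template <class T, std::size_t Align = 64>
struct LineAlignedAllocator {
    static_assert(Align >= alignof(T), "alignment too small for T");
    using value_type = T;

    // Needed explicitly because Align is a non-type template parameter.
    template <class U>
    struct rebind { using other = LineAlignedAllocator<U, Align>; };

    LineAlignedAllocator() noexcept = default;
    template <class U>
    LineAlignedAllocator(const LineAlignedAllocator<U, Align>&) noexcept {}

    T* allocate(std::size_t n) {
        // C++17 aligned operator new; throws std::bad_alloc on failure.
        return static_cast<T*>(
            ::operator new(n * sizeof(T), std::align_val_t{Align}));
    }
    void deallocate(T* p, std::size_t) noexcept {
        ::operator delete(p, std::align_val_t{Align});
    }
};

template <class T, class U, std::size_t A>
bool operator==(const LineAlignedAllocator<T, A>&,
                const LineAlignedAllocator<U, A>&) noexcept { return true; }
template <class T, class U, std::size_t A>
bool operator!=(const LineAlignedAllocator<T, A>&,
                const LineAlignedAllocator<U, A>&) noexcept { return false; }
```

It plugs into standard containers, for example `std::vector<float, LineAlignedAllocator<float>> samples(1000);`, whose buffer then starts on a 64-byte boundary.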

Answer by Tobias Langner

Read the CPUID of the CPU (x86) and then determine the cache size from a lookup table. The table has to be filled with the cache sizes that the CPU manufacturer publishes in its programming manuals.
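
As a related sketch, and a different technique from the lookup table described above, Intel processors also expose CPUID leaf 4 ("deterministic cache parameters"), from which the size can be computed directly. The code assumes GCC or Clang on x86 (it uses `__get_cpuid_count` from `<cpuid.h>`) and returns an empty vector elsewhere, including on CPUs that do not report this leaf. `CpuidCache` and `cpuid_leaf4_caches` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>
#define SKETCH_HAS_CPUID 1
#endif

struct CpuidCache {
    unsigned level;          // 1, 2, 3
    unsigned type;           // 1 = data, 2 = instruction, 3 = unified
    std::size_t size_bytes;  // ways * partitions * line size * sets
    std::size_t line_bytes;
};

inline std::vector<CpuidCache> cpuid_leaf4_caches() {
    std::vector<CpuidCache> out;
#ifdef SKETCH_HAS_CPUID
    for (unsigned sub = 0; sub < 16; ++sub) {
        unsigned eax = 0, ebx = 0, ecx = 0, edx = 0;
        if (!__get_cpuid_count(4, sub, &eax, &ebx, &ecx, &edx)) break;
        const unsigned type = eax & 0x1f;  // 0 means "no more caches"
        if (type == 0) break;
        // Each field is stored as (value - 1).
        const std::size_t line = (ebx & 0xfff) + 1;
        const std::size_t partitions = ((ebx >> 12) & 0x3ff) + 1;
        const std::size_t ways = ((ebx >> 22) & 0x3ff) + 1;
        const std::size_t sets = static_cast<std::size_t>(ecx) + 1;
        out.push_back({(eax >> 5) & 0x7, type,
                       ways * partitions * line * sets, line});
    }
#endif
    return out;
}
```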

Answer by Max

IIRC, GCC has a __builtin_prefetch hint.

The GCC documentation at http://gcc.gnu.org/onlinedocs/gcc-3.3.6/gcc/Other-Builtins.html has an excellent section on this. Basically, it suggests:

__builtin_prefetch (&array[i + LookAhead], rw, locality);

where rw is 0 (prepare for a read) or 1 (prepare for a write), and locality is a number from 0 to 3, where 0 means no locality and 3 means very strong locality.

Both are optional. LookAhead is the number of elements to look ahead. If a memory access takes 100 cycles and the unrolled loop iterations are two cycles apart, LookAhead could be set to 50 or 51.
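
A minimal sketch of the hint inside a loop is shown below. The function name `sum_with_prefetch` and the look-ahead distance of 16 elements are assumptions for illustration; a plain sequential sum like this is usually handled well by the hardware prefetcher already, so the example shows the syntax rather than a guaranteed speed-up.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kLookAhead = 16;  // hypothetical; tune by measurement

inline long long sum_with_prefetch(const std::vector<int>& data) {
    long long total = 0;
    const std::size_t n = data.size();
    for (std::size_t i = 0; i < n; ++i) {
#if defined(__GNUC__)
        // rw = 0 (read), locality = 3 (keep in all cache levels).
        // The bounds check keeps the address expression itself valid.
        if (i + kLookAhead < n) {
            __builtin_prefetch(&data[i + kLookAhead], 0, 3);
        }
#endif
        total += data[i];
    }
    return total;
}
```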

Answer by Philipp Claßen

There are two cases that need to be distinguished. Do you need to know the cache sizes at compile time or at runtime?

Determining the cache-size at compile-time

For some applications, you know the exact architecture that your code will run on, for example, if you can compile the code directly on the host machine. In that case, simply looking up the size and hard-coding it is an option (this could be automated in the build system). On most machines today, the L1 cache line should be 64 bytes.

If you want to avoid that complexity or if you need to support compilation on unknown architectures, you can use the C++17 feature std::hardware_constructive_interference_size as a good fallback. It provides a compile-time estimate of the cache line size, but be aware of its limitations. Note that the compiler cannot guess perfectly when it creates the binary, as the size of the cache line is, in general, architecture dependent.
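
A minimal sketch of using that constant follows. Not every standard library ships it, so the sketch checks the feature-test macro `__cpp_lib_hardware_interference_size` and otherwise falls back to 64, which is an assumption. The names `kLineEstimate`, `KeyValue`, `AlignedKeyValue`, and `elements_per_line` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <new>

#ifdef __cpp_lib_hardware_interference_size
constexpr std::size_t kLineEstimate =
    std::hardware_constructive_interference_size;
#else
constexpr std::size_t kLineEstimate = 64;  // assumption when not provided
#endif

// Two fields that are always used together should share one line.
struct KeyValue {
    long key;
    long value;
};
static_assert(sizeof(KeyValue) <= kLineEstimate,
              "fields could not fit into a single cache line");

// Aligning to the line start guarantees the pair never straddles two lines.
struct alignas(kLineEstimate) AlignedKeyValue : KeyValue {};

// How many objects of type T fit into one estimated line (at least 1).
template <class T>
constexpr std::size_t elements_per_line() {
    return kLineEstimate / sizeof(T) > 0 ? kLineEstimate / sizeof(T) : 1;
}
```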

Determining the cache-size at runtime

At runtime, you have the advantage that you know the exact machine, but you will need platform-specific code to read the information from the OS. A good starting point is the code snippet from this answer, which supports the major platforms (Windows, Linux, macOS). In a similar fashion, you can also read the L2 cache size at runtime.

I would advise against trying to guess the cache line by running benchmarks at startup and measuring which one performed best. It might well work, but it is also error-prone if the CPU is used by other processes.

Combining both approaches

If you have to ship one binary and the machines that it will later run on feature a range of different architectures with varying cache sizes, you could create specialized code paths for each cache size and then dynamically (at application startup) choose the best-fitting one.
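
The combined approach can be sketched like this: each block size is a compile-time template argument, and a small function picks one instantiation at startup from the cache size detected at runtime. The kernel `blocked_sum`, the selector `choose_kernel`, and the thresholds are all hypothetical; a real kernel would do more work per block than a sum.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

using Kernel = long long (*)(const std::vector<int>&);

// Processes the data in chunks of BlockBytes, a compile-time constant the
// compiler can use when optimizing the inner loop.
template <std::size_t BlockBytes>
long long blocked_sum(const std::vector<int>& data) {
    constexpr std::size_t kBlock = BlockBytes / sizeof(int);
    long long total = 0;
    for (std::size_t begin = 0; begin < data.size(); begin += kBlock) {
        const std::size_t end = std::min(begin + kBlock, data.size());
        for (std::size_t i = begin; i < end; ++i) total += data[i];
    }
    return total;
}

// Picks a block of half the L1 data cache, leaving room for other data.
// An unknown size (0) falls through to the most conservative kernel.
inline Kernel choose_kernel(std::size_t l1d_bytes) {
    if (l1d_bytes >= 64 * 1024) return &blocked_sum<32 * 1024>;
    if (l1d_bytes >= 32 * 1024) return &blocked_sum<16 * 1024>;
    return &blocked_sum<8 * 1024>;
}
```

At startup the application would call `choose_kernel` once with the detected size, store the returned pointer, and use it for all subsequent processing.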

Answer by Charles Eli Cheese

The cache will usually do the right thing. The only real worry for a normal programmer is false sharing, and you can't take care of that at runtime because it requires compiler directives.
