
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/8469427/


How and when to align to cache line size?

c++, c, caching

Asked by Matt

In Dmitry Vyukov's excellent bounded MPMC queue written in C++ (see: http://www.1024cores.net/home/lock-free-algorithms/queues/bounded-mpmc-queue), he adds some padding variables. I presume this is to make it align to a cache line for performance.

I have some questions.


  1. Why is it done in this way?
  2. Is it a portable method that will always work?
  3. In what cases would it be best to use __attribute__ ((aligned (64))) instead?
  4. Why would padding before a buffer pointer help with performance? Isn't just the pointer loaded into the cache, so it's really only the size of a pointer?

    static size_t const     cacheline_size = 64;
    typedef char            cacheline_pad_t [cacheline_size];
    
    cacheline_pad_t         pad0_;
    cell_t* const           buffer_;
    size_t const            buffer_mask_;
    cacheline_pad_t         pad1_;
    std::atomic<size_t>     enqueue_pos_;
    cacheline_pad_t         pad2_;
    std::atomic<size_t>     dequeue_pos_;
    cacheline_pad_t         pad3_;
    

Would this concept work under GCC for C code?


Accepted answer by Phil Miller

It's done this way so that different cores modifying different fields won't have to bounce the cache line containing both of them between their caches. In general, for a processor to access some data in memory, the entire cache line containing it must be in that processor's local cache. If it's modifying that data, that cache entry usually must be the only copy in any cache in the system (Exclusive mode in the MESI/MOESI-style cache coherence protocols). When separate cores try to modify different data that happens to live on the same cache line, and thus waste time moving that whole line back and forth, that's known as false sharing.

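To make the effect concrete, here is a minimal, self-contained sketch (my own illustration, not Vyukov's code) that contrasts two counters sharing one cache line with two counters padded onto separate lines; with each thread hammering its own counter, the padded version typically runs markedly faster on multi-core hardware (link with -pthread under GCC):

    #include <atomic>
    #include <chrono>
    #include <cstddef>
    #include <cstdio>
    #include <thread>

    // Both atomics packed together: they land on the same 64-byte line.
    struct SharedLine {
      std::atomic<std::size_t> a{0};
      std::atomic<std::size_t> b{0};
    };

    // Each atomic padded onto its own 64-byte line.
    struct PaddedLines {
      alignas(64) std::atomic<std::size_t> a{0};
      alignas(64) std::atomic<std::size_t> b{0};
    };

    template <typename Counters>
    double seconds(Counters& c) {
      auto start = std::chrono::steady_clock::now();
      std::thread t1([&] { for (int i = 0; i < 10000000; ++i) c.a.fetch_add(1, std::memory_order_relaxed); });
      std::thread t2([&] { for (int i = 0; i < 10000000; ++i) c.b.fetch_add(1, std::memory_order_relaxed); });
      t1.join();
      t2.join();
      return std::chrono::duration<double>(std::chrono::steady_clock::now() - start).count();
    }

    int main() {
      SharedLine shared;
      PaddedLines padded;
      std::printf("same line: %.3f s\n", seconds(shared));
      std::printf("padded:    %.3f s\n", seconds(padded));
    }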

In the particular example you give, one core can be enqueueing an entry (reading (shared) buffer_ and writing (exclusive) only enqueue_pos_) while another dequeues (shared buffer_ and exclusive dequeue_pos_), without either core stalling on a cache line owned by the other.


The padding at the beginning means that buffer_ and buffer_mask_ end up on the same cache line, rather than split across two lines and thus requiring double the memory traffic to access.

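Such layout assumptions can be checked at compile time; a quick sketch (my own, using a simplified stand-in for the queue's fields):

    #include <cstddef>

    struct layout {
      char pad0_[64];
      void* buffer_;
      std::size_t buffer_mask_;
    };

    // On a typical 64-bit target, buffer_ sits at offset 64 and
    // buffer_mask_ at offset 72, i.e. within the same 64-byte line.
    static_assert(offsetof(layout, buffer_) / 64 == offsetof(layout, buffer_mask_) / 64,
                  "buffer_ and buffer_mask_ share a cache line");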

I'm unsure whether the technique is entirely portable. The assumption is that each cacheline_pad_t will itself be aligned to a 64-byte (its size) cache line boundary, and hence whatever follows it will be on the next cache line. So far as I know, the C and C++ language standards only require this of whole structures, so that they can live in arrays nicely, without violating alignment requirements of any of their members. (see comments)

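For what it's worth, C++11's alignas now expresses the same intent portably; a minimal sketch (not the queue's actual code, with cell_t standing in for the element type):

    #include <atomic>
    #include <cstddef>

    struct cell_t;  // element type, as in the question

    struct queue_fields {
      // alignas both aligns each member and, because the struct's size is
      // rounded up to its alignment, pads the trailing member to a full line.
      alignas(64) cell_t* buffer_;
      std::size_t buffer_mask_;  // deliberately shares buffer_'s line
      alignas(64) std::atomic<std::size_t> enqueue_pos_;
      alignas(64) std::atomic<std::size_t> dequeue_pos_;
    };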

The __attribute__ approach would be more compiler-specific, but might cut the size of this structure in half, since the padding would be limited to rounding each element up to a full cache line. That could be quite beneficial if one had a lot of these.


The same concept applies in C as well as C++.

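As a hedged illustration of the attribute route (a GCC/Clang extension; MSVC spells it __declspec(align(64))), the following minimal sketch compiles as both C and C++ under GCC:

    /* Each field is rounded up to its own 64-byte boundary, so no
       explicit pad arrays are needed. */
    struct positions {
      unsigned long enqueue_pos_ __attribute__((aligned(64)));
      unsigned long dequeue_pos_ __attribute__((aligned(64)));
    };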

Answer by Phil Miller

You may need to align to a cache line boundary, which is typically 64 bytes per cache line, when you are working with interrupts or high-performance data reads, and alignment is mandatory when working with interprocess sockets. With interprocess sockets, there are control variables that cannot be spread out across multiple cache lines or DDR RAM words, or else the L1, L2, etc. caches or the DDR RAM will function as a low-pass filter and filter out your interrupt data! THAT IS BAD!!! It means you get bizarre errors even when your algorithm is good, and that has the potential to make you go insane!


DDR RAM is almost always going to read in 128-bit words (DDR RAM words), which is 16 bytes, so ring-buffer variables shall not be spread out across multiple DDR RAM words. Some systems do use 64-bit DDR RAM words, and technically you could get a 32-bit DDR RAM word on a 16-bit CPU, but one would use SDRAM in that situation.


One may also just be interested in minimizing the number of cache lines in use when reading data in a high-performance algorithm. In my case, I developed the world's fastest integer-to-string algorithm (40% faster than the prior fastest algorithm), and I'm working on optimizing the Grisu algorithm, which is the world's fastest floating-point algorithm. In order to print a floating-point number you must print the integer, so one optimization I have implemented for Grisu is cache-line-aligning its lookup tables (LUT) into exactly 15 cache lines, which is rather odd in that they actually aligned like that. This takes the LUTs out of the .bss section (i.e. static memory) and places them onto the stack (or the heap, but the stack is more appropriate). I have not benchmarked this, but it's good to bring up, and I learned a lot about this: the fastest way to load values is to load them from the i-cache and not the d-cache. The difference is that the i-cache is read-only and has much larger cache lines because it's read-only (2KB was what a professor once quoted me). So you're actually going to degrade your performance with array indexing as opposed to loading a variable like this:


int faster_way = 12345678;

as opposed to the slower way:


int variables[2] = { 12345678, 123456789};
int slower_way = variables[0];

The difference is that int faster_way = 12345678 will get loaded from the i-cache lines by offsetting to the variable in the i-cache from the start of the function, while slower_way = variables[0] will get loaded from the smaller d-cache lines using much slower array indexing. This particular subtlety, as I just discovered, is actually slowing down mine and many other integer-to-string algorithms. I say this because you may think you're optimizing by cache-aligning read-only data when you're not.


Typically in C++, you will use the std::align function. I would advise not using this function because it is not guaranteed to work optimally. For comparison, here is a minimal self-contained sketch of that standard route (my own example, not from the toolkit):

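    #include <cstddef>
    #include <memory>

    int main() {
      alignas(64) char storage[256 + 64];
      void* ptr = storage;
      std::size_t space = sizeof(storage);
      // Request a 64-byte-aligned region of 256 bytes within the buffer;
      // std::align adjusts ptr and space in place, or returns nullptr.
      void* aligned = std::align(64, 256, ptr, space);
      return aligned != nullptr ? 0 : 1;
    }

Here is the fastest way to align to a cache line, which, to be up front, I am the author of, and this is a shameless plug: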

Kabuki Toolkit Memory Alignment Algorithm


#include <cstdint>  // intptr_t

namespace _ {
/* Aligns the given pointer up to a power-of-two boundary with a premade mask.
@return An aligned pointer of typename T.
@brief The algorithm is a 2's complement trick that works by masking off
the desired number of bits in 2's complement and adding them to the
pointer.
@param pointer The pointer to align.
@param mask The mask for the least significant bits to align. */
template <typename T = char>
inline T* AlignUp(void* pointer, intptr_t mask) {
  intptr_t value = reinterpret_cast<intptr_t>(pointer);
  value += (-value) & mask;  // distance to the next boundary, 0 if aligned
  return reinterpret_cast<T*>(value);
}
}  //< namespace _

// Example calls using the faster mask technique.

enum { kSize = 256 };
char buffer[kSize + 64];

char* aligned_to_64_byte_cache_line = AlignUp<> (buffer, 63);

char16_t* aligned_to_64_byte_cache_line2 = AlignUp<char16_t> (buffer, 63);
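As a quick sanity check of the mask trick (my own worked example, not part of the toolkit), (-value) & mask yields exactly the number of bytes needed to reach the next boundary:

    // value = 0x1007; 0x1007 mod 64 = 7, so 57 bytes reach the next line.
    // (-0x1007) & 63 == 57, and 0x1007 + 57 == 0x1040 (64-byte aligned).
    static_assert((0x1007 + ((-0x1007) & 63)) == 0x1040,
                  "rounds up to the next 64-byte boundary");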

and here is the faster std::align replacement:

这是更快的 std::align 替换:

#include <cstddef>  // size_t
#include <cstdint>  // intptr_t

inline void* align_kabuki(size_t align, size_t size, void*& ptr,
                          size_t& space) noexcept {
  // Begin Kabuki Toolkit Implementation
  intptr_t int_ptr = reinterpret_cast<intptr_t>(ptr),
           offset = (-int_ptr) & (align - 1);  // bytes up to the next boundary
  if ((space -= offset) < size) {  // not enough room left after aligning
    space += offset;
    return nullptr;
  }
  return reinterpret_cast<void*>(int_ptr + offset);
  // End Kabuki Toolkit Implementation
}
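A hypothetical usage sketch mirroring the std::align call shape (the storage and size values are my own example, not from the toolkit):

    #include <cstddef>

    // (align_kabuki as defined above)

    int main() {
      char storage[256 + 64];
      void* ptr = storage;
      std::size_t space = sizeof(storage);
      void* aligned = align_kabuki(64, 256, ptr, space);
      return aligned != nullptr ? 0 : 1;
    }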