C++: How to use std::atomic efficiently

Question

Asked by Kan Li

std::atomic is a new feature introduced in C++11, but I can't find many tutorials on how to use it correctly. Are the following practices common and efficient?

One practice I use: we have a buffer, and I want to do a CAS on some of its bytes, so what I did was:

uint8_t *buf = ....
auto ptr = reinterpret_cast<std::atomic<uint8_t>*>(&buf[index]);
uint8_t oldValue, newValue;
do {
  oldValue = ptr->load();
  // Do some computation and calculate the newValue;
  newValue = f(oldValue);
} while (!ptr->compare_exchange_strong(oldValue, newValue));

So my questions are:

The above code uses an ugly reinterpret_cast. Is this the correct way to obtain an atomic pointer that refers to the location &buf[index]?
Is CAS on a single byte significantly slower than CAS on a machine word, so that I should avoid it? If I change the code to load a word, extract the byte, compute, set the byte in the new value, and then do the CAS, the code becomes more complicated, and I also need to deal with address alignment myself.
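For comparison, the word-based alternative described in the second question might look roughly like the sketch below. It assumes the data is stored from the start as std::atomic<uint32_t> words, so every access goes through the atomic object; the helper name update_byte_in_word and the lane numbering are illustrative, not part of any library.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical helper: atomically replace byte lane `lane` (must be 0..3) of a
// 32-bit atomic word with f(old byte). Lanes are numbered by bit position
// (lane 0 = the low 8 bits), not by memory address, so endianness is irrelevant.
template <class F>
std::uint8_t update_byte_in_word(std::atomic<std::uint32_t>& word,
                                 unsigned lane, F f) {
    const unsigned shift = lane * 8;
    const std::uint32_t mask = std::uint32_t{0xFF} << shift;

    std::uint32_t old_word = word.load();
    std::uint32_t new_word;
    std::uint8_t new_byte;
    do {
        // Extract the byte, compute its replacement, and splice it back in.
        const std::uint8_t old_byte =
            static_cast<std::uint8_t>((old_word & mask) >> shift);
        new_byte = f(old_byte);
        new_word = (old_word & ~mask) | (std::uint32_t{new_byte} << shift);
        // On failure, compare_exchange_weak writes the word's current value
        // into old_word, so the next iteration recomputes from fresh data.
    } while (!word.compare_exchange_weak(old_word, new_word));
    return new_byte;
}
```

The masking and shifting is exactly the extra complexity the question hopes to avoid; whether it pays off depends on how well the target supports byte-sized CAS.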

EDIT: If the answers to these questions are processor/architecture dependent, what's the conclusion for x86/x64 processors?

Answer 1

Answered by Anthony Williams

The reinterpret_cast will yield undefined behaviour. Your variable is either a std::atomic<uint8_t> or a plain uint8_t; you cannot cast between them. The size and alignment requirements may be different: for example, some platforms only provide atomic operations on words, so std::atomic<uint8_t> will use a full machine word where a plain uint8_t can just use a byte. Non-atomic operations may also be optimized in all sorts of ways, including being significantly reordered with surrounding operations and combined with other operations on adjacent memory locations where that can improve performance.
This does mean that if you want atomic operations on some data, then you have to know that in advance and create suitable std::atomic<> objects rather than just allocating a generic buffer. Of course, you could allocate a buffer and then use placement new to initialize your atomic variable in that buffer, but you'd have to ensure the size and alignment were correct, and you wouldn't be able to use non-atomic operations on that object.
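A minimal sketch of the placement-new route, assuming a single std::atomic<uint8_t>; the names AtomicByteSlot, construct_in and destroy are hypothetical. The storage is sized and aligned for the atomic type itself rather than for uint8_t, and after construction every access goes through the returned pointer.

```cpp
#include <atomic>
#include <cstdint>
#include <new>

using AtomicByte = std::atomic<std::uint8_t>;

// Raw storage sized and aligned for AtomicByte, which need not match uint8_t.
struct AtomicByteSlot {
    alignas(AtomicByte) unsigned char storage[sizeof(AtomicByte)];
};

// Begin the lifetime of an atomic object inside the slot.
// From here on, only use the returned pointer to access it.
AtomicByte* construct_in(AtomicByteSlot& slot, std::uint8_t initial) {
    return ::new (static_cast<void*>(slot.storage)) AtomicByte(initial);
}

// End the object's lifetime before the storage is reused for anything else.
void destroy(AtomicByte* a) {
    a->~AtomicByte();
}
```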
If you really don't care about ordering constraints on your atomic object, then use memory_order_relaxed on what would otherwise be the non-atomic operations. However, be aware that this is highly specialized and requires great care. For example, writes to distinct variables may be read by other threads in a different order than they were written, and different threads may read the values in different orders to each other, even within the same execution of the program.
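One common place where relaxed ordering is enough is an event counter whose value is the only thing the threads share. In the sketch below (count_events is a hypothetical name) no increment is lost, because each fetch_add is still atomic; relaxed ordering just promises nothing about the order of other memory operations.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Count events from several threads using relaxed atomic increments.
unsigned long count_events(unsigned threads, unsigned per_thread) {
    std::atomic<unsigned long> counter{0};
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < threads; ++t) {
        workers.emplace_back([&counter, per_thread] {
            for (unsigned i = 0; i < per_thread; ++i)
                counter.fetch_add(1, std::memory_order_relaxed);
        });
    }
    for (auto& w : workers) w.join();
    // Each thread's completion synchronizes with its join(), so this relaxed
    // load still sees every increment.
    return counter.load(std::memory_order_relaxed);
}
```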
If CAS is slower for a byte than for a word, you may be better off using std::atomic<unsigned>, but this will have a space penalty, and you certainly can't just use std::atomic<unsigned> to access a sequence of raw bytes: all operations on that data must be through the same std::atomic<unsigned> object. You are generally better off writing code that does what you need and letting the compiler figure out the best way to do that.

For x86/x64, with a std::atomic<unsigned> variable a, a.load(std::memory_order_acquire) and a.store(new_value, std::memory_order_release) are no more expensive than loads and stores to non-atomic variables as far as the actual instructions go, but they do limit compiler optimizations. If you use the default std::memory_order_seq_cst, then one or both of these operations will incur the synchronization cost of a LOCKed instruction or a fence (my implementation puts the price on the store, but other implementations may choose differently). However, memory_order_seq_cst operations are easier to reason about due to the "single total ordering" constraint they impose.
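To illustrate the acquire/release pair mentioned above, here is a sketch of the usual message-passing pattern, assuming one producer and one consumer; pass_message is a hypothetical name. The release store publishes an ordinary int, and the acquire load that observes the flag guarantees the consumer sees that value.

```cpp
#include <atomic>
#include <thread>

// Hand a plain (non-atomic) int from one thread to another.
int pass_message() {
    int payload = 0;                 // ordinary, non-atomic data
    std::atomic<bool> ready{false};

    std::thread producer([&] {
        payload = 42;                                  // 1. write the data
        ready.store(true, std::memory_order_release);  // 2. publish it
    });

    // On x86/x64 this acquire load is typically an ordinary MOV, but the
    // compiler may not move the later read of payload above it.
    while (!ready.load(std::memory_order_acquire)) {
        std::this_thread::yield();
    }
    int seen = payload;              // happens-after the producer's write
    producer.join();
    return seen;
}
```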

In many cases it is just as fast, and a lot less error-prone, to use locks rather than atomic operations. If the overhead of a mutex lock is significant due to contention, then you might need to rethink your data access patterns: cache ping-pong may well hit you with atomics anyway.
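For contrast, a lock-based version of the question's read-modify-write might look like this sketch (LockedBuffer is a hypothetical class). The buffer stays a plain byte vector, one std::mutex guards every access, and no retry loop is needed; whether it is as fast as the CAS loop depends on contention.

```cpp
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <vector>

class LockedBuffer {
public:
    explicit LockedBuffer(std::size_t n) : bytes_(n) {}  // n zeroed bytes

    // Apply f to one byte while holding the lock; returns the new value.
    template <class F>
    std::uint8_t update(std::size_t index, F f) {
        std::lock_guard<std::mutex> lock(mutex_);
        bytes_[index] = f(bytes_[index]);
        return bytes_[index];
    }

    std::uint8_t get(std::size_t index) {
        std::lock_guard<std::mutex> lock(mutex_);
        return bytes_[index];
    }

private:
    std::mutex mutex_;
    std::vector<std::uint8_t> bytes_;
};
```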

Answer 2

Answered by Dietmar Kühl

Your code is certainly wrong and bound to do something funny. If things go really badly, it might even appear to do what you intended. I wouldn't go as far as claiming to understand how to use e.g. CAS properly, but you would use std::atomic<T> something like this:

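#include <atomic>
#include <cstdint>

std::atomic<uint8_t> value(0);
uint8_t oldvalue, newvalue;
do
{
    oldvalue = value.load();   // snapshot the current value
    newvalue = f(oldvalue);    // f: the caller's update function
}
while (!value.compare_exchange_strong(oldvalue, newvalue)); // retry if another thread changed value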

So far my personal policy is to stay away from any of this lock-free stuff and leave it to people who know what they are doing. I would use atomic_flag and possibly counters, and that is about as far as I'd go. Conceptually I understand how this lock-free stuff works, but I also understand that there are way too many things which can go wrong if you are not extremely careful.
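As an example of the atomic_flag usage mentioned above, here is a minimal spinlock sketch; SpinLock is a hypothetical class, and std::atomic_flag is the one atomic type the standard guarantees to be lock-free. It provides lock() and unlock(), so it works with std::lock_guard, but for anything beyond very short critical sections a std::mutex is usually the safer default.

```cpp
#include <atomic>

class SpinLock {
public:
    SpinLock() noexcept { flag_.clear(std::memory_order_relaxed); }  // start unlocked

    void lock() noexcept {
        // test_and_set returns the previous state: spin while it was already set.
        while (flag_.test_and_set(std::memory_order_acquire)) {
        }
    }

    bool try_lock() noexcept {
        return !flag_.test_and_set(std::memory_order_acquire);
    }

    void unlock() noexcept {
        flag_.clear(std::memory_order_release);
    }

private:
    std::atomic_flag flag_;
};
```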

Answer 3

Answered by Grizzly

Your reinterpret_cast<std::atomic<uint8_t>*>(...) is most definitely not the correct way to retrieve an atomic, and it is not even guaranteed to work. This is because std::atomic<T> is not guaranteed to have the same size as T.

To your second question, about CAS being slower for bytes than for machine words: that's really machine dependent. It might be faster, it might be slower, or there might not even be a byte-sized CAS on your target architecture. In the latter case the implementation will most likely need either to use a locking implementation for the atomic or to use a different (bigger) type internally (which is one example of an atomic not having the same size as the underlying type).
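Whether std::atomic<T> differs in size or alignment from T, and whether it is lock-free on a given target, can be checked on a particular implementation. The sketch below uses hypothetical helper names; even when the layouts happen to match, that does not make the reinterpret_cast from the question well-defined.

```cpp
#include <atomic>
#include <cstdint>

// Does std::atomic<T> share T's size and alignment on this implementation?
template <class T>
bool same_layout_as_atomic() {
    return sizeof(std::atomic<T>) == sizeof(T) &&
           alignof(std::atomic<T>) == alignof(T);
}

// Is std::atomic<T> implemented without an internal lock here?
template <class T>
bool atomic_is_lock_free() {
    std::atomic<T> probe{};
    return probe.is_lock_free();
}
```

On mainstream x86/x64 compilers both checks are expected to succeed for uint8_t, but that is a property of the implementation, not a language guarantee.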

From what I see, there is really no way to get a std::atomic on an existing value, particularly since they aren't guaranteed to be the same size. Therefore you really should directly make buf a std::atomic<uint8_t>*. Furthermore, I'm relatively sure that even if such a cast did work, non-atomic accesses to the same address wouldn't be guaranteed to work as expected (since such an access isn't guaranteed to be atomic, even for bytes). So having non-atomic means of accessing a memory location you want to do atomic operations on doesn't really make sense.
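Following that advice, the buffer can be allocated as std::atomic<uint8_t> objects up front. This sketch uses a std::vector; the AtomicBuffer alias and cas_update function are hypothetical. The vector(n) constructor value-initializes every element, so all bytes start at zero, and because std::atomic is neither copyable nor movable the vector cannot be resized or appended to afterwards.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <vector>

using AtomicBuffer = std::vector<std::atomic<std::uint8_t>>;

// The question's CAS loop, applied to element `index` of an atomic buffer.
template <class F>
std::uint8_t cas_update(AtomicBuffer& buf, std::size_t index, F f) {
    std::atomic<std::uint8_t>& slot = buf[index];
    std::uint8_t old_value = slot.load();
    std::uint8_t new_value;
    do {
        new_value = f(old_value);
        // On failure, compare_exchange_strong refreshes old_value with the
        // current contents, so no separate reload is needed.
    } while (!slot.compare_exchange_strong(old_value, new_value));
    return new_value;
}
```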

Note that on common architectures byte loads and stores are atomic anyway, so you have little to no performance overhead for using atomics there, as long as you use relaxed memory order for those operations. So if at some point you don't really care about the order of execution (e.g. because the program isn't multithreaded yet), simply use a.store(0, std::memory_order_relaxed) instead of a.store(0).
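A sketch of that advice for a single-threaded setup phase, assuming the bytes are already std::atomic<uint8_t> objects; fill_relaxed and sum_relaxed are hypothetical names. Relaxed operations are still atomic; they just request no ordering. If other threads are started afterwards, thread creation synchronizes with them, so they see the stored values.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

// Initialize every byte without imposing any ordering constraints.
void fill_relaxed(std::atomic<std::uint8_t>* bytes, std::size_t n,
                  std::uint8_t value) {
    for (std::size_t i = 0; i < n; ++i)
        bytes[i].store(value, std::memory_order_relaxed);
}

// Read every byte back, again with relaxed loads.
unsigned sum_relaxed(const std::atomic<std::uint8_t>* bytes, std::size_t n) {
    unsigned total = 0;
    for (std::size_t i = 0; i < n; ++i)
        total += bytes[i].load(std::memory_order_relaxed);
    return total;
}
```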

Of course, if you are only talking about x86, your reinterpret_cast is likely to work, but your performance question is probably still processor dependent (I think; I haven't looked up the actual instruction timings for cmpxchg).
