Original URL: http://stackoverflow.com/questions/3652056/
Warning: these are provided under the CC BY-SA 4.0 license. You are free to use/share them, but you must attribute them to the original authors (not me): StackOverflow
How efficient is locking an unlocked mutex? What is the cost of a mutex?
Asked by Albert
In a low level language (C, C++ or whatever): I have the choice in between either having a bunch of mutexes (like what pthread gives me or whatever the native system library provides) or a single one for an object.
How efficient is it to lock a mutex? I.e. roughly how many assembler instructions are involved, and how much time do they take (in the case that the mutex is unlocked)?
How much does a mutex cost? Is it a problem to have really a lot of mutexes? Or can I just throw as many mutex variables into my code as I have int variables, and it doesn't really matter?
(I am not sure how many differences there are between different hardware. If there are, I would also like to know about them. But mostly, I am interested in common hardware.)
The point is, by using many mutexes that each cover only a part of the object, instead of a single mutex for the whole object, I could avoid blocking in many cases. And I am wondering how far I should go with this. I.e. should I really try to avoid every possible block as far as possible, no matter how much more complicated and how many more mutexes this means?
WebKit's blog post about locking (2016) is very related to this question, and explains the differences between a spinlock, adaptive lock, futex, etc.
Accepted answer by Dummy00001
I have the choice in between either having a bunch of mutexes or a single one for an object.
If you have many threads and access to the object happens often, then multiple locks would increase parallelism, at the cost of maintainability, since more locking means more debugging of the locking.
How efficient is it to lock a mutex? I.e. roughly how many assembler instructions are involved, and how much time do they take (in the case that the mutex is unlocked)?
The precise assembler instructions are the least of a mutex's overhead; the memory/cache coherency guarantees are the main overhead. And the less often a particular lock is taken, the better.
A mutex is made of two major parts (oversimplifying): (1) a flag indicating whether the mutex is locked or not, and (2) a wait queue.
Changing the flag takes just a few instructions and is normally done without a system call. If the mutex is already locked, a syscall happens to add the calling thread to the wait queue and start the waiting. Unlocking, if the wait queue is empty, is cheap, but otherwise needs a syscall to wake up one of the waiting processes. (On some systems cheap/fast syscalls are used to implement the mutexes; they become slow (normal) system calls only in case of contention.)
Locking an unlocked mutex is really cheap. Unlocking a mutex without contention is cheap too.
How much does a mutex cost? Is it a problem to have really a lot of mutexes? Or can I just throw as many mutex variables into my code as I have int variables, and it doesn't really matter?
You can throw as many mutex variables into your code as you wish. You are only limited by the amount of memory your application can allocate.
Summary. User-space locks (and mutexes in particular) are cheap and not subject to any system limit. But too many of them spells a nightmare for debugging. A simple table:
1. Fewer locks means more contention (slow syscalls, CPU stalls) and less parallelism.
2. Fewer locks means fewer problems when debugging multi-threading issues.
3. More locks means less contention and higher parallelism.
4. More locks means more chances of running into undebuggable deadlocks.
A balanced locking scheme for the application should be found and maintained, generally balancing #2 and #3.
(*) The problem with fewer, very often locked mutexes is that if you have too much locking in your application, it causes a lot of inter-CPU/core traffic to flush the mutex's memory from the data caches of other CPUs in order to guarantee cache coherency. The cache flushes are like light-weight interrupts and are handled by the CPUs transparently, but they do introduce so-called stalls (search for "stall").
And the stalls are what makes the locking code run slowly, often without any apparent indication of why the application is slow. (Some architectures provide inter-CPU/core traffic stats, some do not.)
To avoid the problem, people generally resort to a large number of locks to decrease the probability of lock contention and to avoid the stalls. That is the reason why cheap user-space locking, not subject to system limits, exists.
Answered by Carlo Wood
I wanted to know the same thing, so I measured it. On my box (AMD FX(tm)-8150 Eight-Core Processor at 3.612361 GHz), locking and unlocking an unlocked mutex that is in its own cache line and is already cached takes 47 clocks (13 ns).
Due to synchronization between two cores (I used CPUs #0 and #1), I could only call a lock/unlock pair once every 102 ns on two threads, i.e. once every 51 ns, from which one can conclude that it takes roughly 38 ns to recover after a thread does an unlock before the next thread can lock it again.
The program that I used to investigate this can be found here: https://github.com/CarloWood/ai-statefultask-testsuite/blob/b69b112e2e91d35b56a39f41809d3e3de2f9e4b8/src/mutex_test.cxx
Note that it has a few hardcoded values specific for my box (xrange, yrange and rdtsc overhead), so you probably have to experiment with it before it will work for you.
The graph it produces in that state (image not reproduced here) shows the result of benchmark runs on the following code:
uint64_t do_Ndec(int thread, int loop_count)
{
  uint64_t start;
  uint64_t end;
  int __d0;
  asm volatile ("rdtsc\n\tshl $32, %%rdx\n\tor %%rdx, %0" : "=a" (start) : : "%rdx");
  mutex.lock();
  mutex.unlock();
  asm volatile ("rdtsc\n\tshl $32, %%rdx\n\tor %%rdx, %0" : "=a" (end) : : "%rdx");
  asm volatile ("\n1:\n\tdecl %%ecx\n\tjnz 1b" : "=c" (__d0) : "c" (loop_count - thread) : "cc");
  return end - start;
}
The two rdtsc calls measure the number of clocks that it takes to lock and unlock 'mutex' (with an overhead of 39 clocks for the rdtsc calls on my box). The third asm is a delay loop. The size of the delay loop is 1 count smaller for thread 1 than for thread 0, so thread 1 is slightly faster.
The above function is called in a tight loop of size 100,000. Although the function is slightly faster for thread 1, both loops synchronize because of the call to the mutex. This is visible in the graph from the fact that the number of clocks measured for the lock/unlock pair is slightly larger for thread 1, to account for the shorter delay in the loop below it.
In the above graph the bottom-right point is a measurement with a delay loop_count of 150; following the points at the bottom towards the left, the loop_count is reduced by one per measurement. When it becomes 77, the function is called every 102 ns in both threads. If loop_count is subsequently reduced even further, it is no longer possible to synchronize the threads, and the mutex actually starts to be locked most of the time, resulting in an increased number of clocks needed to do the lock/unlock. The average time of the function call also increases because of this, so the plot points go up and towards the right again.
From this we can conclude that locking and unlocking a mutex every 50 ns is not a problem on my box.
All in all, my conclusion is that the answer to the OP's question is that adding more mutexes is better, as long as that results in less contention.
Try to hold mutexes for as short a time as possible. The only reason to put them, say, outside a loop would be if that loop loops faster than once every 100 ns (or rather, the number of threads that want to run that loop at the same time, times 50 ns), or when 13 ns times the loop size is more delay than the delay you get from contention.
EDIT: I am a lot more knowledgeable on the subject now and have started to doubt the conclusion that I presented here. First of all, CPUs 0 and 1 turn out to be hyper-threaded; even though AMD claims to have 8 real cores, there is certainly something very fishy, because the delays between two other cores are much larger (i.e., 0 and 1 form a pair, as do 2 and 3, 4 and 5, and 6 and 7). Secondly, std::mutex is implemented in such a way that it spins for a bit before actually doing system calls when it fails to immediately obtain the lock on a mutex (the syscall path being no doubt extremely slow). So what I have measured here is the absolute most ideal situation, and in practice locking and unlocking might take drastically more time per lock/unlock.
Bottom line, a mutex is implemented with atomics. To synchronize atomics between cores, an internal bus must be locked, which freezes the corresponding cache line for several hundred clock cycles. In the case that a lock cannot be obtained, a system call has to be performed to put the thread to sleep; that is obviously extremely slow (system calls are on the order of 10 microseconds). Normally that is not really a problem because that thread has to sleep anyway. But it can become a problem under high contention, where a thread can't obtain the lock within the time that it normally spins, so the system call is made, yet the thread COULD have taken the lock shortly thereafter. For example, if several threads lock and unlock a mutex in a tight loop and each keeps the lock for 1 microsecond or so, then they might be slowed down enormously by the fact that they are constantly put to sleep and woken up again. Also, once a thread sleeps and another thread has to wake it up, the waking thread has to do a system call and is delayed ~10 microseconds; this delay thus happens while unlocking a mutex when another thread is waiting for that mutex in the kernel (after spinning took too long).
Answered by valdo
This depends on what you actually call a "mutex", the OS mode, etc.
At minimum it's the cost of an interlocked memory operation. It's a relatively heavy operation (compared to other primitive assembler commands).
However, it can be very much higher. If what you call a "mutex" is a kernel object (i.e., an object managed by the OS) and you run in user mode, every operation on it leads to a kernel-mode transaction, which is very heavy.
For example, on an Intel Core Duo processor running Windows XP: an interlocked operation takes about 40 CPU cycles; a kernel-mode call (i.e., a system call) takes about 2000 CPU cycles.
If this is the case, you may consider using critical sections. They're a hybrid of a kernel mutex and interlocked memory access.
Answered by paxdiablo
The cost will vary depending on the implementation, but you should keep in mind two things:
- the cost will most likely be minimal, since it's both a fairly primitive operation and it will be optimised as much as possible due to its use pattern (it's used a lot).
- it doesn't matter how expensive it is, since you need to use it if you want safe multi-threaded operation. If you need it, then you need it.
On single-processor systems, you can generally just disable interrupts long enough to atomically change data. Multi-processor systems can use a test-and-set strategy.
In both those cases, the instructions are relatively efficient.
As to whether you should provide a single mutex for a massive data structure, or have many mutexes, one for each section of it: that's a balancing act.
By having a single mutex, you have a higher risk of contention between multiple threads. You can reduce this risk by having a mutex per section, but you don't want to get into a situation where a thread has to lock 180 mutexes to do its job :-)
Answered by Grant Petty
I'm completely new to pthreads and mutexes, but I can confirm from experimentation that the cost of locking/unlocking a mutex is almost zilch when there is no contention, but when there is contention, the cost of blocking is extremely high. I ran simple code with a thread pool in which the task was just to compute a sum into a global variable protected by a mutex lock:
y = exp(-j*0.0001);
pthread_mutex_lock(&lock);
x += y ;
pthread_mutex_unlock(&lock);
With one thread, the program sums 10,000,000 values virtually instantaneously (in less than one second); with two threads (on a MacBook with 4 cores), the same program takes 39 seconds.