C++: How to estimate the thread context switching overhead?

Question

Asked by Ignas Limanauskas

I am trying to improve the performance of a threaded application with real-time deadlines. It runs on Windows Mobile and is written in C/C++. I suspect that a high frequency of thread switching might be causing tangible overhead, but I can neither prove nor disprove it. As everybody knows, lack of proof is not proof of the opposite :).

Thus my question is twofold:

If any exist at all, where can I find actual measurements of the cost of switching thread context?
Without spending time writing a test application, what are the ways to estimate the thread switching overhead in an existing application?
Does anyone know a way to find out the number of context switches (on/off) for a given thread?

Answer 1

Accepted answer by OregonGhost

While you said you don't want to write a test application, I did exactly that for an earlier test on an ARM9 Linux platform to find out what the overhead is. It was just two threads that would call boost::thread::yield() (or something similar) and increment some variable, and after a minute or so (with no other processes running, or at least none doing any work), the app printed how many context switches it could do per second. Of course this is not really exact, but the point is that both threads yielded the CPU to each other, and it was so fast that it no longer made sense to think about the overhead. So simply go ahead and write a simple test instead of thinking too much about a problem that may not exist.
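
A minimal sketch of such a yield test, written with C++11 std::thread rather than Boost (an assumption, since the answer does not show its code). The helper name yields_per_second is hypothetical. On a multi-core machine the two threads may simply run on separate cores, so the result only approximates context switches when both threads share one core.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstdint>
#include <thread>

// Runs two threads that only increment a counter and yield, then returns
// the approximate number of yields per second over `duration`.
// Hypothetical helper; name and signature are not from the answer.
double yields_per_second(std::chrono::milliseconds duration) {
    assert(duration.count() > 0);
    std::atomic<bool> stop{false};
    std::atomic<std::uint64_t> yields{0};

    auto worker = [&] {
        while (!stop.load(std::memory_order_relaxed)) {
            yields.fetch_add(1, std::memory_order_relaxed);
            std::this_thread::yield();  // offer the CPU to the other thread
        }
    };

    auto begin = std::chrono::steady_clock::now();
    std::thread a(worker);
    std::thread b(worker);
    std::this_thread::sleep_for(duration);
    stop.store(true);
    a.join();
    b.join();
    std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - begin;

    return static_cast<double>(yields.load()) / elapsed.count();
}
```

Calling yields_per_second(std::chrono::seconds(60)) would mirror the one-minute run described above. The figure is an upper bound on useful switches, since a yield with nothing else runnable returns immediately.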

Other than that, you might try performance counters, as 1800 INFORMATION suggested.

Oh, and I remember an application running on Windows CE 4.X, where we also had four threads with intensive switching at times, and we never ran into performance issues. We also tried to implement the core threading part without threads at all, and saw no performance improvement (the GUI just responded much more slowly, but everything else was the same). Maybe you can try the same, either by reducing the number of context switches or by removing threads completely (just for testing).

Answer 2

Answered by Mecki

I doubt you can find this overhead somewhere on the web for any existing platform. There are just too many different platforms. The overhead depends on two factors:

The CPU, as the necessary operations may be easier or harder on different CPU types
The system kernel, as different kernels will have to perform different operations on each switch

Other factors include how the switch takes place. A switch can take place when

the thread has used all of its time quantum. When a thread is started, it may run for a given amount of time before it has to return control to the kernel that will decide who's next.
the thread was preempted. This happens when another thread needs CPU time and has a higher priority. E.g. the thread that handles mouse/keyboard input may be such a thread. No matter what thread owns the CPU right now, when the user types or clicks something, he doesn't want to wait until the current thread's time quantum has been used up completely; he wants to see the system react immediately. Thus some systems will make the current thread stop immediately and hand control to some other thread with higher priority.
the thread doesn't need CPU time anymore, because it's blocking on some operation or just called sleep() (or similar) to stop running.

These 3 scenarios might have different thread switching times in theory. E.g. I'd expect the last one to be the slowest: a call to sleep() means the CPU is given back to the kernel, and the kernel needs to set up a wake-up call that makes sure the thread is woken up after about the amount of time it requested to sleep. It must then take the thread out of the scheduling process, and once the thread is woken up, it must add the thread to the scheduling process again. All these steps take some amount of time. So the actual sleep call might take longer than the time it takes to switch to another thread.

I think if you want to know for sure, you must benchmark. The problem is that you usually have to either put threads to sleep or synchronize them using mutexes. Sleeping and locking/unlocking mutexes have an overhead of their own, so your benchmark will include these overheads as well. Without a powerful profiler, it's hard to say afterwards how much CPU time was used for the actual switch and how much for the sleep/mutex call. On the other hand, in a real-life scenario your threads will either sleep or synchronize via locks as well. A benchmark that purely measures the context switch time is a synthetic benchmark, as it does not model any real-life scenario. Benchmarks are much more "realistic" if they are based on real-life scenarios. Of what use is a GPU benchmark that tells me my GPU can in theory handle 2 billion polygons a second, if this result can never be achieved in a real-life 3D application? Wouldn't it be much more interesting to know how many polygons a second a real-life 3D application can have the GPU handle?

Unfortunately I know nothing of Windows programming. I could write an application for Windows in Java or maybe in C#, but C/C++ on Windows makes me cry. I can only offer you some source code for POSIX.

#include <stdlib.h>
#include <stdint.h>
#include <stdio.h>
#include <pthread.h>
#include <sys/time.h>
#include <unistd.h>

uint32_t COUNTER;
pthread_mutex_t LOCK;
pthread_mutex_t START;
pthread_cond_t CONDITION;

void * threads (
    void * unused
) {
    // Wait till we may fire away
    pthread_mutex_lock(&START);
    pthread_mutex_unlock(&START);

    pthread_mutex_lock(&LOCK);
    // If I'm not the first thread, the other thread is already waiting on
    // the condition, thus I have to wake it up first, otherwise we'll deadlock
    if (COUNTER > 0) {
        pthread_cond_signal(&CONDITION);
    }
    for (;;) {
        COUNTER++;
        pthread_cond_wait(&CONDITION, &LOCK);
        // Always wake up the other thread before processing. The other
        // thread will not be able to do anything as long as I don't go
        // back to sleep first.
        pthread_cond_signal(&CONDITION);
    }
    pthread_mutex_unlock(&LOCK); // Never reached: the loop above runs forever
    return NULL;
}

int64_t timeInMS ()
{
    struct timeval t;

    gettimeofday(&t, NULL);
    return (
        (int64_t)t.tv_sec * 1000 +
        (int64_t)t.tv_usec / 1000
    );
}


int main (
    int argc,
    char ** argv
) {
    int64_t start;
    pthread_t t1;
    pthread_t t2;
    int64_t myTime;

    pthread_mutex_init(&LOCK, NULL);
    pthread_mutex_init(&START, NULL);   
    pthread_cond_init(&CONDITION, NULL);

    pthread_mutex_lock(&START);
    COUNTER = 0;
    pthread_create(&t1, NULL, threads, NULL);
    pthread_create(&t2, NULL, threads, NULL);
    pthread_detach(t1);
    pthread_detach(t2);
    // Get start time and fire away
    myTime = timeInMS();
    pthread_mutex_unlock(&START);
    // Wait for about a second
    sleep(1);
    // Stop both threads
    pthread_mutex_lock(&LOCK);
    // Find out how much time has really passed. sleep won't guarantee me that
    // I sleep exactly one second, I might sleep longer since even after being
    // woken up, it can take some time before I gain back CPU time. Further
    // some more time might have passed before I obtained the lock!
    myTime = timeInMS() - myTime;
    // Correct the number of thread switches accordingly
    COUNTER = (uint32_t)(((uint64_t)COUNTER * 1000) / myTime);
    printf("Number of thread switches in about one second was %u\n", COUNTER);
    return 0;
}

Output

Number of thread switches in about one second was 108406

Over 100,000 is not too bad, even though we have locking and condition waits. I'd guess that without all this, at least twice as many thread switches per second would be possible.
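
To put the count into perspective, it can be turned into an average time per switch. The sketch below is plain arithmetic; the helper name micros_per_switch is hypothetical. For the output above, 108406 switches normalized to one second works out to roughly 9.2 microseconds per switch, and that figure still includes the mutex and condition variable overhead.

```cpp
#include <cassert>
#include <cstdint>

// Average cost of one switch in microseconds, given how many switches were
// counted over `elapsed_ms` milliseconds. Hypothetical helper for illustration.
double micros_per_switch(std::uint64_t switches, std::uint64_t elapsed_ms) {
    assert(switches > 0);
    return (static_cast<double>(elapsed_ms) * 1000.0) /
           static_cast<double>(switches);
}
```

Since the benchmark already scales COUNTER to one second, micros_per_switch(108406, 1000) is the matching call.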

Answer 3

Answered by ctacke

You can't estimate it. You need to measure it. And it's going to vary depending on the processor in the device.

There are two fairly simple ways to measure a context switch. One involves code, the other doesn't.

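First, the code way (a sketch for Windows CE, with minimal error handling):

#include <windows.h>
#include <tchar.h>

static LARGE_INTEGER g_start;  // counter value taken before the thread is resumed

static DWORD WINAPI ThreadProc(LPVOID unused)
{
  LARGE_INTEGER now, freq;
  QueryPerformanceCounter(&now);
  QueryPerformanceFrequency(&freq);
  // Elapsed time in microseconds since main took its timestamp
  LONGLONG us = (now.QuadPart - g_start.QuadPart) * 1000000 / freq.QuadPart;
  RETAILMSG(TRUE, (_T("ET: %d us\r\n"), (int)us));
  return 0;
}

int WINAPI WinMain(HINSTANCE hInst, HINSTANCE hPrev, LPTSTR cmdLine, int show)
{
  HANDLE hThread = CreateThread(NULL, 0, ThreadProc, NULL, CREATE_SUSPENDED, NULL);
  if (hThread == NULL) return 1;
  QueryPerformanceCounter(&g_start);
  CeSetThreadPriority(hThread, 10); // real high
  ResumeThread(hThread);
  Sleep(10);
  CloseHandle(hThread);
  return 0;
}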

Obviously doing it in a loop and averaging will be better. Keep in mind that this doesn't just measure the context switch. You're also measuring the call to ResumeThread and there's no guarantee the scheduler is going to immediately switch to your other thread (though the priority of 10 should help increase the odds that it will).

You can get a more accurate measurement with CeLog by hooking into scheduler events, but it's far from simple to do and not very well documented. If you really want to go that route, Sue Loh has several blog posts on it that a search engine can find.

The non-code route would be to use Remote Kernel Tracker. Install eVC 4.0 or the eval version of Platform Builder to get it. It will give a graphical display of everything the kernel is doing and you can directly measure a thread context switch with the provided cursor capabilities. Again, I'm certain Sue has a blog entry on using Kernel Tracker as well.

All that said, you're going to find that CE intra-process thread context switches are really, really fast. It's the process switches that are expensive, as they require swapping the active process in RAM and then doing the migration.

Answer 4

Answered by bobah

My 50 lines of C++ show, for Linux (QuadCore Q6600), a context switch time of ~0.9 us (0.75 us for 2 threads, 0.95 us for 50 threads). In this benchmark, threads call yield immediately when they get a quantum of time.

Answer 5

Answered by Soroush

Context switching is expensive; as a rule of thumb it costs 30 μs of CPU overhead: http://blog.tsunanet.net/2010/11/how-long-does-it-take-to-make-context.html

Answer 6

Answered by Tim Ring

I've only ever tried to estimate this once, and that was on a 486! The upshot was that the processor context switch took about 70 instructions to complete (note that this happened for many OS API calls as well as for thread switching). We calculated that it took approx. 30 us per thread switch (including OS overhead) on a DX3. The few thousand context switches we were doing per second were absorbing 5-10% of the processor time.
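
The arithmetic behind such an estimate is simple: the switch rate times the cost per switch gives the share of each second spent switching. The helper below is a hypothetical illustration, not code from the answer. At 30 us per switch, 2,000 switches per second cost 6% of the CPU and 3,000 cost 9%, which is consistent with the 5-10% quoted above.

```cpp
#include <cassert>

// Fraction of CPU time (0.0 to 1.0) spent on switching, given a switch rate
// and an average cost per switch in microseconds. Hypothetical helper.
double switch_overhead_fraction(double switches_per_second, double cost_us) {
    assert(switches_per_second >= 0.0 && cost_us >= 0.0);
    return switches_per_second * cost_us / 1e6;
}
```

The same function works in reverse as a budget check: if the measured cost and the observed switch rate multiply out to well under 1%, switching is unlikely to be the bottleneck.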

How that would translate to a multi-core, multi-GHz modern processor I don't know, but I would guess that, unless you were completely going over the top with thread switching, it's a negligible overhead.

Note that thread creation/deletion is a more expensive CPU/OS hogger than activating/deactivating threads. A good policy for heavily threaded apps is to use thread pools and activate/deactivate as required.

Answer 7

Answered by Atmapuri

The problem with context switches is that they have a fixed time. GPUs implement a 1-cycle context switch between threads. The following, for example, cannot be threaded on CPUs:

double * a; 
...
for (i = 0; i < 1000; i ++)
{
    a[i] = a[i] + a[i];
}

because its execution time is much less than the context switch cost. On a Core i7 this code takes around 1 microsecond (depending on the compiler). So context switch time does matter, because it determines how small a job can be and still be worth threading. I guess this also provides a method for effectively measuring the context switch: check how long the array (in the example above) has to be before two threads from a thread pool start showing a real advantage over a single thread. This may easily become 100,000 elements, and therefore the effective context switch time would be somewhere in the range of 20 us within the same app.

All the encapsulation used by the thread pool has to be counted toward the thread switch time, because that is what it all comes down to in the end.
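
A rough sketch of that break-even measurement. It is an assumption-laden illustration: it spawns a fresh std::thread for each run instead of using a thread pool, so it overstates the hand-off cost compared with the pool described above, and the function names are hypothetical. A single timing per length is noisy, so real use would repeat each measurement and average.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <thread>
#include <vector>

// Doubles every element in [first, last), like the loop in the answer.
void double_range(double* first, double* last) {
    for (double* p = first; p != last; ++p) *p = *p + *p;
}

// Times one pass over `n` elements using one thread or two and returns the
// elapsed seconds. Thread start-up is included on purpose, since every cost
// of handing work to another thread counts toward the effective switch time.
double time_doubling(std::size_t n, bool two_threads) {
    assert(n >= 2);
    std::vector<double> a(n, 1.0);
    auto begin = std::chrono::steady_clock::now();
    if (two_threads) {
        double* mid = a.data() + n / 2;
        std::thread helper(double_range, mid, a.data() + n);
        double_range(a.data(), mid);
        helper.join();
    } else {
        double_range(a.data(), a.data() + n);
    }
    std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - begin;
    assert(a.front() == 2.0 && a.back() == 2.0);  // the work really happened
    return elapsed.count();
}

// Smallest power-of-two length (starting at 1024) at which two threads beat
// one, or 0 if none was found up to `max_n`.
std::size_t break_even_length(std::size_t max_n) {
    for (std::size_t n = 1024; n <= max_n; n *= 2) {
        if (time_doubling(n, true) < time_doubling(n, false)) return n;
    }
    return 0;
}
```

The single-threaded time at the break-even length then gives an order-of-magnitude estimate of the effective hand-off cost on that machine.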

Atmapuri

Answer 8

Answered by bokan

Context switching is very expensive, not because of the CPU operation itself, but because of cache invalidation. If you have an intensive task running, it will fill the CPU caches, both for instructions and data; the memory prefetcher, the TLB and the RAM will also be tuned toward the areas of memory the task works on.

When you change context, all these cache mechanisms are reset and the new thread starts from a "blank" state.

The accepted answer is wrong unless your threads are just incrementing a counter. Of course no cache flush is involved in that case. There is no point in benchmarking context switching without filling the caches as real applications do.
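
One way to bring cache effects into such a benchmark is to have each thread walk a buffer of its own before handing the turn to the other thread. The sketch below is a hypothetical illustration, not code from any of the answers: the name seconds_per_handoff, the mutex/condition-variable hand-off, and the 64-byte stride (an assumed cache line size) are all assumptions. The effect is most visible when both threads share one core.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Two threads pass a turn back and forth, `rounds` times each. Before
// passing the turn on, a thread touches its own `working_set` bytes so its
// data sits in cache when it is switched out. Returns the average seconds
// per hand-off, including the buffer walk and thread start-up.
double seconds_per_handoff(int rounds, std::size_t working_set) {
    assert(rounds > 0 && working_set > 0);
    std::mutex m;
    std::condition_variable cv;
    int turn = 0;                        // id of the thread allowed to run
    std::atomic<unsigned long> sink{0};  // keeps the buffer walk observable

    auto worker = [&](int id) {
        std::vector<unsigned char> buf(working_set, 1);
        unsigned long local = 0;
        for (int i = 0; i < rounds; ++i) {
            std::unique_lock<std::mutex> lock(m);
            cv.wait(lock, [&] { return turn == id; });
            // Touch one byte per assumed 64-byte cache line.
            for (std::size_t j = 0; j < buf.size(); j += 64) local += ++buf[j];
            turn = 1 - id;               // give the turn to the other thread
            lock.unlock();
            cv.notify_one();
        }
        sink += local;
    };

    auto begin = std::chrono::steady_clock::now();
    std::thread a(worker, 0);
    std::thread b(worker, 1);
    a.join();
    b.join();
    std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - begin;
    return elapsed.count() / (2.0 * rounds);
}
```

Comparing a small working set (a few KB) against one larger than the CPU cache gives a feel for how much of the per-switch cost comes from refilling caches rather than from the switch itself.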

Answer 9

Answered by 1800 INFORMATION

I don't know, but do you have the usual performance counters in Windows Mobile? You could look at things like context switches/sec. I don't know if there is one that specifically measures context switch time, though.
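
For comparison, on a POSIX system (an assumption, since the question targets Windows Mobile) the kernel already keeps such counts, and getrusage reports them. The sketch below reads the voluntary and involuntary context switch counts for the calling thread where the Linux-specific RUSAGE_THREAD is available, and for the whole process otherwise. The struct and function names are hypothetical.

```cpp
#include <cassert>
#include <sys/resource.h>

// Context switch counts as reported by getrusage.
struct SwitchCounts {
    long voluntary;    // thread gave up the CPU (blocked, slept, yielded)
    long involuntary;  // thread was preempted or its time slice ran out
};

// Counts charged so far to the calling thread (Linux RUSAGE_THREAD), or to
// the whole process where RUSAGE_THREAD is not defined.
SwitchCounts current_switch_counts() {
    struct rusage usage;
#ifdef RUSAGE_THREAD
    int rc = getrusage(RUSAGE_THREAD, &usage);
#else
    int rc = getrusage(RUSAGE_SELF, &usage);
#endif
    assert(rc == 0);
    (void)rc;
    return SwitchCounts{usage.ru_nvcsw, usage.ru_nivcsw};
}
```

Sampling the counts before and after a section of the application and dividing the difference by the elapsed time gives a context switches/sec figure for that section.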
