Linux C++:如何分析因缓存未命中而浪费的时间?
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/2486840/
Warning: these are provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me):
StackOverFlow
Linux C++: how to profile time wasted due to cache misses?
提问 by anon
I know that I can use gprof to benchmark my code.
我知道我可以使用 gprof 来对我的代码进行基准测试。
However, I have this problem -- I have a smart pointer that has an extra level of indirection (think of it as a proxy object).
但是,我有这个问题——我有一个智能指针,它具有额外的间接级别(将其视为代理对象)。
As a result, I have this extra layer that affects pretty much all functions, and screws with caching.
结果,我多了这样一个额外的层,它几乎影响所有函数,还会扰乱缓存。
Is there a way to measure the time my CPU wastes due to cache misses?
有没有办法测量我的 CPU 由于缓存未命中而浪费的时间?
Thanks!
谢谢!
回答 by Taavi
You could try cachegrind and its front-end kcachegrind.
您可以尝试 cachegrind 及其图形前端 kcachegrind。
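For example, a typical session might look like this (just a sketch; yourExecutable stands for your own binary):
例如,一次典型的使用过程大致如下(仅为示意,yourExecutable 代表您自己的可执行文件):
valgrind --tool=cachegrind ./yourExecutable    # simulates the caches, writes cachegrind.out.<pid>
cg_annotate cachegrind.out.<pid>               # text report of miss counts per function
kcachegrind cachegrind.out.<pid>               # browse the same data graphically
Keep in mind that cachegrind runs your program on a simulated CPU, so expect a large slowdown.
请注意,cachegrind 是在模拟的 CPU 上运行您的程序,所以会明显变慢。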
回答 by Potatoswatter
You could find a tool that accesses the CPU performance counters. There is probably a register in each core that counts L1, L2, etc. misses. Alternatively, Cachegrind performs a cycle-by-cycle simulation.
您可以找一个能访问 CPU 性能计数器的工具。每个内核里大概都有寄存器在统计 L1、L2 等缓存的未命中次数。另一种选择是 Cachegrind,它做的是逐周期的模拟。
However, I don't think that would be insightful. Your proxy objects are presumably modified by their own methods. A conventional profiler will tell you how much time those methods are taking. No profiling tool would tell you how performance would improve without that source of cache pollution. That's a matter of reducing the size and structure of the program's working set, which isn't easy to extrapolate.
不过,我觉得这不会带来多少洞见。您的代理对象大概是由它们自己的方法修改的。传统的分析器会告诉您这些方法花了多少时间,但没有任何分析工具能告诉您:去掉这一缓存污染源之后,性能会提升多少。这需要缩减程序工作集的大小并调整其结构,而这很难事先推算。
A quick Google search turned up boost::intrusive_ptr, which might interest you. It doesn't appear to support something like weak_ptr, but converting your program might be trivial, and then you would know for sure the cost of the non-intrusive ref counts.
随手用谷歌搜索了一下,找到了 boost::intrusive_ptr,您可能会感兴趣。它似乎不支持类似 weak_ptr 的东西,但移植您的程序可能很简单,之后您就能确切知道非侵入式引用计数的开销了。
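For illustration, a minimal sketch of what such a conversion might look like (this assumes Boost is available; Widget and its members are made-up names, not something from the question):
作为示意,下面是这种改造大致的样子(假设可以使用 Boost;Widget 及其成员都是虚构的名字,并非问题中的内容):
#include <boost/intrusive_ptr.hpp>
#include <atomic>

struct Widget {
    std::atomic<int> refcount{0};   // the count lives inside the object itself,
    // ... payload ...              // so dereferencing costs a single pointer hop
};

inline void intrusive_ptr_add_ref(Widget* w) { ++w->refcount; }
inline void intrusive_ptr_release(Widget* w) { if (--w->refcount == 0) delete w; }

boost::intrusive_ptr<Widget> p(new Widget);     // usage: the two hooks above maintain the count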
回答 by Andre Holzner
Linux supports this with perf from kernel 2.6.31 on. This allows you to do the following:
Linux 从内核 2.6.31 起通过 perf 支持这一点。这使您可以执行以下操作:
- compile your code with -g to have debug information included
- run your code, e.g. using the last level cache miss counters: perf record -e LLC-loads,LLC-load-misses yourExecutable
- run perf report
- after acknowledging the initial message, select the LLC-load-misses line,
- then e.g. the first function, and
- then annotate. You should see the lines (in assembly code, surrounded by the original source code) and a number indicating what fraction of the last level cache misses fell on the lines where misses occurred. (A consolidated command sketch follows below.)
- 使用 -g 编译您的代码,以包含调试信息
- 运行您的代码,例如使用最后一级缓存未命中计数器:perf record -e LLC-loads,LLC-load-misses yourExecutable
- 运行 perf report
- 确认初始提示后,选择 LLC-load-misses 那一行,
- 然后选择例如第一个函数,
- 然后选择 annotate。您应该会看到各行代码(汇编形式,周围附有原始源代码),以及一个数字,表示发生缓存未命中的那些行各占最后一级缓存未命中的比例。(完整的命令示例见下文。)
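Putting the steps above together, a minimal session might look like this (a sketch only; yourExecutable and yourProgram.cpp are placeholders, and the exact event names available depend on your CPU and perf version):
把上面的步骤串起来,一次最简会话大致如下(仅为示意;yourExecutable 和 yourProgram.cpp 是占位符,可用的事件名称取决于您的 CPU 和 perf 版本):
g++ -g -O2 yourProgram.cpp -o yourExecutable                  # keep optimizations, but include debug info
perf stat -e cache-references,cache-misses ./yourExecutable   # quick overall miss counts first
perf record -e LLC-loads,LLC-load-misses ./yourExecutable     # sample last level cache events
perf report                                                   # browse the samples and annotate from there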
回答 by Krazy Glew
Continuing along the lines of @Mike_Dunlavey's answer:
顺着 @Mike_Dunlavey 的回答继续说:
First, obtain a time-based profile, using your favorite tool: VTune or PTU or OProf.
首先,用您喜欢的工具获取一份基于时间的性能分析数据(profile):VTune、PTU 或 OProf。
Then, obtain a cache miss profile. L1 cache misses, or L2 cache misses, or ...
然后,获取一份缓存未命中的分析数据:L1 缓存未命中,或 L2 缓存未命中,或……
I.e. the first profile associates a "time spent" with each program counter. The second associates a "number of cache misses" value with each program counter.
也就是说,第一份分析数据把"花费的时间"与每个程序计数器关联起来;第二份把"缓存未命中次数"与每个程序计数器关联起来。
Note: I often "reduce" the data, summing it up by function, or (if I have the technology) by loop. Or by bins of, say, 64 bytes. Comparing individual program counters is often not useful, because the performance counters are fuzzy - the place where you see a cache miss get reported is often several instructions different from where it actually happened.
注意:我经常会先"归并"数据,按函数汇总,或者(如果工具支持)按循环汇总,或者按每 64 字节分桶汇总。逐个比较单独的程序计数器通常没什么用,因为性能计数器本身是模糊的 - 您看到缓存未命中被报告的位置,常常与实际发生的位置相差好几条指令。
OK, so now graph these two profiles to compare them. Here are some graphs that I find useful:
好,现在把这两份分析数据画成图来进行比较。下面是一些我觉得有用的图:
"Iceberg" charts: X axis is PC, positive Y axis is time, negative Y axis is cache misses. Look for places that go both up and down.
"冰山"图:X 轴是 PC,正 Y 轴是时间,负 Y 轴是缓存未命中。寻找向上和向下都很突出的地方。
("Interleaved" charts are also useful: same idea, X axis is PC, plot both time and cache misses on the Y axis, but with narrow vertical lines of different colors, typically red and blue. Places where a lot of both time and cache misses are spent will have finely interleaved red and blue lines, almost looking purple. This extends to L2 and L3 cache misses, all on the same graph. By the way, you probably want to "normalize" the numbers, either to the percentage of total time or cache misses, or, even better, to the percentage of the maximum data point of time or cache misses. If you get the scale wrong, you won't see anything.)
("交错"图也很有用:思路相同,X 轴是 PC,在 Y 轴上同时画出时间和缓存未命中,但用不同颜色的细竖线表示,通常是红色和蓝色。时间和缓存未命中都很多的地方,红蓝细线会密密交错,看上去几乎是紫色的。这可以扩展到 L2、L3 缓存未命中,都画在同一张图上。顺便说一句,您可能需要把数字"归一化":要么归一化为占总时间或总缓存未命中数的百分比,要么(更好)归一化为相对于时间或缓存未命中最大数据点的百分比。比例弄错的话,您将什么都看不出来。)
XY charts: for each sampling bin (PC, or function, or loop, or...) plot a point whose X coordinate is the normalized time, and whose Y coordinate is the normalized cache misses. If you get a lot of data points in the upper right hand corner - large %age time AND large %age cache misses - that is interesting evidence. Or, forget the number of points - if the sum of all percentages in the upper corner is big...
XY 图:对每个采样单元(PC、函数、循环等)画一个点,其 X 坐标是归一化后的时间,Y 坐标是归一化后的缓存未命中数。如果右上角出现大量数据点(时间占比大、缓存未命中占比也大),那就是值得注意的证据。或者,不看点的数量:如果右上角所有百分比之和很大……
Note, unfortunately, that you often have to roll these analyses yourself. Last I checked VTune does not do it for you. I have used gnuplot and Excel. (Warning: Excel dies above 64 thousand data points.)
请注意,不幸的是,这些分析通常得您自己动手做。据我上次了解,VTune 并不会替您做这件事。我用过 gnuplot 和 Excel。(警告:数据点超过 6.4 万个时,Excel 会撑不住。)
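These days, one way to get the two profiles lined up in a single run - not something the author mentions, and flag availability depends on your perf version - is to record an event group:
如今,还有一种在一次运行里就把两份数据对齐的办法(作者并未提到,且选项是否可用取决于您的 perf 版本):记录一个事件组:
perf record -e '{cycles,LLC-load-misses}' ./yourExecutable    # record both events as one group
perf report --group                                           # show both columns side by side per symbol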
More advice:
更多建议:
If your smart pointer is inlined, you may get the counts all over the place. In an ideal world you would be able to trace back PCs to the original line of source code. In this case, you may want to defer the reduction a bit: look at all individual PCs; map them back to lines of source code; and then map those into the original function. Many compilers, e.g. GCC, have symbol table options that allow you to do this.
如果您的智能指针被内联了,计数可能会散落得到处都是。理想情况下,您应该能把各个 PC(程序计数器)追溯回原始的源代码行。这种情况下,您可能需要把归并这一步稍微推后:先查看所有单独的 PC;把它们映射回源代码行;再把这些行映射回原始函数。许多编译器(例如 GCC)提供的符号表选项可以让您做到这一点。
By the way, I suspect that your problem is NOT the smart pointer causing cache thrashing - unless you are doing smart_ptr<int> all over the place. If you are doing smart_ptr<Obj>, and sizeof(Obj) is greater than, say, 4*sizeof(Obj*) (and if the smart_ptr itself is not huge), then it is not that much.
顺便说一句,我怀疑您的问题并不是智能指针造成了缓存抖动,除非您到处都在用 smart_ptr<int>。如果您用的是 smart_ptr<Obj>,而且 sizeof(Obj) 大于比如说 4*sizeof(Obj*)(并且 smart_ptr 本身不算大),那么影响没那么大。
More likely it is the extra level of indirection that the smart pointer does that is causing your problem.
更有可能是智能指针执行的额外间接级别导致了您的问题。
Coincidentally, I was talking to a guy at lunch who had a reference counted smart pointer that was using a handle, i.e. a level of indirection, something like
巧合的是,我午餐时正好和一个人聊过,他写过一个使用句柄(也就是多一层间接)的引用计数智能指针,大致像这样:
template<typename T> class refcnt_handle;        // forward declaration

template<typename T> class refcntptr {
    refcnt_handle<T>* handle;                    // every dereference goes through this extra hop
public:
    refcntptr(T* obj) {
        this->handle = new refcnt_handle<T>();
        this->handle->ptr = obj;
        this->handle->count = 1;
    }
};

template<typename T> class refcnt_handle {
    T* ptr;
    int count;
    friend class refcntptr<T>;
};
(I wouldn't code it this way, but it serves for exposition.)
(我不会这样编码,但它用于说明。)
The double indirection this->handle->ptr can be a big performance problem. Or even a triple indirection, this->handle->ptr->field. At the least, on a machine with 5 cycle L1 cache hits, each this->handle->ptr->field would take 10 cycles, and be much harder to overlap than a single pointer chase. But, worse, if each is an L1 cache miss, even if it were only 20 cycles to the L2... well, it is much harder to hide 2*20=40 cycles of cache miss latency than a single L1 miss.
双重间接 this->handle->ptr 可能是个很大的性能问题,甚至会变成三重间接 this->handle->ptr->field。起码,在 L1 缓存命中需要 5 个周期的机器上,每次 this->handle->ptr->field 就要花 10 个周期,而且比单次指针追踪更难与其他工作重叠。更糟的是,如果每次访问都是 L1 缓存未命中,哪怕到 L2 只要 20 个周期……要隐藏 2*20=40 个周期的缓存未命中延迟,也比隐藏单次 L1 未命中难得多。
In general, it is good advice to avoid levels of indirection in smart pointers. Instead of pointing to a handle, that all smart pointers point to, which itself points to the object, you might make the smart pointer bigger by having it point to the object as well as the handle. (Which then is no longer what is commonly called a handle, but is more like an info object.)
一般来说,避免智能指针里的多级间接是个好建议。与其让所有智能指针都指向一个句柄、再由句柄指向对象,不如把智能指针做得大一些,让它同时指向对象和句柄。(这时它就不再是通常所说的句柄,而更像一个"信息对象"。)
E.g.
例如
template<typename T> class refcnt_info;          // forward declaration

template<typename T> class refcntptr {
    refcnt_info<T>* info;
    T* ptr;                                      // direct pointer to the object: one hop instead of two
public:
    refcntptr(T* obj) {
        this->ptr = obj;
        this->info = new refcnt_info<T>();
        this->info->ptr = obj;
        this->info->count = 1;
    }
};

template<typename T> class refcnt_info {
    T* ptr; // perhaps not necessary, but useful.
    int count;
    friend class refcntptr<T>;
};
Anyway - a time profile is your best friend.
无论如何 - 基于时间的性能分析是您最好的朋友。
Oh, yeah - Intel EMON hardware can also tell you how many cycles you waited at a PC. That can distinguish a large number of L1 misses from a small number of L2 misses.
哦,是的 - 英特尔的 EMON 硬件还能告诉您在某个 PC(程序计数器)处等待了多少个周期。这可以把大量 L1 未命中与少量 L2 未命中区分开。
回答 by Arthur Kalliokoski
If you're running an AMD processor, you can get CodeAnalyst, apparently free as in beer.
如果您用的是 AMD 处理器,可以使用 CodeAnalyst,它看起来是免费提供的。
回答 by Mike Dunlavey
Here's kind of a general answer.
这是一个通用的答案。
For example, if your program is spending, say, 50% of its time on cache misses, then 50% of the time when you pause it, the program counter will be at the exact locations where it is waiting for the memory fetches that are causing the cache misses.
例如,如果您的程序有 50% 的时间耗在缓存未命中上,那么当您暂停它时,有 50% 的概率程序计数器恰好停在它等待那些造成缓存未命中的内存读取的位置上。
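One way to take such pauses by hand is with a debugger; a rough sketch (1234 stands for your process id):
手动进行这种暂停的一种方式是用调试器;大致如下(1234 代表您的进程号):
gdb -p 1234        # attaching stops the process wherever it happens to be
(gdb) bt           # look at the program counter / call stack
(gdb) detach       # let it run again
Locations that keep showing up across a handful of such pauses are where the time (including cache-miss stalls) is going.
在几次这样的暂停中反复出现的位置,就是时间(包括缓存未命中造成的停顿)花掉的地方。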
回答 by Fabien Hure
My advice would be to use PTU (Performance Tuning Utility) from Intel.
我的建议是使用英特尔的 PTU(性能调优实用程序)。
This utility is the direct descendant of VTune and provides the best sampling profiler available. You'll be able to track where the CPU is spending or wasting time (with the help of the available hardware events), and this without slowing down your application or perturbing the profile. And of course you'll be able to gather all the cache line miss events you are looking for.
这个工具是 VTune 的直系后代,提供了目前最好的采样分析器。借助可用的硬件事件,您可以追踪 CPU 把时间花在或浪费在了哪里,而且既不会拖慢您的应用程序,也不会干扰分析结果。当然,您也能收集到您想要的所有缓存行未命中事件。