Linux: How can I cause an instruction cache miss?

Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you reuse it, you must do so under the same license and attribute the original authors (not me). Original question: http://stackoverflow.com/questions/10588567/

How can I cause an instruction cache miss?

Tags: c, linux, performance, microprocessors

Asked by William the Pleaser

I've been tasked with generating a certain number of data-cache misses and instruction-cache misses. I've been able to handle the data-cache portion without issue.

So I'm left with generating the instruction-cache misses. I do not have any idea what causes these. Can someone suggest a method of generating them?

I'm using GCC in Linux.

Answered by selalerer

A chain of if/else statements on unpredictable conditions (e.g. input or randomly generated data), where the code in both the if case and the else case is larger than a cache line.

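A minimal C sketch of that idea is shown below (an assumption, not from the original answer: each branch body is a run of cheap operations intended to compile to more than a typical 64-byte cache line; check the generated code with gcc -S on your target):

#include <stdlib.h>

/* The conditions depend on runtime input, so the branches are
 * unpredictable; volatile keeps the compiler from folding away the
 * padding arithmetic in each branch body. */
int branchy(int x)
{
    volatile int n = x;
    if (x & 1) {
        n += 3;  n *= 7;  n ^= 0x1234; n -= 9;  n *= 5;  n ^= 0x0F0F;
        n += 11; n *= 13; n ^= 0x55AA; n -= 17; n *= 3;  n ^= 0x7700;
    } else {
        n -= 4;  n *= 11; n ^= 0x4321; n += 8;  n *= 9;  n ^= 0xF0F0;
        n -= 12; n *= 15; n ^= 0xAA55; n += 16; n *= 7;  n ^= 0x0077;
    }
    if (x & 2) {
        n += 19; n *= 23; n ^= 0x1111; n -= 21; n *= 29; n ^= 0x2222;
    } else {
        n -= 19; n *= 31; n ^= 0x3333; n += 21; n *= 37; n ^= 0x4444;
    }
    return n;
}

int main(void)
{
    int acc = 0;
    for (int i = 0; i < 1000000; ++i)
        acc += branchy(rand());   /* unpredictable input drives the branches */
    return acc & 1;
}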

Answered by AShelly

For instruction cache misses, you need to execute code segments that are far apart. Splitting your logic among multiple function calls would be one way to do that.

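One hedged sketch of that (GCC-specific, x86-only because 0x90 is the single-byte NOP, and untested) keeps each piece of logic out of line and pads it with a few kilobytes of NOPs so consecutive calls land far apart in the text segment; in a real test you would need enough such functions that their total size exceeds the i-cache:

/* Each helper is forced out of line and padded with 8 KB of single-byte
 * NOPs emitted directly into .text, so the helpers cannot share i-cache
 * lines with each other or with their caller. */
__attribute__((noinline)) static int stepA(int x)
{
    __asm__ volatile(".skip 8192, 0x90");   /* 8 KB of executed NOPs */
    return x + 1;
}

__attribute__((noinline)) static int stepB(int x)
{
    __asm__ volatile(".skip 8192, 0x90");
    return x * 3;
}

int split_logic(int x)
{
    /* Bouncing between widely separated functions means each call is
     * likely to touch lines the previous call did not keep warm. */
    return stepB(stepA(x));
}

int main(void)
{
    return split_logic(1) & 1;
}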

Answered by gbulmer

As people have explained, an instruction cache miss is conceptually the same as a data-cache miss - the instructions are not in the cache. This is because the processor's program counter (PC) has jumped to a place which hasn't been loaded into the cache, or has been flushed out because the cache got filled, and that cache line was the one chosen for eviction (usually least recently used).

It is a bit harder to generate enough code by hand to force an instruction miss than it is to force a data cache miss.

One way to get lots of code, for little effort, is to write a program which generates source code.

For example, write a program to generate a function with a huge switch statement (in C) [Warning, untested]:

printf("void bigswitch(int n) {\n    switch (n) {");
for (int i=1; i<100000; ++i) {
    printf("        case %d: n += %d;\n", n, n+i/2);
}
printf("    }\n    return n;}\n");

Then you can call this from another function, and by choosing the argument you can control how big a jump across cache lines each call takes.

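A minimal caller sketch (hypothetical; it assumes the generated bigswitch() has been written out to its own .c file and linked in):

#include <stdlib.h>

int bigswitch(int n);   /* the generated function, built separately */

int exercise(int iterations)
{
    int acc = 0;
    for (int i = 0; i < iterations; ++i)
        acc += bigswitch(rand() % 100000);   /* a random case, so a random jump */
    return acc;
}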

A property of a switch statement is that the code can be forced to execute backwards, or in arbitrary patterns, by choosing the parameter. So you can work with the pre-fetching and prediction mechanisms, or try to work against them.

The same technique could be applied to generate lots of functions too, to ensure the cache can be 'busted' at will. So you may have bigswitch001, bigswitch002, etc. You might call these using a switch which you also generate.

If you can make each function (approximately) some number of i-cache lines in size, and also generate more functions than will fit in cache, then the problem of generating instruction cache-misses becomes easier to control.

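A sketch of that variant (untested; the counts are assumptions, chosen so the emitted code comfortably exceeds a typical i-cache) wraps the earlier generator in another loop and also emits a dispatcher:

#include <stdio.h>

#define NUM_FUNCS 64   /* assumed; pick enough that the total code > i-cache */

int main(void)
{
    /* Emit bigswitch000 .. bigswitch063, each a few KB of code. */
    for (int f = 0; f < NUM_FUNCS; ++f) {
        printf("int bigswitch%03d(int n) {\n    switch (n) {\n", f);
        for (int i = 1; i < 1000; ++i)
            printf("        case %d: n += %d; break;\n", i, i / 2);
        printf("    }\n    return n;\n}\n\n");
    }

    /* Emit a dispatcher that picks one of the generated functions. */
    printf("int dispatch(int which, int n) {\n    switch (which) {\n");
    for (int f = 0; f < NUM_FUNCS; ++f)
        printf("        case %d: return bigswitch%03d(n);\n", f, f);
    printf("    }\n    return n;\n}\n");
    return 0;
}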

You can see exactly how big a function, an entire switch statement, or each leg of a switch statement is by dumping the assembler (using gcc -S) or by running objdump on the .o file. So you could 'tune' the size of a function by adjusting the number of case: statements. You could also choose how many cache lines are hit by judicious choice of the parameter to bigswitchNNN().

Answered by Crashworks

In addition to all the other ways mentioned here, another very reliable way to force an instruction cache miss is to have self-modifying code.

If you write to a page of code in memory (assuming you configured the OS to permit this), then of course the corresponding line of instruction cache immediately becomes invalid, and the processor is forced to refetch it.

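A minimal Linux/x86-64 sketch of that (untested, and it assumes the system permits a read-write-execute mapping; W^X policies or SELinux may make the mprotect call fail) patches a tiny function in place, which invalidates the i-cache line holding it:

#define _GNU_SOURCE
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

__attribute__((noinline)) static int dummy(void) { return 42; }

int main(void)
{
    long page = sysconf(_SC_PAGESIZE);
    uintptr_t start = (uintptr_t)dummy & ~(uintptr_t)(page - 1);

    /* Make the page holding dummy() writable as well as executable. */
    if (mprotect((void *)start, (size_t)page,
                 PROT_READ | PROT_WRITE | PROT_EXEC) != 0) {
        perror("mprotect");
        return 1;
    }

    /* Overwrite the body with "mov eax, 7; ret" (x86-64 machine code). */
    unsigned char patch[] = { 0xB8, 0x07, 0x00, 0x00, 0x00, 0xC3 };
    memcpy((void *)(uintptr_t)dummy, patch, sizeof patch);

    /* Call through a volatile pointer so the compiler cannot assume it
     * still knows what dummy() returns. */
    int (* volatile fp)(void) = dummy;
    printf("dummy() now returns %d\n", fp());   /* expect 7 */
    return 0;
}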

It is not branch prediction that causes an icache miss, by the way, but simply branching. You miss instruction cache whenever the processor tries to run an instruction that has not recently been run. Modern x86 is smart enough to prefetch instructions in sequence, so you are very unlikely to miss icache by just ordinary walking forward from one instruction to the next. But any branch (conditional or otherwise) jumps to a new address out of sequence. If the new instruction address hasn't been run recently, and isn't near the code you were already running, it is likely to be out of cache, and the processor must stop and wait for the instructions to come in from main RAM. This is exactly like data cache.

Some very modern processors (recent i7) are able to look at upcoming branches in code and start the icache prefetching the possible targets, but many cannot (video game consoles). Fetching data from main RAM to icache is totally different from the "instruction fetching" stage of the pipeline, which is what branch prediction is about.

"Instruction fetch" is part of the CPU's execution pipeline, and refers to bringing an opcode from icache into the CPU's execution unit, where it can start decoding and doing work. That is different from "instruction cache" fetching, which must happen many cycles earlier and involves the cache circuitry making a request to the main memory unit to send some bytes across the bus. The first is an interaction between two stages of the CPU pipeline. The second is an interaction between the pipeline and the memory cache and main RAM, which is a much more complicated piece of circuitry. The names are confusingly similar, but they're totally separate operations.

So one other way to cause instruction cache misses would be to write (or generate) lots of really big functions, so that your code segment is huge. Then call wildly from one function to another, so that from the CPU's point of view you are doing crazy GOTOs all over memory.

Answered by phonetagger

Your project requires an awareness of your target system's cache hardware, including but not limited to its cache size (the overall size of the cache), cache line size (smallest cacheable entity), associativity, and write & replacement policies. Any really good algorithm designed to test a cache's performance must take all of this into account, as there is no single general algorithm that would effectively test all cache configurations, though you may be able to design an effective parameterized test routine generator, which might generate a suitable test routine given enough of the particulars about a given target's cache architecture. Despite this, I think my suggestion below is a pretty good general-case test, but first I wanted to mention:

You mention that you have a working data cache test that uses a “large integer array a[100].... [which accesses] the elements in such a way that the distance between the two elements is greater than the cache-line size (32 bytes in my case).” I am curious how you've determined that your test algorithm works and how you've determined how many data cache misses are a result of your algorithm, as opposed to misses caused by other stimuli. Indeed, with a test array of 100*sizeof(int), your test data area is only 400 bytes long on most general-purpose platforms today (perhaps 800 bytes if you're on a 64-bit platform, or 200 bytes if you're using a 16-bit platform). For the vast majority of cache architectures, that entire test array will fit into the cache many times over, meaning that randomized accesses to the array will bring the entire array into the cache in somewhere around (400/cache_line_size)*2 accesses, and every access after that will be a cache hit regardless of how you order your accesses, unless some hardware or OS tick timer interrupt pops in and flushes out some or all of your cached data.

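For comparison, a data-cache test whose working set actually exceeds the cache might look like the sketch below (the 32 KB cache and 64-byte line size are assumptions; substitute your target's real figures):

#include <stddef.h>
#include <stdio.h>

#define CACHE_BYTES (32 * 1024)              /* assumed L1d size   */
#define LINE_BYTES  64                       /* assumed line size  */
#define N ((8 * CACHE_BYTES) / sizeof(int))  /* 8x the cache       */

static int a[N];

static long sweep(void)
{
    long sum = 0;
    /* Stride of one line over an array 8x the cache: every access on the
     * first pass is a cold miss, and because the array is much larger
     * than the cache, repeated passes keep missing too. */
    for (size_t i = 0; i < N; i += LINE_BYTES / sizeof(int))
        sum += a[i];
    return sum;
}

int main(void)
{
    long total = 0;
    for (int pass = 0; pass < 1000; ++pass)
        total += sweep();
    printf("%ld\n", total);
    return 0;
}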

With regard to the instruction cache: Others have suggested using a large switch()-case statement or function calls to functions in disparate locations, neither of which would be predictably effective without carefully (and I mean CAREFULLY) designing the size of the code in the respective case branches or locations & sizes of the disparately-located functions. The reason for this is that bytes throughout memory “fold into” (technically, “alias one another” in) the cache in a totally predictable pattern. If you carefully control the number of instructions in each branch of a switch()-case statement, you might be able to get somewhere with your test, but if you just throw a large indiscriminate amount of instructions in each, you have no idea how they will fold into the cache and which cases of the switch()-case statement alias each other in order to use them to evict each other out of the cache.

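To make the "folding" concrete, here is a tiny worked example under assumed parameters (32 KB, 4-way, 64-byte lines): two addresses land in the same set, and can therefore evict one another, exactly when they differ by a multiple of cache_size / associativity = 8 KB.

#include <stdint.h>
#include <stdio.h>

#define CACHE_BYTES   (32 * 1024)
#define LINE_BYTES    64
#define ASSOCIATIVITY 4
#define NUM_SETS      (CACHE_BYTES / (LINE_BYTES * ASSOCIATIVITY))  /* 128 */

static unsigned set_index(uintptr_t addr)
{
    return (unsigned)((addr / LINE_BYTES) % NUM_SETS);
}

int main(void)
{
    uintptr_t a = 0x400000;           /* hypothetical code address       */
    uintptr_t b = a + 8 * 1024;       /* one "way" (8 KB) further along  */
    printf("set(a)=%u set(b)=%u\n", set_index(a), set_index(b));  /* equal */
    return 0;
}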

I'm guessing you're not overly familiar with assembly code, but you've gotta believe me here, this project is screaming for it. Trust me, I'm not one to use assembly code where it's not called for, and I strongly prefer programming in OO C++, using STL & polymorphic ADT hierarchies whenever possible. But in your case, there's really no other foolproof way of doing it, and assembly will give you the absolute control over code block sizes that you really need in order to be able to effectively generate specified cache hit ratios. You wouldn't have to become an assembly expert, and you probably wouldn't even need to learn the instructions & structure required to implement a C-language prologue & epilogue (Google for “C-callable assembly function”). You write some extern “C” function prototypes for your assembly functions, and away you go. If you do care to learn some assembly, the more of the test logic you put in the assembly functions, the less of a “Heisenberg effect” you impose on your test, since you can carefully control where the test control instructions go (and thus their effect on the instruction cache). But for the bulk of your test code, you can just use a bunch of “nop” instructions (the instruction cache doesn't really care what instructions it contains), and probably just put your processor's "return" instruction at the bottom of each block of code.

Now let's say your instruction cache is 32K (pretty darn small by today's standards, but perhaps still common in many embedded systems). If your cache is 4-way associative, you can create eight separate content-identical 8K assembly functions (which you hopefully noticed is 64K worth of code, twice the size of the cache), the bulk of which is just a bunch of NOP instructions. You make them all fall one after the other in memory (generally by simply defining each one-after-the-other in the source file). Then you call them from a test control function using carefully computed sequences to generate any cache hit ratio you desire (with rather coarse granularity since the functions are each a full 8K long). If you call the 1st, 2nd, 3rd, and 4th functions one after another, you know you've filled the entire cache with those test functions' code. Calling any of those again at this point will not result in an instruction cache miss (with the exception of lines evicted by the test control function's own instructions), but calling any of the other (5th, 6th, 7th, or 8th; let's just choose the 5th) will evict one of the others (though which one is evicted depends on your cache's replacement policy). At this point, the only one you can call and know you WON'T evict another is the one you just called (the 5th one), and the only ones you can call and know you WILL evict another is one you haven't yet called (the 6th, 7th, or 8th). To make this easier, just maintain a static array sized the same as the number of test functions you have. To trigger an eviction, call the function at the end of the array & move its pointer to the top of the array, shifting the others down. To NOT trigger an eviction, call the one you most recently called (the one at the top of the array; be sure to NOT shift the others down in this case!). Do some variations on this (perhaps make 16 separate 4K assembly functions) if you need finer granularity. Of course all of this depends on the test control logic size being insignificant in comparison to the size of each associative “way” of the cache; for more positive control, you could put the test control logic in the test functions themselves, but for perfect control you'd have to design the control logic entirely without internal branching (only branching at the end of each assembly function), but I think I'll stop here since that's probably over-complicating things.

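A C-level sketch of that bookkeeping is shown below (hypothetical: the empty placeholder functions stand in for the eight 8K C-callable assembly routines, and real measurements would come from performance counters, not from this skeleton):

/* Placeholders standing in for the eight 8K assembly functions. */
static void f1(void) {}
static void f2(void) {}
static void f3(void) {}
static void f4(void) {}
static void f5(void) {}
static void f6(void) {}
static void f7(void) {}
static void f8(void) {}

#define NUM_FUNCS 8
typedef void (*testfn)(void);

/* Initially arbitrary; becomes ordered most-recently-called first as the
 * test runs. */
static testfn funcs[NUM_FUNCS] = { f1, f2, f3, f4, f5, f6, f7, f8 };

/* Expected miss: call the least-recently-called function (end of the
 * array), then rotate it to the front so the ordering stays by recency. */
static void call_expect_miss(void)
{
    testfn victim = funcs[NUM_FUNCS - 1];
    for (int i = NUM_FUNCS - 1; i > 0; --i)
        funcs[i] = funcs[i - 1];
    funcs[0] = victim;
    victim();
}

/* Expected hit: re-call the most recently called function (front). */
static void call_expect_hit(void)
{
    funcs[0]();
}

int main(void)
{
    for (int i = 0; i < NUM_FUNCS; ++i)
        call_expect_miss();     /* touch every function once */
    call_expect_hit();          /* should still be resident  */
    call_expect_miss();         /* should force an eviction  */
    return 0;
}
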
Off-the-cuff & not tested, the entirety of one of the assembly functions for x86 might look like this:

myAsmFunc1:
   nop
   nop
   nop  # ...exactly enough NOPs to fill one "way" of the cache
   nop  # minus however many bytes a "ret" instruction is (1?)
   .
   .
   .
   nop
   ret  # return to the caller

For PowerPC it might look like this (also untested):

myAsmFunc1:
   nop
   nop
   nop   # ...exactly enough NOPs to fill one "way" of the cache
   .     # minus 4 bytes for the "blr" instruction.  Note that
   .     # on PPC, all instructions (including NOP) are 4 bytes.
   .
   nop
   blr   # return to the caller

In both cases, the C++ and C prototypes for calling these functions would be:

extern "C" void myAsmFunc1();    // Prototype for calling from C++ code
void myAsmFunc1(void);           /* Prototype for calling from C code */

Depending on your compiler, you might need an underscore in front of the function name in the assembly code itself (but not in your C++/C function prototype).
