windows: What can cause a program to run faster the second time?

Question

Asked by Mason Wheeler

Something I've noticed when testing code I write is that long-running operations tend to run much longer the first time a program is run than on subsequent runs, sometimes by a factor of 10 or more. Obviously there's some sort of cold cache/warm cache issue here, but I can't seem to figure out what it is.

It's not the CPU cache, since these long-running operations tend to be loops that I feed a lot of data to, and they should be fully loaded after the first iteration. (Plus, unloading and reloading the program should clear the cache.)

Also, it's not the disc cache. I've ruled that out by loading all data from disc up-front and processing it afterwards, and it's the actual CPU-bound data processing that's going slowly.

So what can cause my program to run slow the first time I run it, but then if I close it and run it again, it runs dramatically faster? I've seen this in several different programs that do very different things, so it seems to be a general issue.

EDIT: For clarification, I'm writing in Delphi, though I don't really think this is a Delphi-specific issue. But that means that whatever the problem is, it's not related to JIT issues, garbage collection issues, or any of the other baggage that managed code brings with it. And I'm not dealing with network connections. This is pure CPU-bound processing.

One example: a script compiler. It runs like this:

Load entire file into memory from disc
Lex the entire file into a queue of tokens
Parse the queue into a tree
Run codegen on the tree to produce bytecode

If I feed it an enormous script file (~100k lines), then after loading the entire thing from disc into memory, the lex step takes about 15 seconds the first time I run it, and 2 seconds on subsequent runs. (And yes, I know that's still a long time. I'm working on that...) I'd like to know where that slowdown is coming from and what I can do about it.

Answer 1

Answered by Eric Grange

Three things to try:

Run it in a sampling profiler, including a "cold" run (first thing after a reboot). Should usually be enough.
Check memory usage: does it grow so high (even transiently) that the OS would have to swap things out of RAM to make room for your app? That alone could be an explanation for what you're seeing. Also look at the amount of free RAM you have when you start your app.
Enable system performance tools and check the I/O counters or file accesses, and make sure under FileMon / Process Explorer that you don't have some file or network accesses you've forgotten about (leftover log/test code).

Answer 2

Answered by Steve314

Even (especially) for very small command-line programs, the issue can be the time it takes to load the process, link to dynamically-linked libraries, etc. I believe modern operating systems avoid repeating a lot of this work if the same program is run twice at once, or repeatedly.

I wouldn't dismiss CPU cache so easily, either. Counting cache levels from zero (so my level 0 is what is usually labelled L1), level 0 cache is very relevant for inner loops, but much less so for a second run of the same application. On my cheap Athlon 2 X4 645 system, there's 64K + 64K (data + instruction) level 0 cache per core - not exactly a huge amount of memory. Level 1 cache is IIRC 512K per core, so it is less likely to be dirtied to complete irrelevance by the O/S code needed to start up a new run of the program, calls to operating system services and standard libraries, etc. Level 2 cache (on CPUs that have it - my Athlon 2 doesn't, IIRC) is larger still, and there may be some even higher level and larger cache provided by the motherboard/chipset.

There's at least one other kind of cache - branch prediction tables. Though I'd have thought they'd be dirtied to irrelevance even quicker than the level 0 cache.

I generally find that unit test programs run many times slower the first time. However, the larger and more complex the program, the less significant the effect.

For some time now, performance of applications has often been considered non-deterministic. Although it isn't strictly true, the performance is determined by so many hard-to-predict factors that it's a good model. For example, if the CPU is a bit warm, the clock speed may be reduced to prevent overheating. And the temperature varies at different parts of the chip, with changes conducting across the chip in complex ways. As changes in clock speed and the different demands of different pieces of code alter the patterns of changing temperature, there's a clear potential for chaotic (as in chaos theory) behaviour.

On some platforms, I wouldn't be surprised if the first run of the program got the processor to run in its "fast" (rather than cool/quiet) mode, and that meant that the beginning of the second run benefitted from that speed boost as well as the end. However, this would be a tricky one - it would have to be a CPU-intensive program, and if your cooling is inadequate, the processor may then slow down again to avoid overheating.

Answer 3

Answered by TMN

I'd guess it's all your libraries/DLLs. These are usually loaded on-demand at run-time, so the first time your program runs the OS will have to read them all from disk. Once read, though, they'll stay loaded unless your system starts running low on memory. So if you run the same program several times in succession, the first run takes the brunt of the load time, and the other runs benefit from the pre-loaded libraries.

Answer 4

Answered by Arnaud Bouchez

I have usually experienced the contrary: for computation-intensive work (if the antivirus is not running), I only see a 5-10% difference between calls. For instance, the 6,000,000 regression tests run for our framework have a very constant running time, and it's very disk- and CPU-intensive work.

I really don't believe it's a CPU cache or pipelining / branch prediction issue either, since both the processed data and the code seem to be consistent, as you wrote. If the antivirus is off, it may be about OS thread settings: did you try changing the process CPU affinity and priority?

This should be very specific to the process you are running. Without any actual source code to reproduce it, it's almost impossible to tell what's happening in your case. How many threads are there? What is the HW configuration (is there any Intel CPU boost involved - are you using a laptop, and what are your energy settings)? Is it using CPU/FPU/MMX/SSE2 (e.g. MMX and FPU do not mix)? Does it move a lot of data, or process some existing data? Does your SW depend on external libraries (even some Windows libraries may need some time to initialize)? How do you use memory (did you try to pre-allocate the memory; or in a multi-threaded application, did you try using a scaling MM instead of FastMM4)?

I think using a sampling profiler may not help so much, since it will change the general CPU core use, but it's worth trying in any case. I'd rather rely on logging profiling - see e.g. this class, or you may write your own timestamps to find where the timing changes in your app.
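The "write your own timestamps" idea could look roughly like the sketch below. It is in Python rather than Delphi purely for illustration; the `timed` helper and the `lex`/`parse` stand-ins are hypothetical names, not part of the asker's compiler or of any logging class mentioned here.

```python
import time
from contextlib import contextmanager

timings = []  # (phase name, elapsed seconds), in execution order


@contextmanager
def timed(phase):
    """Record how long the enclosed block takes, using a monotonic clock."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings.append((phase, time.perf_counter() - start))


# Hypothetical stand-ins for the real pipeline stages.
def lex(text):
    return text.split()


def parse(tokens):
    return [tokens[i:i + 4] for i in range(0, len(tokens), 4)]


source = "let x = 1 ; " * 50_000

with timed("lex"):
    tokens = lex(source)
with timed("parse"):
    tree = parse(tokens)

# One line per phase; compare these between the first run and later runs.
for phase, seconds in timings:
    print(f"{phase:<8}{seconds * 1000:10.2f} ms")
```

Writing the per-phase numbers to a log on every launch should make it easier to see which phase accounts for the first-run penalty, without the overhead a profiler adds.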

AFAIK it has always been said that, when benchmarking, the first run of an application should never be taken into account. Computer systems are so complex nowadays that, the first time, all the internal (SW and HW) plumbing has to be purged - just as you would not drink the first water coming out of your tap when you come back from 1 month of travel. ;)
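A minimal sketch of that benchmarking habit, assuming the program under test can be launched from a script: time the first launch separately and compute statistics only over the later ones. The command used here (a short Python one-liner) is only a placeholder for the real executable, and `time_command`/`benchmark` are illustrative names.

```python
import statistics
import subprocess
import sys
import time


def time_command(cmd):
    """Launch cmd as a fresh process and return its wall-clock duration."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
    return time.perf_counter() - start


def benchmark(cmd, runs=3):
    """Return (first-run time, list of later-run times)."""
    first = time_command(cmd)                      # reported, but not averaged
    later = [time_command(cmd) for _ in range(runs)]
    return first, later


# Placeholder workload; substitute the path of the program being measured.
cmd = [sys.executable, "-c", "print(sum(i * i for i in range(200000)))"]

first, later = benchmark(cmd)
print(f"first run : {first:.3f} s")
print(f"later runs: median {statistics.median(later):.3f} s over {len(later)} runs")
```

Keeping the first-run figure in the output, rather than silently dropping it, also gives a record of how large the cold-start penalty is on a given machine.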

Answer 5

Answered by PatrickvL

Other factors I can think of would be memory alignment (and the subsequent cache line fills). But say there are 2 types, perfect alignment (the fastest) and imperfect alignment (slower): one would expect the slowdown to occur irregularly (depending on how memory is laid out), not only on the first run.

Perhaps it has something to do with physical page layout? As far as I know, each memory-access goes through the MMU page table entries, so dispersed physical pages could be slower than consecutive pages. (Just a wild guess, this one)

Another thing I haven't seen mentioned yet is which core(s) your process is running on - especially on hyper-threaded CPUs, running on the slower of the two cores might have a negative impact. Try setting the processor affinity mask to one and the same core for every run, and see if that impacts the measured runtime differences between first and subsequent runs.

By the way - how do you define 'first run'? Could it be that you've just compiled the executable? In that case (and I'm just guessing again here), some process (either the OS, a virus-scanner, or even some root-kit) might be busy analyzing your executable's behaviour, which might be skipped once the executable has been analyzed before. You could try to prove that by changing some random unimportant byte of your executable between runs, and seeing whether that impacts the runtime negatively again.
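One way to try that experiment without hand-editing the binary is sketched below. Note the substitution: instead of changing a byte in place, it copies the file and appends a single byte, on the assumption that trailing data is ignored by the loader (this may not hold for signed executables). The file names are hypothetical and the demo uses a dummy file rather than a real program.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256(path):
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def make_variant(original, variant):
    """Copy original to variant and append one byte so its hash differs."""
    shutil.copyfile(original, variant)
    with open(variant, "ab") as f:
        f.write(b"\0")
    return Path(variant)


# Demo on a dummy stand-in file; point `original` at the real executable instead.
workdir = Path(tempfile.mkdtemp())
original = workdir / "app.bin"
original.write_bytes(b"MZ" + bytes(1024))

variant = make_variant(original, workdir / "app_variant.bin")
print("hashes differ:", sha256(original) != sha256(variant))
```

If the freshly made variant is slow on its first launch while the already-run original stays fast, that would point towards something keyed on the file's identity, such as a scanner, rather than towards CPU or disk caches.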

Please post a summary once you've figured out the cause(s) - this might help others too. Cheers!

Answer 6

Answered by Ken Bourassa

Just a random guess...

Does your processor support adaptive frequency? Maybe it's just the processor that doesn't have time to adapt its frequency on the first run, and is running full speed on second one.

Answer 7

Answered by MusiGenesis

There are lots of things that can cause this. Just as one example: if you're using ADO.NET for data access with connection pooling turned on (which is the default), the first time your application runs it will take the hit of creating the database connection. When your app is closed, the connection is maintained in its open state by ADO.NET, so the next time your app runs and does data access it will not have to take the hit of instantiating the connection, and thus will appear faster.

Answer 8

Answered by Peter

I'm guessing you're using .NET; if I'm wrong, you can ignore most of my ideas...

Connection pooling, JIT compilation, reflection, IO caching... the list goes on and on.

Try testing smaller portions of the code to see what parts change performance the most...

You could try ngen'ing your assemblies as this removes the JIT compilation.

Answer 9

Answered by TridenT

where that slowdown is coming from and what I can do about it.

I would say that the quicker execution on subsequent runs can come from caching:

Disk internal cache (8MB or more)
Windows application/dependencies (such as DLLs)/core cache
CPU cache L3 (or L2 if some programming loops are small enough)

So you see that the first time, you do not benefit from these caching systems.
