Try-catch speeding up my code?
Disclaimer: this page reproduces a popular StackOverflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL, and attribute it to the original authors (not me): StackOverflow.
Original URL: http://stackoverflow.com/questions/8928403/
Asked by Eren Ersönmez
I wrote some code to test the impact of try-catch, but I'm seeing some surprising results.
static void Main(string[] args)
{
    Thread.CurrentThread.Priority = ThreadPriority.Highest;
    Process.GetCurrentProcess().PriorityClass = ProcessPriorityClass.RealTime;
    long start = 0, stop = 0, elapsed = 0;
    double avg = 0.0;
    long temp = Fibo(1);
    for (int i = 1; i < 100000000; i++)
    {
        start = Stopwatch.GetTimestamp();
        temp = Fibo(100);
        stop = Stopwatch.GetTimestamp();
        elapsed = stop - start;
        avg = avg + ((double)elapsed - avg) / i;
    }
    Console.WriteLine("Elapsed: " + avg);
    Console.ReadKey();
}
static long Fibo(int n)
{
    long n1 = 0, n2 = 1, fibo = 0;
    n++;
    for (int i = 1; i < n; i++)
    {
        n1 = n2;
        n2 = fibo;
        fibo = n1 + n2;
    }
    return fibo;
}
On my computer, this consistently prints out a value around 0.96...
When I wrap the for loop inside Fibo() with a try-catch block like this:
static long Fibo(int n)
{
    long n1 = 0, n2 = 1, fibo = 0;
    n++;
    try
    {
        for (int i = 1; i < n; i++)
        {
            n1 = n2;
            n2 = fibo;
            fibo = n1 + n2;
        }
    }
    catch {}
    return fibo;
}
Now it consistently prints out 0.69... -- it actually runs faster! But why?
Note: I compiled this using the Release configuration and directly ran the EXE file (outside Visual Studio).
EDIT: Jon Skeet's excellent analysis shows that try-catch is somehow causing the x86 CLR to use the CPU registers in a more favorable way in this specific case (and I think we're yet to understand why). I confirmed Jon's finding that the x64 CLR doesn't have this difference, and that it was faster than the x86 CLR. I also tested using int types inside the Fibo method instead of long types, and then the x86 CLR was just as fast as the x64 CLR.
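For reference, the int-based variant used for that last test would look roughly like the sketch below (the method name FiboInt is mine, not part of the original code):
// Hypothetical int-based variant of the method above. With 32-bit locals the
// x86 JIT no longer needs a register pair per variable, so the try-catch
// wrapper makes no measurable difference in this case.
static int FiboInt(int n)
{
    int n1 = 0, n2 = 1, fibo = 0;
    n++;
    for (int i = 1; i < n; i++)
    {
        n1 = n2;
        n2 = fibo;
        fibo = n1 + n2;
    }
    return fibo;
}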
UPDATE: It looks like this issue has been fixed by Roslyn. Same machine, same CLR version -- the issue remains as above when compiled with VS 2013, but the problem goes away when compiled with VS 2015.
Accepted answer by Eric Lippert
One of the Roslyn engineers who specializes in understanding optimization of stack usage took a look at this and reports to me that there seems to be a problem in the interaction between the way the C# compiler generates local variable stores and the way the JIT compiler does register scheduling in the corresponding x86 code. The result is suboptimal code generation on the loads and stores of the locals.
For some reason unclear to all of us, the problematic code generation path is avoided when the JITter knows that the block is in a try-protected region.
This is pretty weird. We'll follow up with the JITter team and see whether we can get a bug entered so that they can fix this.
Also, we are working on improvements for Roslyn to the C# and VB compilers' algorithms for determining when locals can be made "ephemeral" -- that is, just pushed and popped on the stack, rather than allocated a specific location on the stack for the duration of the activation. We believe that the JITter will be able to do a better job of register allocation and whatnot if we give it better hints about when locals can be made "dead" earlier.
Thanks for bringing this to our attention, and apologies for the odd behaviour.
Answered by Jon Skeet
Well, the way you're timing things looks pretty nasty to me. It would be much more sensible to just time the whole loop:
var stopwatch = Stopwatch.StartNew();
for (int i = 1; i < 100000000; i++)
{
    Fibo(100);
}
stopwatch.Stop();
Console.WriteLine("Elapsed time: {0}", stopwatch.Elapsed);
That way you're not at the mercy of tiny timings, floating point arithmetic and accumulated error.
Having made that change, see whether the "non-catch" version is still slower than the "catch" version.
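For instance, a minimal comparison harness along these lines would do (FiboWithTry and FiboWithoutTry are hypothetical names for the two versions of the method from the question):
// Sketch: time the whole loop for each variant instead of each individual call.
static void TimeBoth()
{
    var sw = Stopwatch.StartNew();
    for (int i = 1; i < 100000000; i++)
    {
        FiboWithoutTry(100);   // the original method
    }
    sw.Stop();
    Console.WriteLine("Without try/catch: {0}", sw.Elapsed);

    sw = Stopwatch.StartNew();
    for (int i = 1; i < 100000000; i++)
    {
        FiboWithTry(100);      // the version with the loop wrapped in try/catch
    }
    sw.Stop();
    Console.WriteLine("With try/catch:    {0}", sw.Elapsed);
}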
EDIT: Okay, I've tried it myself - and I'm seeing the same result. Very odd. I wondered whether the try/catch was disabling some bad inlining, but using [MethodImpl(MethodImplOptions.NoInlining)] instead didn't help...
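That test amounts to roughly the following sketch, applying the attribute to the original method (it needs a using System.Runtime.CompilerServices; directive at the top of the file):
// Sketch: forbid the JIT from inlining the method, to rule inlining out as the cause.
[MethodImpl(MethodImplOptions.NoInlining)]
static long Fibo(int n)
{
    long n1 = 0, n2 = 1, fibo = 0;
    n++;
    for (int i = 1; i < n; i++)
    {
        n1 = n2;
        n2 = fibo;
        fibo = n1 + n2;
    }
    return fibo;
}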
Basically you'll need to look at the optimized JITted code under cordbg, I suspect...
EDIT: A few more bits of information:
- Putting the try/catch around just the n++; line still improves performance, but not by as much as putting it around the whole block (see the sketch after this list)
- If you catch a specific exception (ArgumentException in my tests) it's still fast
- If you print the exception in the catch block it's still fast
- If you rethrow the exception in the catch block it's slow again
- If you use a finally block instead of a catch block it's slow again
- If you use a finally block as well as a catch block, it's fast
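As a rough sketch (using the same method body as in the question, with hypothetical method names), the first and last of those variants look something like this:
// Variant: try/catch around just the increment.
static long FiboTryIncrement(int n)
{
    long n1 = 0, n2 = 1, fibo = 0;
    try { n++; } catch {}
    for (int i = 1; i < n; i++)
    {
        n1 = n2;
        n2 = fibo;
        fibo = n1 + n2;
    }
    return fibo;
}
// Variant: finally block as well as a catch block (still fast).
static long FiboTryCatchFinally(int n)
{
    long n1 = 0, n2 = 1, fibo = 0;
    n++;
    try
    {
        for (int i = 1; i < n; i++)
        {
            n1 = n2;
            n2 = fibo;
            fibo = n1 + n2;
        }
    }
    catch {}
    finally {}
    return fibo;
}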
Weird...
EDIT: Okay, we have disassembly...
This is using the C# 2 compiler and .NET 2 (32-bit) CLR, disassembling with mdbg (as I don't have cordbg on my machine). I still see the same performance effects, even under the debugger. The fast version uses a try block around everything between the variable declarations and the return statement, with just a catch{} handler. Obviously the slow version is the same except without the try/catch. The calling code (i.e. Main) is the same in both cases, and has the same assembly representation (so it's not an inlining issue).
Disassembled code for fast version:
[0000] push ebp
[0001] mov ebp,esp
[0003] push edi
[0004] push esi
[0005] push ebx
[0006] sub esp,1Ch
[0009] xor eax,eax
[000b] mov dword ptr [ebp-20h],eax
[000e] mov dword ptr [ebp-1Ch],eax
[0011] mov dword ptr [ebp-18h],eax
[0014] mov dword ptr [ebp-14h],eax
[0017] xor eax,eax
[0019] mov dword ptr [ebp-18h],eax
*[001c] mov esi,1
[0021] xor edi,edi
[0023] mov dword ptr [ebp-28h],1
[002a] mov dword ptr [ebp-24h],0
[0031] inc ecx
[0032] mov ebx,2
[0037] cmp ecx,2
[003a] jle 00000024
[003c] mov eax,esi
[003e] mov edx,edi
[0040] mov esi,dword ptr [ebp-28h]
[0043] mov edi,dword ptr [ebp-24h]
[0046] add eax,dword ptr [ebp-28h]
[0049] adc edx,dword ptr [ebp-24h]
[004c] mov dword ptr [ebp-28h],eax
[004f] mov dword ptr [ebp-24h],edx
[0052] inc ebx
[0053] cmp ebx,ecx
[0055] jl FFFFFFE7
[0057] jmp 00000007
[0059] call 64571ACB
[005e] mov eax,dword ptr [ebp-28h]
[0061] mov edx,dword ptr [ebp-24h]
[0064] lea esp,[ebp-0Ch]
[0067] pop ebx
[0068] pop esi
[0069] pop edi
[006a] pop ebp
[006b] ret
Disassembled code for slow version:
[0000] push ebp
[0001] mov ebp,esp
[0003] push esi
[0004] sub esp,18h
*[0007] mov dword ptr [ebp-14h],1
[000e] mov dword ptr [ebp-10h],0
[0015] mov dword ptr [ebp-1Ch],1
[001c] mov dword ptr [ebp-18h],0
[0023] inc ecx
[0024] mov esi,2
[0029] cmp ecx,2
[002c] jle 00000031
[002e] mov eax,dword ptr [ebp-14h]
[0031] mov edx,dword ptr [ebp-10h]
[0034] mov dword ptr [ebp-0Ch],eax
[0037] mov dword ptr [ebp-8],edx
[003a] mov eax,dword ptr [ebp-1Ch]
[003d] mov edx,dword ptr [ebp-18h]
[0040] mov dword ptr [ebp-14h],eax
[0043] mov dword ptr [ebp-10h],edx
[0046] mov eax,dword ptr [ebp-0Ch]
[0049] mov edx,dword ptr [ebp-8]
[004c] add eax,dword ptr [ebp-1Ch]
[004f] adc edx,dword ptr [ebp-18h]
[0052] mov dword ptr [ebp-1Ch],eax
[0055] mov dword ptr [ebp-18h],edx
[0058] inc esi
[0059] cmp esi,ecx
[005b] jl FFFFFFD3
[005d] mov eax,dword ptr [ebp-1Ch]
[0060] mov edx,dword ptr [ebp-18h]
[0063] lea esp,[ebp-4]
[0066] pop esi
[0067] pop ebp
[0068] ret
In each case the * shows where the debugger entered in a simple "step-into".
EDIT: Okay, I've now looked through the code and I think I can see how each version works... and I believe the slower version is slower because it uses fewer registers and more stack space. For small values of n that's possibly faster - but when the loop takes up the bulk of the time, it's slower.
Possibly the try/catch block forces more registers to be saved and restored, so the JIT uses those for the loop as well... which happens to improve the performance overall. It's not clear whether it's a reasonable decision for the JIT to not use as many registers in the "normal" code.
EDIT: Just tried this on my x64 machine. The x64 CLR is much faster (about 3-4 times faster) than the x86 CLR on this code, and under x64 the try/catch block doesn't make a noticeable difference.
Answered by Jeffrey Sax
Jon's disassemblies show that the difference between the two versions is that the fast version uses a pair of registers (esi, edi) to store one of the local variables, where the slow version doesn't.
The JIT compiler makes different assumptions regarding register use for code that contains a try-catch block vs. code which doesn't. This causes it to make different register allocation choices. In this case, this favors the code with the try-catch block. Different code may lead to the opposite effect, so I would not count this as a general-purpose speed-up technique.
In the end, it's very hard to tell which code will end up running the fastest. Something like register allocation and the factors that influence it are such low-level implementation details that I don't see how any specific technique could reliably produce faster code.
For example, consider the following two methods. They were adapted from a real-life example:
interface IIndexed { int this[int index] { get; set; } }

struct StructArray : IIndexed {
    public int[] Array;
    public int this[int index] {
        get { return Array[index]; }
        set { Array[index] = value; }
    }
}

static int Generic<T>(int length, T a, T b) where T : IIndexed {
    int sum = 0;
    for (int i = 0; i < length; i++)
        sum += a[i] * b[i];
    return sum;
}

static int Specialized(int length, StructArray a, StructArray b) {
    int sum = 0;
    for (int i = 0; i < length; i++)
        sum += a[i] * b[i];
    return sum;
}
One is a generic version of the other. Replacing the generic type with StructArray would make the methods identical. Because StructArray is a value type, it gets its own compiled version of the generic method. Yet the actual running time is significantly longer than the specialized method's, but only for x86. For x64, the timings are pretty much identical. In other cases, I've observed differences for x64 as well.
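A minimal sketch of how one might compare the two (the array size, iteration count, and names here are my own, not from the answer):
// Sketch: time the generic instantiation against the hand-specialized method.
static void CompareGenericVsSpecialized()
{
    var data = new StructArray { Array = new int[1000] };

    var sw = Stopwatch.StartNew();
    for (int i = 0; i < 100000; i++)
        Generic(data.Array.Length, data, data);   // T inferred as StructArray
    sw.Stop();
    Console.WriteLine("Generic<StructArray>: {0}", sw.Elapsed);

    sw = Stopwatch.StartNew();
    for (int i = 0; i < 100000; i++)
        Specialized(data.Array.Length, data, data);
    sw.Stop();
    Console.WriteLine("Specialized: {0}", sw.Elapsed);
}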
Answered by miller the gorilla
I'd have put this in as a comment, as I'm really not certain that this is likely to be the case, but as I recall, doesn't a try/catch statement involve a modification to the way the compiler's garbage disposal mechanism works, in that it clears up object memory allocations in a recursive way off the stack? There may not be an object to be cleared up in this case, or the for loop may constitute a closure that the garbage collection mechanism recognises as sufficient to enforce a different collection method. Probably not, but I thought it worth a mention as I hadn't seen it discussed anywhere else.
Answered by Hans Passant
This looks like a case of inlining gone bad. On an x86 core, the jitter has the ebx, edx, esi and edi registers available for general-purpose storage of local variables. The ecx register becomes available in a static method, since it doesn't have to store this. The eax register is often needed for calculations. But these are 32-bit registers, so for variables of type long it must use a pair of registers: edx:eax for calculations and edi:ebx for storage.
Which is what stands out in the disassembly for the slow version: neither edi nor ebx is used.
When the jitter can't find enough registers to store local variables, it must generate code to load and store them from the stack frame. That slows down the code, and it prevents a processor optimization named "register renaming", an internal processor core optimization trick that uses multiple copies of a register and allows super-scalar execution, permitting several instructions to run concurrently even when they use the same register. Not having enough registers is a common problem on x86 cores, addressed in x64, which has 8 extra registers (r8 through r15).
The jitter will do its best to apply another code generation optimization: it will try to inline your Fibo() method. In other words, not make a call to the method but generate the code for the method inline in the Main() method. A pretty important optimization that, for one, makes properties of a C# class essentially free, giving them the perf of a field. It avoids the overhead of making the method call and setting up its stack frame, saving a couple of nanoseconds.
There are several rules that determine exactly when a method can be inlined. They are not exactly documented but have been mentioned in blog posts. One rule is that it won't happen when the method body is too large; that defeats the gain from inlining, since it generates too much code that doesn't fit as well in the L1 instruction cache. Another hard rule that applies here is that a method won't be inlined when it contains a try/catch statement. The background behind that one is an implementation detail of exceptions: they piggy-back onto Windows' built-in support for SEH (Structured Exception Handling), which is stack-frame based.
One behavior of the register allocation algorithm in the jitter can be inferred from playing with this code. It appears to be aware of when the jitter is trying to inline a method. One rule it appears to use is that only the edx:eax register pair can be used for inlined code that has local variables of type long. But not edi:ebx. No doubt because that would be too detrimental to the code generation for the calling method; both edi and ebx are important storage registers.
So you get the fast version because the jitter knows up front that the method body contains try/catch statements. It knows it can never be inlined, so it readily uses edi:ebx for storage for the long variable. You got the slow version because the jitter didn't know up front that inlining wouldn't work. It only found out after generating the code for the method body.
The flaw then is that it didn't go back and re-generate the code for the method. Which is understandable, given the time constraints it has to operate in.
This slow-down doesn't occur on x64 because, for one, it has 8 more registers. For another, it can store a long in just one register (like rax). And the slow-down doesn't occur when you use int instead of long, because the jitter has a lot more flexibility in picking registers.

