C++: GCC compile error with more than 2 GB of code

Notice: this page reproduces a popular StackOverflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/6296837/

Time: 2020-08-28 19:52:31  Source: igfitidea

GCC compile error with >2 GB of code

Tags: c++, c, gcc, compiler-errors

Asked by bbtrb

I have a huge number of functions totaling around 2.8 GB of object code (unfortunately there's no way around it: scientific computing ...)

When I try to link them, I get (expected) relocation truncated to fit: R_X86_64_32S errors, which I hoped to circumvent by specifying the compiler flag -mcmodel=medium. All additionally linked libraries that I have control over are compiled with the -fpic flag.

Still, the error persists, and I assume that some libraries I link to are not compiled with PIC.

Here's the error:

/usr/lib/gcc/x86_64-redhat-linux/4.1.2/../../../../lib64/crt1.o: In function `_start':
(.text+0x12): relocation truncated to fit: R_X86_64_32S against symbol `__libc_csu_fini'     defined in .text section in /usr/lib64/libc_nonshared.a(elf-init.oS)
/usr/lib/gcc/x86_64-redhat-linux/4.1.2/../../../../lib64/crt1.o: In function `_start':
(.text+0x19): relocation truncated to fit: R_X86_64_32S against symbol `__libc_csu_init'    defined in .text section in /usr/lib64/libc_nonshared.a(elf-init.oS)
/usr/lib/gcc/x86_64-redhat-linux/4.1.2/../../../../lib64/crt1.o: In function `_start':
(.text+0x20): undefined reference to `main'
/usr/lib/gcc/x86_64-redhat-linux/4.1.2/../../../../lib64/crti.o: In function    `call_gmon_start':
(.text+0x7): relocation truncated to fit: R_X86_64_GOTPCREL against undefined symbol      `__gmon_start__'
/usr/lib/gcc/x86_64-redhat-linux/4.1.2/crtbegin.o: In function `__do_global_dtors_aux':
crtstuff.c:(.text+0xb): relocation truncated to fit: R_X86_64_PC32 against `.bss' 
crtstuff.c:(.text+0x13): relocation truncated to fit: R_X86_64_32 against symbol `__DTOR_END__' defined in .dtors section in /usr/lib/gcc/x86_64-redhat-linux/4.1.2/crtend.o
crtstuff.c:(.text+0x19): relocation truncated to fit: R_X86_64_32S against `.dtors'
crtstuff.c:(.text+0x28): relocation truncated to fit: R_X86_64_PC32 against `.bss'
crtstuff.c:(.text+0x38): relocation truncated to fit: R_X86_64_PC32 against `.bss'
crtstuff.c:(.text+0x3f): relocation truncated to fit: R_X86_64_32S against `.dtors'
crtstuff.c:(.text+0x46): relocation truncated to fit: R_X86_64_PC32 against `.bss'
crtstuff.c:(.text+0x51): additional relocation overflows omitted from the output
collect2: ld returned 1 exit status
make: *** [testsme] Error 1

And system libraries I link against:

-lgfortran -lm -lrt -lpthread

Any clues where to look for the problem?

EDIT: First of all, thank you for the discussion ... To clarify a bit, I have hundreds of functions (each approx 1 MB in size in separate object files) like this:

double func1(std::tr1::unordered_map<int, double> & csc, 
             std::vector<EvaluationNode::Ptr> & ti, 
             ProcessVars & s)
{
    double sum = 0.0, prefactor, expr;  // sum must start at zero before the += below

    prefactor = +s.ds8*s.ds10*ti[0]->value();
    expr =       ( - 5/243.*(s.x14*s.x15*csc[49300] + 9/10.*s.x14*s.x15*csc[49301] +
           1/10.*s.x14*s.x15*csc[49302] - 3/5.*s.x14*s.x15*csc[49303] -
           27/10.*s.x14*s.x15*csc[49304] + 12/5.*s.x14*s.x15*csc[49305] -
           3/10.*s.x14*s.x15*csc[49306] - 4/5.*s.x14*s.x15*csc[49307] +
           21/10.*s.x14*s.x15*csc[49308] + 1/10.*s.x14*s.x15*csc[49309] -
           s.x14*s.x15*csc[51370] - 9/10.*s.x14*s.x15*csc[51371] -
           1/10.*s.x14*s.x15*csc[51372] + 3/5.*s.x14*s.x15*csc[51373] +
           27/10.*s.x14*s.x15*csc[51374] - 12/5.*s.x14*s.x15*csc[51375] +
           3/10.*s.x14*s.x15*csc[51376] + 4/5.*s.x14*s.x15*csc[51377] -
           21/10.*s.x14*s.x15*csc[51378] - 1/10.*s.x14*s.x15*csc[51379] -
           2*s.x14*s.x15*csc[55100] - 9/5.*s.x14*s.x15*csc[55101] -
           1/5.*s.x14*s.x15*csc[55102] + 6/5.*s.x14*s.x15*csc[55103] +
           27/5.*s.x14*s.x15*csc[55104] - 24/5.*s.x14*s.x15*csc[55105] +
           3/5.*s.x14*s.x15*csc[55106] + 8/5.*s.x14*s.x15*csc[55107] -
           21/5.*s.x14*s.x15*csc[55108] - 1/5.*s.x14*s.x15*csc[55109] -
           2*s.x14*s.x15*csc[55170] - 9/5.*s.x14*s.x15*csc[55171] -
           1/5.*s.x14*s.x15*csc[55172] + 6/5.*s.x14*s.x15*csc[55173] +
           27/5.*s.x14*s.x15*csc[55174] - 24/5.*s.x14*s.x15*csc[55175] +
           // ...
           ;

        sum += prefactor*expr;
    // ...
    return sum;
}

The object s is relatively small and keeps the needed constants x14, x15, ..., ds0, ..., etc., while ti just returns a double from an external library. As you can see, csc[] is a precomputed map of values which is also evaluated in separate object files (again hundreds, each about 1 MB in size) of the following form:

void cscs132(std::tr1::unordered_map<int,double> & csc, ProcessVars & s)
{
    {
    double csc19295 =       + s.ds0*s.ds1*s.ds2 * ( -
           32*s.x12pow2*s.x15*s.x34*s.mbpow2*s.mWpowinv2 -
           32*s.x12pow2*s.x15*s.x35*s.mbpow2*s.mWpowinv2 -
           32*s.x12pow2*s.x15*s.x35*s.x45*s.mWpowinv2 -
           32*s.x12pow2*s.x25*s.x34*s.mbpow2*s.mWpowinv2 -
           32*s.x12pow2*s.x25*s.x35*s.mbpow2*s.mWpowinv2 -
           32*s.x12pow2*s.x25*s.x35*s.x45*s.mWpowinv2 +
           32*s.x12pow2*s.x34*s.mbpow4*s.mWpowinv2 +
           32*s.x12pow2*s.x34*s.x35*s.mbpow2*s.mWpowinv2 +
           32*s.x12pow2*s.x34*s.x45*s.mbpow2*s.mWpowinv2 +
           32*s.x12pow2*s.x35*s.mbpow4*s.mWpowinv2 +
           32*s.x12pow2*s.x35pow2*s.mbpow2*s.mWpowinv2 +
           32*s.x12pow2*s.x35pow2*s.x45*s.mWpowinv2 +
           64*s.x12pow2*s.x35*s.x45*s.mbpow2*s.mWpowinv2 +
           32*s.x12pow2*s.x35*s.x45pow2*s.mWpowinv2 -
           64*s.x12*s.p1p3*s.x15*s.mbpow4*s.mWpowinv2 +
           64*s.x12*s.p1p3*s.x15pow2*s.mbpow2*s.mWpowinv2 +
           96*s.x12*s.p1p3*s.x15*s.x25*s.mbpow2*s.mWpowinv2 -
           64*s.x12*s.p1p3*s.x15*s.x35*s.mbpow2*s.mWpowinv2 -
           64*s.x12*s.p1p3*s.x15*s.x45*s.mbpow2*s.mWpowinv2 -
           32*s.x12*s.p1p3*s.x25*s.mbpow4*s.mWpowinv2 +
           32*s.x12*s.p1p3*s.x25pow2*s.mbpow2*s.mWpowinv2 -
           32*s.x12*s.p1p3*s.x25*s.x35*s.mbpow2*s.mWpowinv2 -
           32*s.x12*s.p1p3*s.x25*s.x45*s.mbpow2*s.mWpowinv2 -
           32*s.x12*s.p1p3*s.x45*s.mbpow2 +
           64*s.x12*s.x14*s.x15pow2*s.x35*s.mWpowinv2 +
           96*s.x12*s.x14*s.x15*s.x25*s.x35*s.mWpowinv2 +
           32*s.x12*s.x14*s.x15*s.x34*s.mbpow2*s.mWpowinv2 -
           32*s.x12*s.x14*s.x15*s.x35*s.mbpow2*s.mWpowinv2 -
           64*s.x12*s.x14*s.x15*s.x35pow2*s.mWpowinv2 -
           32*s.x12*s.x14*s.x15*s.x35*s.x45*s.mWpowinv2 +
           32*s.x12*s.x14*s.x25pow2*s.x35*s.mWpowinv2 +
           32*s.x12*s.x14*s.x25*s.x34*s.mbpow2*s.mWpowinv2 -
           32*s.x12*s.x14*s.x25*s.x35pow2*s.mWpowinv2 -
           // ...

       csc.insert(cscMap::value_type(192953, csc19295));
    }

    {
       double csc19296 =      // ... ;

       csc.insert(cscMap::value_type(192956, csc19296));
    }

    // ...
}

That's about it. The final step then just consists of calling all those func[i] and summing up the results.

Concerning the fact that this is a rather special and unusual case: Yes, it is. This is what people have to cope with when trying to do high precision computations for particle physics.

EDIT2: I should also add that x12, x13, etc. are not really constants. They are set to specific values, all those functions are run and the result returned, and then a new set of x12, x13, etc. is chosen to produce the next value. And this has to be done 10^5 to 10^6 times...

EDIT3: Thank you for the suggestions and the discussion so far... I'll try to roll the loops up at code generation somehow. To be honest, I'm not sure how to do this exactly, but this is the best bet.

BTW, I didn't try to hide behind "this is scientific computing -- no way to optimize". It's just that the basis for this code is something that comes out of a "black box" to which I have no real access and, moreover, the whole thing worked great with simple examples, and I mainly feel overwhelmed by what happens in a real-world application ...

EDIT4: So, I have managed to reduce the code size of the csc definitions by about one fourth by simplifying expressions in a computer algebra system (Mathematica). I now also see a way to reduce it by another order of magnitude or so by applying some other tricks before generating the code (which would bring this part down to about 100 MB), and I hope this idea works.

Now related to your answers: I'm trying to roll the loops back up again in the funcs, where a CAS won't help much, but I already have some ideas. For instance, sorting the expressions by variables like x12, x13, ..., parsing the cscs with Python and generating tables that relate them to each other. Then I can at least generate these parts as loops. As this seems to be the best solution so far, I mark this as the best answer.

However, I'd also like to give credit to VJo. GCC 4.6 indeed works much better, produces smaller code and is faster. Using the large model works on the code as-is. So technically this is the correct answer, but changing the whole concept is a much better approach.

Thank you all for your suggestions and help. If anyone is interested, I'm going to post the final outcome as soon as I am ready.

REMARKS: Just some remarks on some of the other answers: The code I'm trying to run does not originate in an expansion of simple functions/algorithms and stupid unnecessary unrolling. What actually happens is that the stuff we start with is pretty complicated mathematical objects, and bringing them to a numerically computable form generates these expressions. The problem actually lies in the underlying physical theory. The complexity of intermediate expressions scales factorially, which is well known, but when combining all of this stuff into something physically measurable -- an observable -- it just boils down to only a handful of very small functions that form the basis of the expressions. (There is definitely something "wrong" in this respect with the general and only available ansatz, which is called "perturbation theory".) We try to bring this ansatz to another level, which is not feasible analytically anymore and where the basis of needed functions is not known. So we try to brute-force it like this. Not the best way, but hopefully one that helps with our understanding of the physics at hand in the end...

LAST EDIT: Thanks to all your suggestions, I've managed to reduce the code size considerably, using Mathematica and a modification of the code generator for the funcs somewhat along the lines of the top answer :)

I have simplified the csc functions with Mathematica, bringing them down to 92 MB. This is the irreducible part. The first attempts took forever, but after some optimizations this now runs through in about 10 minutes on a single CPU.

The effect on the funcs was dramatic: The whole code size for them is down to approximately 9 MB, so the code now totals in the 100 MB range. Now it makes sense to turn optimizations on and the execution is quite fast.

Again, thank you all for your suggestions, I've learned a lot.

Accepted answer by Andrei

So, you already have a program that produces this text:

prefactor = +s.ds8*s.ds10*ti[0]->value();
expr = ( - 5/243.*(s.x14*s.x15*csc[49300] + 9/10.*s.x14*s.x15*csc[49301] +
       1/10.*s.x14*s.x15*csc[49302] - 3/5.*s.x14*s.x15*csc[49303] -...

and

double csc19295 =       + s.ds0*s.ds1*s.ds2 * ( -
       32*s.x12pow2*s.x15*s.x34*s.mbpow2*s.mWpowinv2 -
       32*s.x12pow2*s.x15*s.x35*s.mbpow2*s.mWpowinv2 -
       32*s.x12pow2*s.x15*s.x35*s.x45*s.mWpowinv2 -...

right?

If all your functions have a similar "format" (multiply n numbers m times and add the results - or something similar) then I think you can do this:

  • change the generator program to output offsets instead of strings (i.e. instead of the string "s.ds0" it will produce offsetof(ProcessVars, ds0))
  • create an array of such offsets
  • write an evaluator which accepts the array above and the base addresses of the structure pointers and produces a result

The array+evaluator will represent the same logic as one of your functions, but only the evaluator will be code. The array is "data" and can be either generated at runtime or saved on disk and read in chunks or via a memory-mapped file.

For your particular example in func1, imagine how you would rewrite the function via an evaluator if you had access to the base addresses of s and csc, and also to a vector-like representation of the constants and of the offsets you need to add to the base addresses to get to x14, ds8 and csc[51370].

You need to create a new form of "data" that will describe how to process the actual data you pass to your huge number of functions.
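
A minimal sketch of the offset-table idea, under loud assumptions: the real ProcessVars is not shown in the question, so the struct below is a cut-down stand-in with only double members, and the names Term, field_at, evaluate and csc19295_head are hypothetical. Each term is a coefficient times a product of fields of s addressed by byte offset; the table holds the first two terms inside the parentheses of csc19295 (the common prefactor s.ds0*s.ds1*s.ds2 is left out).

```cpp
#include <cassert>
#include <cstddef>   // offsetof, std::size_t
#include <cstring>   // std::memcpy
#include <vector>

// Assumption: a cut-down stand-in for the real ProcessVars.
struct ProcessVars {
    double x12pow2, x15, x34, x35, mbpow2, mWpowinv2;
};

// One term: coeff * product of the fields found at the given byte offsets.
struct Term {
    double coeff;
    std::vector<std::size_t> offsets;
};

// Read the double located `offset` bytes past the start of `s`.
inline double field_at(const ProcessVars& s, std::size_t offset) {
    assert(offset + sizeof(double) <= sizeof(ProcessVars));
    double v;
    std::memcpy(&v, reinterpret_cast<const char*>(&s) + offset, sizeof v);
    return v;
}

// The evaluator is the only code; every expression becomes a table of Terms.
double evaluate(const std::vector<Term>& terms, const ProcessVars& s) {
    double sum = 0.0;
    for (const Term& t : terms) {
        double prod = t.coeff;
        for (std::size_t off : t.offsets) prod *= field_at(s, off);
        sum += prod;
    }
    return sum;
}

// -32*x12pow2*x15*x34*mbpow2*mWpowinv2 - 32*x12pow2*x15*x35*mbpow2*mWpowinv2
std::vector<Term> csc19295_head() {
    return {
        {-32.0, {offsetof(ProcessVars, x12pow2), offsetof(ProcessVars, x15),
                 offsetof(ProcessVars, x34), offsetof(ProcessVars, mbpow2),
                 offsetof(ProcessVars, mWpowinv2)}},
        {-32.0, {offsetof(ProcessVars, x12pow2), offsetof(ProcessVars, x15),
                 offsetof(ProcessVars, x35), offsetof(ProcessVars, mbpow2),
                 offsetof(ProcessVars, mWpowinv2)}},
    };
}
```

In a real generator the table would presumably be written to a file rather than returned from a function, so that it never passes through the compiler and linker at all.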

Answer by B?ови?

The x86-64 ABI used by Linux defines a "large model" specifically to avoid such size limitations, which includes 64-bit relocation types for the GOT and PLT. (See the table in section 4.4.2, and the instruction sequences in 3.5.5 which show how they are used.)

Since your functions occupy 2.8 GB, you are out of luck, because gcc doesn't support large models. What you can do is reorganize your code in a way that allows you to split it into shared libraries that you link dynamically.

If that is not possible, as someone suggested, instead of putting your data into code (compiling and linking it), since it is huge, you can load it at run time (either as a normal file, or you can mmap it).
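
As a rough illustration of the mmap variant, here is a sketch assuming a POSIX system and a file that holds nothing but raw doubles in the machine's native format. The class MappedTable and the helper write_table are hypothetical names, not part of any existing code base.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <stdexcept>
#include <string>
#include <vector>

#include <fcntl.h>      // open
#include <sys/mman.h>   // mmap, munmap
#include <sys/stat.h>   // fstat
#include <unistd.h>     // close

// Read-only view of a binary file of doubles, mapped into memory (POSIX only).
class MappedTable {
public:
    explicit MappedTable(const std::string& path) {
        const int fd = ::open(path.c_str(), O_RDONLY);
        if (fd < 0) throw std::runtime_error("cannot open " + path);
        struct stat st;
        if (::fstat(fd, &st) != 0 || st.st_size == 0) {
            ::close(fd);
            throw std::runtime_error("cannot stat, or empty file: " + path);
        }
        bytes_ = static_cast<std::size_t>(st.st_size);
        void* p = ::mmap(nullptr, bytes_, PROT_READ, MAP_PRIVATE, fd, 0);
        ::close(fd);  // the mapping stays valid after the descriptor is closed
        if (p == MAP_FAILED) throw std::runtime_error("mmap failed: " + path);
        data_ = static_cast<const double*>(p);
    }
    ~MappedTable() { ::munmap(const_cast<double*>(data_), bytes_); }
    MappedTable(const MappedTable&) = delete;
    MappedTable& operator=(const MappedTable&) = delete;

    std::size_t size() const { return bytes_ / sizeof(double); }
    double operator[](std::size_t i) const {
        assert(i < size());
        return data_[i];
    }

private:
    const double* data_ = nullptr;
    std::size_t bytes_ = 0;
};

// Generator side: dump the table as raw native doubles.
void write_table(const std::string& path, const std::vector<double>& values) {
    std::ofstream out(path, std::ios::binary);
    out.write(reinterpret_cast<const char*>(values.data()),
              static_cast<std::streamsize>(values.size() * sizeof(double)));
    out.flush();
    if (!out) throw std::runtime_error("cannot write " + path);
}
```

With a mapping like this the operating system pages the table in on demand, and the table size no longer affects what the linker has to relocate.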

EDIT

Seems like the large model is supported by gcc 4.6 (see this page). You can try that, but the above still applies about reorganizing your code.

Answer by bdonlan

With a program of that size, cache misses for code are very likely to exceed the costs of looping at runtime. I would recommend that you go back to your code generator, and have it generate some compact representation for what it wants evaluated (i.e., one likely to fit in D-cache), then execute that with an interpreter in your program. You could also see if you can factor out smaller kernels that still have a significant number of operations, then use those as 'instructions' in the interpreted code.
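
One possible shape for such an interpreter is a tiny stack machine. This is only a sketch: the opcode set (PushConst, LoadVar, Mul, Add) and the names Instr, Program and run are assumptions made for illustration, and a real design would likely add coarser "kernel" opcodes as suggested above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

enum class Op : std::uint8_t { PushConst, LoadVar, Mul, Add };

struct Instr {
    Op op;
    std::uint32_t index;  // index into consts or vars; ignored by Mul and Add
};

struct Program {
    std::vector<Instr> code;     // compact: 8 bytes per instruction
    std::vector<double> consts;  // numeric coefficients
};

// Execute the program against one set of variable values.
double run(const Program& p, const std::vector<double>& vars) {
    std::vector<double> stack;
    for (const Instr& in : p.code) {
        switch (in.op) {
        case Op::PushConst: stack.push_back(p.consts.at(in.index)); break;
        case Op::LoadVar:   stack.push_back(vars.at(in.index)); break;
        case Op::Mul:
        case Op::Add: {
            assert(stack.size() >= 2);
            const double rhs = stack.back();
            stack.pop_back();
            double& lhs = stack.back();
            lhs = (in.op == Op::Mul) ? lhs * rhs : lhs + rhs;
            break;
        }
        }
    }
    assert(stack.size() == 1);  // a well-formed program leaves one result
    return stack.back();
}
```

For example, the expression 32*v0*v1 + 64*v2 becomes nine instructions: PushConst 0, LoadVar 0, Mul, LoadVar 1, Mul, PushConst 1, LoadVar 2, Mul, Add.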

Answer by zvrba

The error occurs because you have too much CODE, not data! This is indicated by, for example, __libc_csu_fini (which is a function) being referenced from _start, with the relocation truncated to fit. This means that _start (the program's true entry point) is trying to call that function via a SIGNED 32-bit offset, which has a range of only 2 GB. Since the total amount of your object code is ~2.8 GB, the facts check out.
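
The range arithmetic can be checked with a few lines of code. This sketch only illustrates the signed 32-bit limit; the linker performs the real check per relocation, and the names fits_in_signed32, to_rel32 and kGiB are made up for the example.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// True if `displacement` is representable as a signed 32-bit value,
// i.e. lies in [-2^31, 2^31 - 1].
constexpr bool fits_in_signed32(std::int64_t displacement) {
    return displacement >= std::numeric_limits<std::int32_t>::min() &&
           displacement <= std::numeric_limits<std::int32_t>::max();
}

constexpr std::int64_t kGiB = std::int64_t{1} << 30;

// Narrow a displacement that is known to fit; asserts otherwise.
inline std::int32_t to_rel32(std::int64_t displacement) {
    assert(fits_in_signed32(displacement));
    return static_cast<std::int32_t>(displacement);
}

static_assert(fits_in_signed32(kGiB), "1 GiB is reachable");
static_assert(!fits_in_signed32(2 * kGiB), "2^31 is just out of range");
static_assert(!fits_in_signed32(2800000000LL), "2.8 GB is out of range");
```

Any symbol that ends up more than 2^31 - 1 bytes away from the place that refers to it cannot be encoded in such a field, which matches the "truncated to fit" message.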

If you could redesign your data structures, much of your code could be "compressed" by rewriting the huge expressions as simple loops.

Also, you could compute csc[] in a different program, store the results in a file, and just load them when necessary.
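
A sketch of that store-and-load step, assuming the map is a std::unordered_map<int, double> (the question uses the std::tr1 variant) and a simple native binary format of key/value pairs. The names CscMap, save_csc, load_csc and csc_at are hypothetical.

```cpp
#include <cassert>
#include <fstream>
#include <stdexcept>
#include <string>
#include <unordered_map>

using CscMap = std::unordered_map<int, double>;

// Producer program: write every (key, value) pair as raw native bytes.
void save_csc(const std::string& path, const CscMap& csc) {
    std::ofstream out(path, std::ios::binary);
    for (const auto& kv : csc) {
        out.write(reinterpret_cast<const char*>(&kv.first), sizeof kv.first);
        out.write(reinterpret_cast<const char*>(&kv.second), sizeof kv.second);
    }
    out.flush();
    if (!out) throw std::runtime_error("cannot write " + path);
}

// Consumer program: read pairs until the file is exhausted.
CscMap load_csc(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    if (!in) throw std::runtime_error("cannot open " + path);
    CscMap csc;
    int key;
    double value;
    while (in.read(reinterpret_cast<char*>(&key), sizeof key) &&
           in.read(reinterpret_cast<char*>(&value), sizeof value)) {
        csc[key] = value;
    }
    return csc;
}

// Checked lookup: unlike csc[key], this never silently inserts 0.0.
inline double csc_at(const CscMap& csc, int key) {
    const auto it = csc.find(key);
    assert(it != csc.end());
    return it->second;
}
```

The format is not portable across machines with different endianness or type sizes; that is acceptable only if producer and consumer run on the same kind of system.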

Answer by AlefSin

I think everybody agrees there should be a different way to do what you want to do. Compiling hundreds of megabytes (gigabytes?) of code, linking it into a multi-gigabyte executable and running it just sounds very inefficient.

If I understand your problem correctly, you use some sort of code generator, G, to generate a bunch of functions func1...N which take a bunch of maps csc1...M as input. What you want to do is to calculate csc1...M, and run a loop of 1,000,000 iterations for different inputs, each time finding s = func1 + func2 + ... + funcN. You didn't specify how func1...N are related to csc1...M, though.

If all that is true, it seems that you should be able to turn the problem on its head in a different way, which can potentially be much more manageable and possibly even faster (i.e. letting your machine's cache actually function).

Besides the practical problem of the object file sizes, your current program will not be efficient since it does not localize access to the data (too many huge maps) and has no localized code execution (too many very long functions).

How about breaking your program into 3 phases: Phase 1 builds csc1...M and stores them. Phase 2 builds one func at a time, runs it 1,000,000 times, once per input, and stores the results. Phase 3 sums the stored results of func1...N for each of the 1,000,000 runs. The good part about this solution is that it can easily be made parallel across several independent machines.
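
Phases 2 and 3 could be sketched as follows. The types Input and Func and the function names are assumptions for illustration, and the results are kept in memory here, whereas the proposal above stores them (for example on disk) between phases.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

using Input = std::vector<double>;  // assumption: one set of x12, x13, ... values
using Func  = std::function<double(const Input&)>;

// Phase 2: evaluate ONE func on every input; result[i] belongs to inputs[i].
std::vector<double> run_one_func(const Func& f,
                                 const std::vector<Input>& inputs) {
    std::vector<double> out;
    out.reserve(inputs.size());
    for (const Input& in : inputs) out.push_back(f(in));
    return out;
}

// Phase 3: add the stored per-func results element-wise, one sum per input.
std::vector<double> sum_results(
        const std::vector<std::vector<double>>& per_func) {
    if (per_func.empty()) return {};
    std::vector<double> total(per_func.front().size(), 0.0);
    for (const std::vector<double>& r : per_func) {
        assert(r.size() == total.size());
        for (std::size_t i = 0; i < r.size(); ++i) total[i] += r[i];
    }
    return total;
}
```

Because each call to run_one_func touches only one func, only that func's code has to be resident at a time, and different funcs can be handed to different machines.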

Edit: @bbtrb, could you make one func and one csc available somewhere? They seem to be highly regular and compressible. For instance, func1 seems to be just a sum of expressions, each consisting of 1 coefficient, 2 indexes to the variables in s and 1 index into csc. So it can be reduced to a nice loop. If you make complete examples available, I'm sure ways can be found to compress them into loops rather than long expressions.
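
The visible part of func1 already shows this regularity: the ten coefficients for csc[49300..49309] reappear with factor -1 for csc[51370..51379] and with factor -2 for csc[55100..55109]. A sketch of the rolled-up form follows, covering only these three complete blocks (the fourth block, starting at 55170, is truncated in the question and is left out). The names kBase, Block, kBlocks and func1_head are hypothetical, and the prefactor from the question is not applied.

```cpp
#include <cassert>
#include <unordered_map>

using CscMap = std::unordered_map<int, double>;

// Coefficients shared by every complete block of ten csc terms in func1.
const double kBase[10] = {1.0,   9/10.,  1/10., -3/5., -27/10.,
                          12/5., -3/10., -4/5., 21/10., 1/10.};

// Each block: its first csc key and the factor applied to kBase.
struct Block {
    int first_key;
    double scale;
};
const Block kBlocks[] = {{49300, 1.0}, {51370, -1.0}, {55100, -2.0}};

// Value of expr restricted to the three complete blocks shown in the question.
double func1_head(const CscMap& csc, double x14, double x15) {
    double acc = 0.0;
    for (const Block& b : kBlocks) {
        for (int j = 0; j < 10; ++j) {
            const auto it = csc.find(b.first_key + j);
            assert(it != csc.end());  // every referenced key must be present
            acc += b.scale * kBase[j] * it->second;
        }
    }
    // x14*x15 multiplies every term, so it is factored out of the sum.
    return -5/243. * x14 * x15 * acc;
}
```

The result can differ from the unrolled expression in the last few bits, because the floating-point operations are performed in a different order.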

Answer by AProgrammer

If I read your errors correctly, what pushes you over the limit is the initialized data section (if it were the code, you would have far more errors IMHO). Do you have big arrays of global data? If so, I'd restructure the program so that they are allocated dynamically. If the data is initialized, I'd read it from a configuration file.

BTW seeing this:

(.text+0x20): undefined reference to `main'

I think you have another problem.

Answer by malkia

A couple of suggestions: optimize for size (-Os), turn your inline function calls into normal function calls, and enable string pooling.

Try splitting things into different DLLs (shared objects: .so on Linux, .dylib on Mac OS X). Make sure that they can be unloaded. Then implement something to load things on demand, and free them when not needed.

If not, split your code into different executables, and use something to communicate between them (pipes, sockets, even writing / reading to file). Clumsy, but what options do you have?

A totally different alternative: use a dynamic language with a JIT. Off the top of my head: use LuaJIT, and rewrite (regenerate?) a lot of these expressions in Lua, or in other such languages and runtimes that allow code to be garbage collected.

LuaJIT is quite efficient, sometimes beating C/C++ for certain things, and often coming very close (though it can sometimes be slow because its garbage collection is still weak). Check for yourself:

http://luajit.org/performance_x86.html

Download the scimark2.lua file from there, and compare it with the "C" version (google it); the results are often very close.

Answer by Donal Fellows

It looks to me like the code is doing numerical integration using some kind of adaptive depth method. Unfortunately, the code generator (or rather the author of the code generator) is so stupid as to generate one function per patch rather than one per type of patch. As such, it's produced too much code to be compiled, and even if it could be compiled its execution would be painful because nothing's ever shared anywhere ever. (Can you imagine the pain resulting from having to load each page of object code from disk because nothing is ever shared and so it's always a candidate for the OS to evict? To say nothing of instruction caches, which are going to be useless.)

The fix is to stop unrolling everything; for this sort of code, you want to maximize sharing, as the overhead of extra instructions to access data in more complex patterns will be absorbed by the cost of dealing with the (presumably) large underlying dataset anyway. It's also possible that the code generator will even do this by default, and that the scientist saw some options for unrolling (with the note that these sometimes improve speed) and turned them all on at once and is now insisting that this resulting mess be accepted by the computer, rather than accepting the machine's real restrictions and using the numerically correct version that is generated by default. But if the code generator won't do it, get one that will (or hack the existing code).

The bottom line: compiling and linking 2.8 GB of code doesn't work and shouldn't be forced to work. Find another way.

Answer by ajklbahu8geag

The linker is attempting to generate 32-bit relocation offsets within a binary that has somehow exceeded these limitations. Try reducing the main program's address space requirements.

Can you split some/most of the object code into one or more libraries (also compiled with -fpic / -fPIC)? Then generate a non-static binary that links against these libs. The libraries will live in discrete memory blocks and your relocation offsets will be dynamic/absolute (64-bit) rather than relative (32-bit).

Answer by Brian

Those expressions look a lot like an alternating series to me. I don't know what the rest of the code looks like, but it doesn't seem like it'd be that hard to derive the generating expression. It'd probably be worth it at execution time too, especially if you have 2.8 GB of 2 KB unrolled code.
