Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/2135788/

Time: 2020-08-27 22:13:05  Source: igfitidea

What do C and Assembler actually compile to?

c++ c compiler-construction linker assembly

Asked by lamas

So I found out that C(++) programs actually don't compile to plain "binary" (I may have gotten some things wrong here, in that case I'm sorry :D) but to a range of things (symbol table, os-related stuff,...) but...

  • Does assembler "compile" to pure binary? That means no extra stuff besides resources like predefined strings, etc.

  • If C compiles to something else than plain binary, how can that small assembler bootloader just copy the instructions from the HDD to memory and execute them? I mean if the OS kernel, which is probably written in C, compiles to something different than plain binary - how does the bootloader handle it?

edit: I know that assembler doesn't "compile" because it only has your machine's instruction set - I didn't find a good word for what assembler "assembles" to. If you have one, leave it here as comment and I'll change it.

Accepted answer by Norman Ramsey

C typically compiles to assembler, just because that makes life easy for the poor compiler writer.

Assembly code always assembles (not "compiles") to relocatable object code. You can think of this as binary machine code and binary data, but with lots of decoration and metadata. The key parts are:

  • Code and data appear in named "sections".

  • Relocatable object files may include definitions of labels, which refer to locations within the sections.

  • Relocatable object files may include "holes" that are to be filled with the values of labels defined elsewhere. The official name for such a hole is a relocation entry.


For example, if you compile and assemble (but don't link) this program

#include <stdio.h>
int main(void) { printf("Hello, world\n"); return 0; }

you are likely to wind up with a relocatable object file with

  • A text section containing the machine code for main

  • A label definition for main which points to the beginning of the text section

  • A rodata (read-only data) section containing the bytes of the string literal "Hello, world\n"

  • A relocation entry that depends on printf and that points to a "hole" in a call instruction in the middle of a text section.

If you are on a Unix system a relocatable object file is generally called a .o file, as in hello.o, and you can explore the label definitions and uses with a simple tool called nm, and you can get more detailed information from a somewhat more complicated tool called objdump.

I teach a class that covers these topics, and I have students write an assembler and linker, which takes a couple of weeks, but when they've done that most of them have a pretty good handle on relocatable object code. It's not such an easy thing.

Answer by Paul Nathan

Let's take a C program.

When you run gcc, clang, or 'cl' on a C program, it will go through these stages:

  1. Preprocessor (#include, #ifdef, trigraph analysis, encoding translations, comment management, macros...), including lexing into preprocessor tokens and eventually resulting in flat text for input to the compiler proper.
  2. Lexical analysis (producing tokens and lexical errors).
  3. Syntactical analysis (producing a parse tree and syntactical errors).
  4. Semantic analysis (producing a symbol table, scoping information and scoping/typing errors). Also data-flow analysis, transforming the program logic into an "intermediate representation" that the optimizer can work with (often in SSA form). clang/LLVM uses LLVM-IR; gcc uses GIMPLE, then RTL.
  5. Optimization of the program logic, including constant propagation, inlining, hoisting invariants out of loops, auto-vectorization, and many, many other things. (Most of the code for a widely-used modern compiler is optimization passes.) Transforming through intermediate representations is just part of how some compilers work, making it impossible / meaningless to "disable all optimizations".
  6. Outputting assembly source (or another intermediate format like .NET IL bytecode).
  7. Assembling of the assembly into some binary object format.
  8. Linking of the object code with whatever static libraries are needed, as well as relocating it if needed.
  9. Output of the final executable in ELF, PE/COFF, Mach-O, or whatever other format.

In practice, some of these steps may be done at the same time, but this is the logical order. Most compilers have options to stop after any given step (e.g. preprocess or asm), including dumping the internal representation between optimization passes for open-source compilers like GCC (-fdump-tree-...).


Note that there's a 'container' of ELF or COFF format around the actual executable binary, unless it's a DOS .com executable.

You will find that a book on compilers (I recommend the Dragon book, the standard introductory book in the field) will have all the information you need and more.

As Marco commented, linking and loading is a large area and the Dragon book more or less stops at the output of the executable binary. To actually go from there to running on an operating system is a decently complex process, which Levine covers in Linkers and Loaders.

I've wiki'd this answer to let people tweak any errors/add information.

Answer by Thomas Matthews

There are different phases in translating C++ into a binary executable. The language specification does not explicitly state the translation phases. However, I will describe the common translation phases.

Source C++ To Assembly or Intermediate Language

Some compilers actually translate the C++ code into an assembly language or an intermediate language. This is not a required phase, but helpful in debugging and optimizations.

Assembly To Object Code

The next common step is to translate assembly language into object code. The object code contains machine code with relative addresses and open references to external subroutines (methods or functions). In general, the translator puts as much information into an object file as it can; everything else is left unresolved.

Linking Object Code(s)

The linking phase combines one or more object codes, resolves references and eliminates duplicate subroutines. The final output is an executable file. This file contains information for the operating system and relative addresses.

Executing Binary Files

The Operating System loads the executable file, usually from a hard drive, and places it into memory. The OS may convert relative addresses into physical locations. The OS may also prepare resources (such as DLLs and GUI widgets) that are required by the executable (which may be stated in the Executable file).


Compiling Directly To Binary

Some compilers, such as the ones used in Embedded Systems, have the capability to compile from C++ directly to executable binary code. This code will have physical addresses instead of relative addresses and will not require an OS to load it.

Advantages

One of the advantages of these phases is that C++ programs can be broken into pieces, compiled individually and linked at a later time. They can even be linked with pieces from other developers (a.k.a. libraries). This allows developers to compile only the pieces in development and link in pieces that are already validated. In general, the translation from C++ to object code is the time-consuming part of the process. Also, a person doesn't want to wait for all the phases to complete when there is an error in the source code.

Keep an open mind and always expect the Third Alternative (Option).

Answer by t0mm13b

To answer your questions, please note that this is subjective, as there are different processors, platforms, assemblers and C compilers; in this case, I will talk about the Intel x86 platform.

  1. Assemblers do not produce pure binary. They produce raw machine code organized into segments, such as data, text and bss, to name but a few; this is called object code. The linker steps in and adjusts the segments to make the result executable, that is, ready to run. Incidentally, the default output name when you compile using gcc is 'a.out', which is shorthand for Assembler Output.
  2. Boot loaders are built with a special directive. Back in the days of DOS, it was common to find a directive such as .Org 100h, which marks the assembler code as being of the old .COM variety, from before .EXE took over in popularity. Also, you did not need an assembler to produce a .COM file: the old debug.exe that came with MS-DOS did the trick for small, simple programs. The .COM files did not need a linker and were a ready-to-run binary format. Here's a simple session using DEBUG.
a 0100
mov AH,07
int 21
cmp AL,00
jnz 010c
mov AH,07
int 21
mov AH,4C
int 21

r CX
10
n respond.com
w
q

This produces a ready-to-run .COM program called 'respond.com' that waits for a keystroke and does not echo it to the screen. Notice the 'a 0100' at the beginning, which shows that the instruction pointer starts at 100h; that is the hallmark of a .COM file. This old script was mainly used in batch files that needed to wait for a response without echoing it. The original script can be found here.


Again, in the case of boot loaders, the code is converted to a plain binary format. There was a program that used to come with DOS, called EXE2BIN, whose job was to convert the raw object code into a format that could be copied onto a bootable disk for booting. Remember that no linker is run against the assembled code, as the linker is for the runtime environment and sets up the code to make it runnable and executable.

When booting, the BIOS expects the code to be at segment:offset 0x7c00, if my memory serves me correctly. The code (after being EXE2BIN'd) starts executing; the bootloader then relocates itself lower down in memory and continues loading by issuing int 0x13 to read from the disk. It switches on the A20 gate, enables the DMA, and switches to protected mode (the BIOS runs in 16-bit mode). The data read from the disk is loaded into memory, and then the bootloader issues a far jump into the loaded code (likely to be written in C). That is in essence how the system boots.


Ok, the previous paragraph sounds abstracted and simple, I may have missed out something, but that is how it is in a nutshell.

Hope this helps, Best regards, Tom.

Answer by Potatoswatter

You have a lot of answers to read through, but I think I can keep this succinct.

"Binary code" refers to the bits that feed through the microprocessor's circuits. The microprocessor loads each instruction from memory in sequence, doing whatever they say. Different processor families have different formats for instructions: x86, ARM, PowerPC, etc. You point the processor at the instruction you want by giving it the address of the instruction in memory, and then it chugs merrily along through the rest of the program.

When you want to load a program into the processor, you first have to make the binary code accessible in memory so it has an address in the first place. The C compiler outputs a file in the filesystem, which has to be loaded into a new virtual address space. Therefore, in addition to binary code, that file has to include the information that it has binary code, and what its address space should look like.

A bootloader has different requirements, so its file format might be different. But the idea is the same: binary code is always a payload in a larger file format, which includes at a minimum a sanity check to ensure that it's written in the correct instruction set.

C compilers and assemblers are typically configured to produce static library files. For embedded applications, you're more likely to find a compiler which produces something like a raw memory image with instructions beginning at address zero. Otherwise, you can write a linker which converts the output of the C compiler into whatever else you want.

Answer by Kornel Kisielewicz

There are two things that you may be mixing up here. Generally there are two topics:

The latter may compile to the former in the process of assembly. Some intermediate formats are not assembled, but executed by a virtual machine. In the case of C++, it may be compiled into CIL, which is assembled into a .NET assembly, hence there may be some confusion.

But in general C and C++ are usually compiled into binary, or in other words, into an executable file format.

Answer by Steven Sudit

They compile to a file in a specific format (COFF for Windows, etc), composed of headers and segments, some of which have "plain binary" op codes. Assemblers and compilers (such as C) create the same sort of output. Some formats, such as the old *.COM files, had no headers, but still had certain assumptions (such as where in memory it would get loaded or how big it could be).


On Windows machines, the OS's bootstrapper is in a disk sector loaded by the BIOS, where both of these are "plain". Once the OS has loaded its loader, it can read files that have headers and segments.

Does that help?

Answer by Daniel Bingham

To answer the assembly part of the question, assembly doesn't compile to binary as I understand it. Assembly === binary. It directly translates. Each assembly operation has a binary string that directly matches it. Each operation has a binary code, and each register variable has a binary address.

That is, unless Assembler != Assembly and I'm misunderstanding your question.

Answer by Laizer

As I understand it, a chipset (CPU, etc.) will have a set of registers for storing data, and understand a set of instructions for manipulating these registers. The instructions will be things like 'store this value to this register', 'move this value', or 'compare these two values'. These instructions are often expressed in short human-grokable alphabetic codes (assembly language, or assembler) which are mapped to the numbers that the chipset understands - those numbers are presented to the chip in binary (machine code.)


Those codes are the lowest level that the software gets down to. Going deeper than that gets into the architecture of the actual chip, which is something I haven't gotten involved in.

Answer by Laizer

There are plenty of answers above for you to look at, but I thought I'd add these resources that'll give you a flavour of what happens. Basically, on Windows and Linux, someone has tried to create the tiniest executable possible: ELF on Linux, PE on Windows.

Both walk through what is removed and why, and you use assemblers to construct ELF files by hand, without using the -felf-like options that do it for you.

Hope that helps.

Edit - you could also take a look at the assembly for a bootloader like the one in truecrypt http://www.truecrypt.org or "stage1" of grub (the bit that actually gets written to the MBR).