Why are C character literals ints instead of chars?

Question

Asked by Joseph Garvin

In C++, sizeof('a') == sizeof(char) == 1. This makes intuitive sense, since 'a' is a character literal, and sizeof(char) == 1 as defined by the standard.

In C however, sizeof('a') == sizeof(int). That is, it appears that C character literals are actually integers. Does anyone know why? I can find plenty of mentions of this C quirk but no explanation for why it exists.

Answer 1

Accepted answer by Malx

A discussion on the same subject:

"More specifically the integral promotions. In K&R C it was virtually (?) impossible to use a character value without it being promoted to int first, so making character constant int in the first place eliminated that step. There were and still are multi character constants such as 'abcd' or however many will fit in an int."

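The promotion step mentioned in the quote can be observed with sizeof. The following is a minimal sketch, not taken from the linked discussion; the function names are made up for illustration. It compares a char object, the same object used in arithmetic, and the character constant itself.

```c
#include <assert.h>
#include <stddef.h>

/* Size of a plain char object: 1 by definition. */
size_t size_of_char_object(void)
{
    char c = 'a';
    return sizeof c;
}

/* Size of the same char once it takes part in arithmetic: both operands
   are promoted to int before the addition, so the result has type int. */
size_t size_of_char_expression(void)
{
    char c = 'a';
    return sizeof(c + c);
}

/* Size of the character constant itself, which is already an int in C. */
size_t size_of_char_constant(void)
{
    return sizeof('a');
}

/* Hypothetical self-check tying the three together. */
void check_char_sizes(void)
{
    assert(size_of_char_object() == 1);
    assert(size_of_char_expression() == sizeof(int));
    assert(size_of_char_constant() == sizeof(int));
}
```

Calling check_char_sizes() from main should pass silently when compiled as C. Compiled as C++, the last assertion would be expected to fail on typical platforms, since 'a' has type char there.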

Answer 2

Answered by John Vincent

The original question is "why?"

The reason is that the definition of a literal character has evolved and changed, while trying to remain backwards compatible with existing code.

In the dark days of early C there were no types at all. By the time I first learnt to program in C, types had been introduced, but functions didn't have prototypes to tell the caller what the argument types were. Instead it was standardised that everything passed as a parameter would either be the size of an int (this included all pointers) or it would be a double.

This meant that when you were writing the function, all the parameters that weren't double were stored on the stack as ints, no matter how you declared them, and the compiler put code in the function to handle this for you.

This made things somewhat inconsistent, so when K&R wrote their famous book, they put in the rule that a character literal would always be promoted to an int in any expression, not just a function parameter.

When the ANSI committee first standardised C, they changed this rule so that a character literal would simply be an int, since this seemed a simpler way of achieving the same thing.
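
The "everything is passed as an int or a double" convention described above survives in modern C as the default argument promotions, which still apply to arguments passed through "...". Here is a small sketch of that, assuming a made-up helper called sum_chars; it is an illustration, not code from the answer.

```c
#include <assert.h>
#include <stdarg.h>

/* Hypothetical helper: sums `count` character arguments.
   Arguments passed through "..." undergo the default argument promotions,
   so each char arrives as an int and must be fetched with va_arg(ap, int). */
int sum_chars(int count, ...)
{
    va_list ap;
    int total = 0;

    va_start(ap, count);
    for (int i = 0; i < count; i++) {
        total += va_arg(ap, int);   /* va_arg(ap, char) would be undefined behaviour */
    }
    va_end(ap);
    return total;
}

/* Hypothetical self-check: chars go in, ints come out. */
void check_sum_chars(void)
{
    char a = 'a', b = 'b';
    assert(sum_chars(2, a, b) == 'a' + 'b');
}
```

The callee never sees a char-sized argument here, which is the same situation the answer describes for pre-prototype functions.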

When C++ was being designed, all functions were required to have full prototypes (this is still not required in C, although it is universally accepted as good practice). Because of this, it was decided that a character literal could be stored in a char. The advantage of this in C++ is that a function with a char parameter and a function with an int parameter have different signatures. This advantage is not the case in C.

This is why they are different. Evolution...

Answer 3

Answered by Johannes Schaub - litb

I don't know the specific reasons why a character literal in C is of type int. But in C++, there is a good reason not to go that way. Consider this:

void print(int);
void print(char);

print('a');

You would expect that the call to print selects the second version taking a char. Having a character literal being an int would make that impossible. Note that in C++ literals having more than one character still have type int, although their value is implementation-defined. So, 'ab' has type int, while 'a' has type char.
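
C has no overloading, but C11's _Generic selects a branch by static type in a similar way, so it can show what type 'a' has on the C side. This is a sketch assuming a C11 compiler; TYPE_NAME and check_literal_type are names invented for this example.

```c
#include <assert.h>
#include <string.h>

/* C11 _Generic picks a branch from the static type of its operand,
   which is about as close as C gets to overload resolution. */
#define TYPE_NAME(x) _Generic((x), \
    char: "char",                  \
    int: "int",                    \
    default: "other")

/* Hypothetical self-check comparing a char object with a character constant. */
void check_literal_type(void)
{
    char c = 'a';
    assert(strcmp(TYPE_NAME(c), "char") == 0);   /* a char object is a char */
    assert(strcmp(TYPE_NAME('a'), "int") == 0);  /* the constant itself is an int in C */
}
```

In other words, a C "overload" on char would never be picked for a bare 'a', which is the situation the C++ rule avoids.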

Answer 4

Answered by dmckee --- ex-moderator kitten

Using gcc on my MacBook, I tried:

#include <stdio.h>
/* print the operand's spelling and its size in bytes; sizeof yields a size_t, hence %zu */
#define test(A) do{printf(#A":\t%zu\n",sizeof(A));}while(0)
int main(void){
  test('a');
  test("a");
  test("");
  test(char);
  test(short);
  test(int);
  test(long);
  test((char)0x0);
  test((short)0x0);
  test((int)0x0);
  test((long)0x0);
  return 0;
}

which when run gives:

'a':    4
"a":    2
"":     1
char:   1
short:  2
int:    4
long:   4
(char)0x0:      1
(short)0x0:     2
(int)0x0:       4
(long)0x0:      4

which suggests that a char occupies one byte, as you suspected, but a character literal is an int.

Answer 5

Answered by Tony Delroy

Back when C was being written, the PDP-11's MACRO-11 assembly language had:

MOV #'A, R0      ; 8-bit character encoding for 'A' into 16 bit register

This kind of thing's quite common in assembly language - the low 8 bits will hold the character code, other bits cleared to 0. PDP-11 even had:

MOV #"AB, R0     ; 16-bit character encoding for 'A' (low byte) and 'B'

This provided a convenient way to load two characters into the low and high bytes of the 16 bit register. You might then write those elsewhere, updating some textual data or screen memory.

So, the idea of characters being promoted to register size is quite normal and desirable. But, let's say you need to get 'A' into a register not as part of the hard-coded opcode, but from somewhere in main memory containing:

address: value
20: 'X'
21: 'A'
22: 'A'
23: 'X'
24: 0
25: 'A'
26: 'A'
27: 0
28: 'A'

If you want to read just an 'A' from this main memory into a register, which one would you read?

Some CPUs may only directly support reading a 16-bit value into a 16-bit register, which would mean a read at 20 or 22 would then require the bits from 'X' to be cleared out, and depending on the endianness of the CPU, one or the other would need shifting into the low-order byte.
Some CPUs may require a memory-aligned read, which means that the lowest address involved must be a multiple of the data size: you might be able to read from addresses 24 and 25, but not 27 and 28.

So, a compiler generating code to get an 'A' into the register may prefer to waste a little extra memory and encode the value as 0 'A' or 'A' 0 - depending on endianness, and also ensuring it is aligned properly (i.e. not at an odd memory address).

My guess is that C simply carried this level of CPU-centric behaviour over, thinking of character constants as occupying register-sized units of memory, bearing out the common assessment of C as a "high level assembler".

(See 6.3.3 on page 6-25 of http://www.dmv.net/dec/pdf/macro.pdf)

Answer 6

Answered by Michael Burr

I haven't seen a rationale for it (C char literals being int types), but here's something Stroustrup had to say about it (from Design and Evolution 11.2.1 - Fine-Grain Resolution):

In C, the type of a character literal such as 'a' is int. Surprisingly, giving 'a' type char in C++ doesn't cause any compatibility problems. Except for the pathological example sizeof('a'), every construct that can be expressed in both C and C++ gives the same result.

So for the most part, it should cause no problems.

Answer 7

Answered by Kyle Cronin

I remember reading K&R and seeing a code snippet that would read a character at a time until it hit EOF. Since all characters are valid characters to be in a file/input stream, this means that EOF cannot be any char value. What the code did was to put the read character into an int, then test for EOF, then convert to a char if it wasn't.

I realize this doesn't exactly answer your question, but it would make some sense for the rest of the character literals to be sizeof(int) if the EOF literal was.

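int r;                 /* int, not char, so EOF stays distinct from every byte value */
char buffer[1024], *p;
p = buffer;

/* stop at EOF or when the buffer is full; file is an already opened FILE * */
while (p < buffer + sizeof buffer && (r = getc(file)) != EOF)
{
  *(p++) = (char) r;
}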
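
To make the "EOF cannot be any char value" point concrete, here is a small sketch with two checks. The helper names are invented for this example, and it leans on two guarantees of the C library: EOF is a negative int, and getc() returns each byte as an unsigned char converted to int.

```c
#include <assert.h>
#include <limits.h>
#include <stdio.h>

/* EOF is a negative int, while getc() returns each byte as an unsigned char
   converted to int, so no byte value can collide with it. */
int eof_differs_from_every_byte(void)
{
    for (int byte = 0; byte <= UCHAR_MAX; byte++) {
        if (byte == EOF) {
            return 0;
        }
    }
    return 1;
}

/* Hypothetical helper: squeezes a getc() result into an unsigned char and
   reports whether it still compares equal to EOF. It never does, because
   the stored value promotes back to a non-negative int. */
int still_eof_after_narrowing(int r)
{
    unsigned char c = (unsigned char) r;
    return c == EOF;
}

/* Hypothetical self-check for both properties. */
void check_eof(void)
{
    assert(eof_differs_from_every_byte());
    assert(!still_eof_after_narrowing(EOF));
}
```

This is why the loop keeps the result of getc() in an int until after the EOF test: narrowing first would throw away the one value that signals the end of input.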

Answer 8

Answered by Davislor

The historical reason for this is that C, and its predecessor B, were originally developed on various models of DEC PDP minicomputers with various word sizes, which supported 8-bit ASCII but could only perform arithmetic on registers. (Not the PDP-11, however; that came later.) Early versions of C defined int to be the native word size of the machine, and any value smaller than an int needed to be widened to int in order to be passed to or from a function, or used in a bitwise, logical or arithmetic expression, because that was how the underlying hardware worked.

That is also why the integer promotion rules still say that any data type smaller than an int is promoted to int. C implementations are also allowed to use one's-complement math instead of two's-complement for similar historical reasons. The reason that octal character escapes and octal constants are first-class citizens compared to hex is likewise that those early DEC minicomputers had word sizes divisible into three-bit chunks but not four-bit nibbles.
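
As a quick illustration of the octal and hex spellings mentioned above, the sketch below writes the value 65 four ways. The function names are made up for this example, and the comparison with 'A' assumes an ASCII execution character set.

```c
#include <assert.h>

/* Octal and hexadecimal spellings of the same value, 65. */
int octal_constant(void) { return 0101; }
int hex_constant(void)   { return 0x41; }

/* The same value written as character escapes; both constants have type int. */
int octal_escape(void)   { return '\101'; }
int hex_escape(void)     { return '\x41'; }

/* Hypothetical self-check: all four spellings agree. */
void check_spellings(void)
{
    assert(octal_constant() == 65);
    assert(hex_constant() == 65);
    assert(octal_escape() == 65);
    assert(hex_escape() == 65);
}
```

Note that the octal escape needs no prefix letter at all, while the hex escape needs the x, which fits the point that octal got the shorter notation.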

Answer 9

Answered by PolyThinker

This is the correct behavior, called "integral promotion". It can happen in other cases too (mainly binary operators, if I remember correctly).

EDIT: Just to be sure, I checked my copy of Expert C Programming: Deep Secrets, and I confirmed that a char literal does not start with a type int. It is initially of type char but when it is used in an expression, it is promoted to an int. The following is quoted from the book:

Character literals have type int and they get there by following the rules for promotion from type char. This is too briefly covered in K&R 1, on page 39 where it says:
Every char in an expression is converted into an int....Notice that all float's in an expression are converted to double....Since a function argument is an expression, type conversions also take place when arguments are passed to functions: in particular, char and short become int, float becomes double.

Answer 10

Answered by Crashworks

This is only tangential to the language spec, but in hardware the CPU usually only has one register size -- 32 bits, let's say -- and so whenever it actually works on a char (by adding, subtracting, or comparing it) there is an implicit conversion to int when it is loaded into the register. The compiler takes care of properly masking and shifting the number after each operation so that if you add, say, 2 to (unsigned char) 254, it'll wrap around to 0 instead of 256, but inside the silicon it is really an int until you save it back to memory.
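
The 254 + 2 example can be checked directly. The following sketch uses two made-up helpers, add_as_int and add_and_store, and assumes 8-bit bytes for the wrap-around part (guarded by UCHAR_MAX).

```c
#include <assert.h>
#include <limits.h>

/* The addition itself is carried out in int, so nothing wraps yet. */
int add_as_int(unsigned char value, unsigned char delta)
{
    return value + delta;
}

/* Converting the int result back to unsigned char reduces it
   modulo UCHAR_MAX + 1, which is where the wrap-around happens. */
unsigned char add_and_store(unsigned char value, unsigned char delta)
{
    return (unsigned char)(value + delta);
}

/* Hypothetical self-check for the 254 + 2 case. */
void check_wraparound(void)
{
    assert(add_as_int(254, 2) == 256);
#if UCHAR_MAX == 255   /* assumption: 8-bit bytes */
    assert(add_and_store(254, 2) == 0);
#endif
}
```

So at the language level the wrap to 0 comes from the conversion back to unsigned char, not from the addition.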

It's sort of an academic point because the language could have specified an 8-bit literal type anyway, but in this case the language spec happens to reflect more closely what the CPU is really doing.

(x86 wonks may note that there is e.g. a native addh op that adds the short-wide registers in one step, but inside the RISC core this translates to two steps: add the numbers, then extend sign, like an add/extsh pair on the PowerPC.)
