Original question: http://stackoverflow.com/questions/13642381/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverFlow
C/C++ Why to use unsigned char for binary data?
Asked by nightlytrails
Is it really necessary to use unsigned char to hold binary data, as some libraries that work on character encodings or binary buffers do? To make sense of my question, have a look at the code below:
char c[5], d[5];
c[0] = 0xF0;
c[1] = 0xA4;
c[2] = 0xAD;
c[3] = 0xA2;
c[4] = '\0';
printf("%s\n", c);
memcpy(d, c, 5);
printf("%s\n", d);
Both printf calls print correctly, where f0 a4 ad a2 is the UTF-8 encoding, in hex, of the Unicode code point U+24B62 (𤭢).
Even memcpy correctly copied the bits held by a char.
What reasoning could possibly advocate the use of unsigned char instead of a plain char?
In other related questions unsigned char is highlighted because it is the only (byte/smallest) data type which is guaranteed by the C specification to have no padding. But as the above example showed, the output doesn't seem to be affected by any padding as such.
I have used VC++ Express 2010 and MinGW to compile the above. Although VC gave the warning
warning C4309: '=' : truncation of constant value
the output doesn't seem to reflect that.
P.S. This could be marked as a possible duplicate of "Should a buffer of bytes be signed or unsigned char buffer?", but my intent is different. I am asking why something which seems to work fine with char should be typed unsigned char.
Update: To quote from N3337,
Section 3.9 Types
2 For any object (other than a base-class subobject) of trivially copyable type T, whether or not the object holds a valid value of type T, the underlying bytes (1.7) making up the object can be copied into an array of char or unsigned char. If the content of the array of char or unsigned char is copied back into the object, the object shall subsequently hold its original value.
In view of the above, and given that my original example was on an Intel machine where char defaults to signed char, I am still not convinced that unsigned char should be preferred over char.
Anything else?
Answered by Jens Gustedt
In C, the unsigned char data type is the only data type that has all of the following three properties simultaneously:
- it has no padding bits, that is, all storage bits contribute to the value of the data
- no bitwise operation starting from a value of that type, when converted back into that type, can produce overflow, trap representations or undefined behavior
- it may alias other data types without violating the "aliasing rules", that is, access to the same data through a pointer that is typed differently is guaranteed to see all modifications
If these are the properties of a "binary" data type you are looking for, you definitely should use unsigned char.
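As a rough illustration of the third property, here is a minimal sketch that walks over the bytes of an arbitrary object through an unsigned char pointer. The helper name xor_bytes is made up for this example and does not come from the answer.

```c
#include <stddef.h>

/* Hypothetical helper: XORs together all bytes of an object's representation.
 * Reading any object through an unsigned char pointer is allowed by the
 * aliasing rules, and XOR on unsigned char values cannot overflow. */
unsigned char xor_bytes(const void *object, size_t size)
{
    const unsigned char *bytes = (const unsigned char *)object;
    unsigned char result = 0;
    for (size_t i = 0; i < size; ++i) {
        result ^= bytes[i];
    }
    return result;
}
```

It can be called on any object, for example xor_bytes(&some_struct, sizeof some_struct), without knowing the object's real type.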
For the second property we need a type that is unsigned. For these, all conversions are defined with modulo arithmetic, here modulo UCHAR_MAX+1, which is 256 on 99% of architectures. Any conversion of a wider value to unsigned char thereby just corresponds to truncation to the least significant byte.
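The modulo rule can be sketched in a few lines. The function name to_byte is an assumption made for illustration only.

```c
#include <limits.h>

/* Hypothetical helper: converting to unsigned char is always well defined
 * and reduces the value modulo UCHAR_MAX + 1. */
unsigned char to_byte(long value)
{
    return (unsigned char)value;
}
```

On a platform where UCHAR_MAX is 255, to_byte(0x1F0) yields 0xF0 and to_byte(-1) yields 0xFF.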
The two other character types generally don't work the same way. signed char is signed, so conversion of values that don't fit it is not well defined. char is not fixed to be signed or unsigned: on a particular platform to which your code is ported it might be signed even if it is unsigned on yours.
Answered by Tom Tanner
You'll get most of your problems when comparing the contents of individual bytes:
char c[5];
c[0] = 0xff;
/*blah blah*/
if (c[0] == 0xff)
{
printf("good\n");
}
else
{
printf("bad\n");
}
This can print "bad" because, depending on your compiler, c[0] will be sign-extended to -1, which is not in any way the same as 0xff.
Answered by Lundin
The plain char type is problematic and shouldn't be used for anything but strings. The main problem with char is that you can't know whether it is signed or unsigned: this is implementation-defined behavior. This makes char different from int etc.; int is always guaranteed to be signed.
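If code needs to know which choice the implementation made, the CHAR_MIN macro from limits.h reveals it. The function name below is made up for this sketch.

```c
#include <limits.h>

/* Returns 1 if plain char is signed on this implementation, 0 otherwise.
 * CHAR_MIN is 0 exactly when plain char behaves as an unsigned type. */
int plain_char_is_signed(void)
{
    return CHAR_MIN < 0;
}
```

Code that has to branch on this result is usually a hint that unsigned char would have been the simpler choice.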
Although VC gave the warning ... truncation of constant value
It is telling you that you are trying to store int literals inside char variables. This might be related to the signedness: if you try to store an integer with a value > 0x7F inside a signed character, unexpected things might happen. Formally, the result of such a conversion is implementation-defined in C (C11 6.3.1.3), though in practice you'd just get a weird output if attempting to print the result as an integer value stored inside a (signed) char.
In this specific case, the warning shouldn't matter.
EDIT:
In other related questions unsigned char is highlighted because it is the only (byte/smallest) data type which is guaranteed to have no padding by the C-specification.
In theory, all integer types except unsigned char and signed char are allowed to contain "padding bits", as per C11 6.2.6.2:
"For unsigned integer types other than unsigned char, the bits of the object representation shall be divided into two groups: value bits and padding bits (there need not be any of the latter)."
"For signed integer types, the bits of the object representation shall be divided into three groups: value bits, padding bits, and the sign bit. There need not be any padding bits; signed char shall not have any padding bits."
The C standard is intentionally vague and fuzzy, allowing these theoretical padding bits because:
- It allows different symbol tables than the standard 8-bit ones.
- It allows implementation-defined signedness and weird signed integer formats such as one's complement or "sign and magnitude".
- An integer may not necessarily use all bits allocated.
However, in the real world outside the C standard, the following applies:
- Symbol tables are almost certainly 8 bits (UTF8 or ASCII). Some weird exceptions exist, but clean implementations use the standard type wchar_t when implementing symbol tables larger than 8 bits.
- Signedness is always two's complement.
- An integer always uses all bits allocated.
So there is no real reason to use unsigned char or signed char just to dodge some theoretical scenario in the C standard.
Answered by Paolo Brandoli
Bytes are usually intended as unsigned 8 bit wide integers.
Now, char doesn't specify the sign of the integer: on some compilers char could be signed, on others it may be unsigned.
If I add a bit shift operation to the code you wrote, then its result is implementation-defined when char is signed. The added comparison will also have an unexpected result.
char c[5], d[5];
c[0] = 0xF0;
c[1] = 0xA4;
c[2] = 0xAD;
c[3] = 0xA2;
c[4] = '\0';
c[0] >>= 1; // If char is signed, will the 7th bit go to 0 or stay the same?
bool isBiggerThan0 = c[0] > 0; // FALSE if char is signed!
printf("%s\n", c);
memcpy(d, c, 5);
printf("%s\n", d);
Regarding the warning during compilation: if char is signed, then you are trying to assign the value 0xf0, which cannot be represented in a signed char (range -128 to +127), so it will be converted to a negative value (-16 on typical two's complement implementations).
Declaring the char as unsigned will remove the warning, and it is always good to have a clean build without any warnings.
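For contrast, here is a minimal sketch of the same shift and comparison done on unsigned char, where the outcome does not depend on the compiler. The helper names are hypothetical.

```c
/* Hypothetical helpers using unsigned char so the shift and the comparison
 * behave the same on every implementation. */
unsigned char shift_right_once(unsigned char byte)
{
    /* byte promotes to a non-negative int, so a zero is shifted in from the left. */
    return (unsigned char)(byte >> 1);
}

int is_bigger_than_zero(unsigned char byte)
{
    return byte > 0;
}
```

With the first byte of the example, shift_right_once(0xF0) gives 0x78 and is_bigger_than_zero(0xF0) gives 1.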
Answered by Sander De Dycker
The signedness of the plain char type is implementation-defined, so unless you're actually dealing with character data (a string using the platform's character set, usually ASCII), it's usually better to specify the signedness explicitly by using either signed char or unsigned char.
For binary data, the best choice is most probably unsigned char, especially if bitwise operations will be performed on the data (specifically bit shifting, which doesn't behave the same for signed types as for unsigned types).
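A typical bitwise operation on binary data is assembling a multi-byte integer from a buffer. The sketch below assumes 8-bit bytes and a big-endian layout; read_be32 is a name invented for this example. With a signed char buffer, bytes above 0x7F would be sign-extended and could corrupt the higher bits of the result.

```c
#include <stdint.h>

/* Hypothetical helper: reads a 32-bit big-endian value from four bytes.
 * Each byte is widened to uint32_t before shifting, so no sign extension
 * can leak into the result. */
uint32_t read_be32(const unsigned char *bytes)
{
    return ((uint32_t)bytes[0] << 24) |
           ((uint32_t)bytes[1] << 16) |
           ((uint32_t)bytes[2] << 8)  |
            (uint32_t)bytes[3];
}
```

Applied to the four bytes f0 a4 ad a2 from the question, it returns 0xF0A4ADA2.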
Answered by utnapistim
Is it really necessary to use unsigned char to hold binary data as in some libraries which work on character encoding or binary buffers?
"Really" necessary? No.
It is a very good idea though, and there are many reasons for this.
Your example uses printf, which is not type-safe. That is, printf takes its formatting cues from the format string and not from the data type. You could just as easily have tried:
printf("%s\n", (void*)c);
... and the result would have been the same. If you try the same thing with C++ iostreams, the result will be different (depending on the signedness of c).
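Because printf trusts the format string, the signedness of a byte starts to matter as soon as it is printed as a number rather than as a string. The following sketch formats a buffer as hex; format_hex is a hypothetical helper, and taking unsigned char input is the assumption that keeps a byte like 0xF0 from being sign-extended when it is promoted for the variadic call.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: writes the bytes as lowercase hex separated by spaces.
 * Returns 0 on success, -1 if the output buffer is too small. */
int format_hex(const unsigned char *data, size_t count, char *out, size_t out_size)
{
    size_t used = 0;
    if (out_size == 0) {
        return -1;
    }
    out[0] = '\0';
    for (size_t i = 0; i < count; ++i) {
        /* The value is non-negative, so %02x always prints exactly two digits. */
        int written = snprintf(out + used, out_size - used,
                               i ? " %02x" : "%02x", (unsigned)data[i]);
        if (written < 0 || (size_t)written >= out_size - used) {
            return -1;
        }
        used += (size_t)written;
    }
    return 0;
}
```

For the bytes in the question this produces the text "f0 a4 ad a2".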
What reasoning could possibly advocate the use of unsigned char instead of a plain char?
Signed specifies that the most significant bit of the data (for a char, the 8th bit) represents the sign. Since you obviously do not need that, you should specify that your data is unsigned (the "sign" bit then represents data, not the sign of the other bits).
Answered by chill
Well, what do you call "binary data"? This is a bunch of bits, without any meaning assigned to them by the specific part of the software that calls them "binary data". What's the closest primitive data type that conveys the lack of any specific meaning for any one of these bits? I think unsigned char.
Answered by Philipp
I am asking why something which seems to be working as fine with char should be typed unsigned char?
If you do things which are not "correct" in the sense of the standard, you rely on undefined behaviour. Your compiler might do it the way you want today, but you don't know what it will do tomorrow. You don't know what GCC or VC++ 2012 does, or whether the behaviour depends on external factors, Debug/Release builds, etc. As soon as you leave the safe path of the standard, you might run into trouble.