Notice: this page reproduces a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/4418708/

Date: 2020-08-28 15:20:58

What's the rationale for null terminated strings?

Tags: c++, c, string, null-terminated

Asked by Billy ONeal

As much as I love C and C++, I can't help but scratch my head at the choice of null terminated strings:

  • Length prefixed (e.g. Pascal) strings existed before C.
  • Length prefixed strings make several algorithms faster by allowing constant time length lookup.
  • Length prefixed strings make it more difficult to cause buffer overrun errors.
  • Even on a 32 bit machine, if you allow the string to be the size of available memory, a length prefixed string is only three bytes wider than a null terminated string. On 16 bit machines this is a single byte. On 64 bit machines, 4GB is a reasonable string length limit, but even if you want to expand it to the size of the machine word, 64 bit machines usually have ample memory making the extra seven bytes sort of a null argument. I know the original C standard was written for insanely poor machines (in terms of memory), but the efficiency argument doesn't sell me here.
  • Pretty much every other language (e.g. Perl, Pascal, Python, Java, C#, etc.) uses length prefixed strings. These languages usually beat C in string manipulation benchmarks because they are more efficient with strings.
  • C++ rectified this a bit with the std::basic_string template, but plain character arrays expecting null terminated strings are still pervasive. This is also imperfect because it requires heap allocation.
  • Null terminated strings have to reserve a character (namely, null), which cannot exist in the string, while length prefixed strings can contain embedded nulls.

Several of these things have come to light more recently than C, so it would make sense for C to not have known of them. However, several were plain well before C came to be. Why would null terminated strings have been chosen instead of the obviously superior length prefixing?

EDIT: Since some asked for facts (and didn't like the ones I already provided) on my efficiency point above, they stem from a few things:

  • Concat using null terminated strings requires O(n + m) time complexity. Length prefixing often requires only O(m).
  • Length using null terminated strings requires O(n) time complexity. Length prefixing is O(1).
  • Length and concat are by far the most common string operations. There are several cases where null terminated strings can be more efficient, but these occur much less often.
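
The cost difference claimed above can be sketched in C. This is only an illustration under stated assumptions, not code from the question: struct lstr, lstr_length, lstr_append and cstr_append are hypothetical names, and the caller is assumed to supply a buffer of known capacity.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical length-carrying string: the length travels with the buffer. */
struct lstr {
    size_t len;   /* number of bytes in use */
    size_t cap;   /* bytes available in buf */
    char  *buf;   /* not necessarily NUL-terminated */
};

/* O(1): the length is simply read back. */
size_t lstr_length(const struct lstr *s)
{
    return s->len;
}

/* O(m): only the m bytes of src are touched, because the end of dst is
   already known. Returns 0 on success, -1 if dst has no room. */
int lstr_append(struct lstr *dst, const char *src, size_t m)
{
    assert(dst != NULL && src != NULL);
    if (m > dst->cap - dst->len)
        return -1;
    memcpy(dst->buf + dst->len, src, m);
    dst->len += m;
    return 0;
}

/* O(n + m): strlen walks all n bytes of dst just to find its end, then the
   m bytes of src (plus the terminator) are copied.
   Returns 0 on success, -1 if dst has no room. */
int cstr_append(char *dst, size_t cap, const char *src)
{
    size_t n = strlen(dst);
    size_t m = strlen(src);
    if (n + m + 1 > cap)
        return -1;
    memcpy(dst + n, src, m + 1);
    return 0;
}
```

Appending "bar" to a 3-byte lstr copies 3 bytes; the same append on a NUL-terminated buffer first scans the existing 3 bytes, and that scan grows with the destination.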

From answers below, these are some cases where null terminated strings are more efficient:

  • When you need to cut off the start of a string and need to pass it to some method. You can't really do this in constant time with length prefixing even if you are allowed to destroy the original string, because the length prefix probably needs to follow alignment rules.
  • In some cases where you're just looping through the string character by character you might be able to save a CPU register. Note that this works only in the case that you haven't dynamically allocated the string (Because then you'd have to free it, necessitating using that CPU register you saved to hold the pointer you originally got from malloc and friends).
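
The first case can be illustrated with a few lines of C. skip_prefix is a hypothetical helper name, and the sketch assumes the caller already knows that the string has at least n bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns the part of s after its first n bytes. No copy and no
   allocation: the result shares storage (and the terminator) with s.
   The caller must guarantee that s has at least n bytes. */
const char *skip_prefix(const char *s, size_t n)
{
    assert(s != NULL);
    return s + n;
}
```

With an in-band length prefix, the same operation would need a fresh length written somewhere in front of the new start, which is why it cannot simply be pointer arithmetic.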

None of the above are nearly as common as length and concat.

There's one more asserted in the answers below:

  • You need to cut off the end of the string

but this one is incorrect -- it's the same amount of time for null terminated and length prefixed strings. (Null terminated strings just stick a null where you want the new end to be, length prefixers just subtract from the prefix.)
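
A minimal sketch of both truncations, assuming a one-byte length prefix for the length-prefixed case. cstr_truncate and pstr_truncate are hypothetical names; neither is a standard function.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* NUL-terminated: write a terminator where the new end should be.
   The caller must guarantee new_len <= the current length. */
void cstr_truncate(char *s, size_t new_len)
{
    assert(s != NULL);
    s[new_len] = '\0';
}

/* Hypothetical length-prefixed layout: one length byte, then the data.
   Shrinking only rewrites the prefix; the bytes past it are ignored. */
void pstr_truncate(unsigned char *p, unsigned char new_len)
{
    assert(p != NULL && new_len <= p[0]);
    p[0] = new_len;
}
```

Both functions do a single store, so neither representation has an advantage here.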

Accepted answer by Hans Passant

From the horse's mouth

None of BCPL, B, or C supports character data strongly in the language; each treats strings much like vectors of integers and supplements general rules by a few conventions. In both BCPL and B a string literal denotes the address of a static area initialized with the characters of the string, packed into cells. In BCPL, the first packed byte contains the number of characters in the string; in B, there is no count and strings are terminated by a special character, which B spelled *e. This change was made partially to avoid the limitation on the length of a string caused by holding the count in an 8- or 9-bit slot, and partly because maintaining the count seemed, in our experience, less convenient than using a terminator.

Dennis M Ritchie, Development of the C Language

Answer by Robert S Ciaccio

C doesn't have a string as part of the language. A 'string' in C is just a pointer to char. So maybe you're asking the wrong question.

"What's the rationale for leaving out a string type" might be more relevant. To that I would point out that C is not an object oriented language and only has basic value types. A string is a higher level concept that has to be implemented by combining values of other types in some way. C is at a lower level of abstraction.

In light of the raging squall below:

I just want to point out that I'm not trying to say this is a stupid or bad question, or that the C way of representing strings is the best choice. I'm trying to clarify that the question would be more succinctly put if you take into account the fact that C has no mechanism for differentiating a string as a datatype from a byte array. Is this the best choice in light of the processing and memory power of today's computers? Probably not. But hindsight is always 20/20 and all that :)

Answer by kriss

The question is asked as a Length Prefixed Strings (LPS) vs zero terminated strings (SZ) thing, but it mostly exposes the benefits of length prefixed strings. That may seem overwhelming, but to be honest we should also consider the drawbacks of LPS and the advantages of SZ.

As I understand it, the question may even be understood as a biased way to ask "what are the advantages of Zero Terminated Strings?".

Advantages (I see) of Zero Terminated Strings:

  • very simple: no need to introduce new concepts in the language, char arrays/char pointers can do.
  • the core language just includes minimal syntactic sugar to convert something between double quotes to a bunch of chars (really a bunch of bytes). In some cases it can be used to initialize things completely unrelated to text. For instance, the xpm image file format is valid C source that contains image data encoded as a string.
  • by the way, you can put a zero in a string literal; the compiler will just add another one at the end of the literal: "this\0is\0valid\0C". Is it a string? Or four strings? Or a bunch of bytes...
  • flat implementation, no hidden indirection, no hidden integer.
  • no hidden memory allocation involved (well, some infamous non-standard functions like strdup perform allocation, but that's mostly a source of problems).
  • no specific issue for small or large hardware (imagine the burden of managing a 32-bit length prefix on 8-bit microcontrollers, or the restriction of limiting string size to less than 256 bytes; that was a problem I actually had with Turbo Pascal eons ago).
  • the implementation of string manipulation is just a handful of very simple library functions.
  • efficient for the main use of strings: constant text read sequentially from a known start (mostly messages to the user).
  • the terminating zero is not even mandatory; all the necessary tools to manipulate chars like a bunch of bytes are available. When performing array initialisation in C, you can even avoid the NUL terminator. Just set the right size: char a[3] = "foo"; is valid C (not C++) and won't put a final zero in a.
  • coherent with the unix point of view that "everything is a file", including "files" that have no intrinsic length like stdin and stdout. You should remember that the open, read and write primitives are implemented at a very low level. They are not library calls, but system calls. And the same API is used for binary or text files. File reading primitives get a buffer address and a size and return the new size. And you can use strings as the buffer to write. Using another kind of string representation would imply that you can't easily use a literal string as the buffer to output, or you would have to make it behave very strangely when casting it to char*: namely, not returning the address of the string, but instead the actual data.
  • very easy to manipulate text data read from a file in-place, without a useless copy of the buffer: just insert zeroes at the right places (well, not really with modern C, as double quoted strings are const char arrays nowadays, usually kept in a non-modifiable data segment).
  • prepending an int value of whatever size would imply alignment issues. The initial length should be aligned, but there is no reason to do that for the character data (and again, forcing alignment of strings would imply problems when treating them as a bunch of bytes).
  • length is known at compile time for constant literal strings (sizeof). So why would anyone want to store it in memory, prepended to the actual data?
  • in a way C is doing as (nearly) everyone else does: strings are viewed as arrays of char. As array length is not managed by C, it is logical that length is not managed for strings either. The only surprising thing is the 0 item added at the end, but that's just at the core language level when typing a string between double quotes. Users can perfectly well call string manipulation functions passing a length, or even use plain memcpy instead. SZ are just a facility. In most other languages array length is managed, so it's logical that it is the same for strings.
  • in modern times, anyway, 1-byte character sets are not enough and you often have to deal with encoded unicode strings where the number of characters is very different from the number of bytes. It implies that users will probably want more than "just the size", but also other information. Keeping the length gives us nothing (particularly no natural place to store them) regarding these other useful pieces of information.
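
Several of the points above (zeros embedded in a literal, sizeof being known at compile time, and the unterminated char a[3] initialisation) can be checked with a small sketch. The array and function names are hypothetical, and the file must be compiled as C, since C++ rejects the three-byte initialisation.

```c
#include <stddef.h>
#include <string.h>

/* A literal with embedded zeros: 15 written bytes plus the implicit
   terminator that the compiler appends. */
static const char packed[] = "this\0is\0valid\0C";

/* Exactly three bytes and no terminator: valid C, rejected by C++. */
static const char three[3] = "foo";

/* Known at compile time; counts the embedded zeros and the final one. */
size_t packed_size(void)
{
    return sizeof packed;
}

/* Found at run time; stops at the first zero. */
size_t packed_strlen(void)
{
    return strlen(packed);
}

/* The array holds just 'f', 'o', 'o'. */
size_t three_size(void)
{
    return sizeof three;
}
```

Here packed_size() is 16 while packed_strlen() is 4, which is the "one string or four?" ambiguity from the list. Because three has no terminator, it must only be passed to functions that take an explicit size, such as memcpy or memcmp.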

That said, there is no need to complain in the rare cases where standard C strings are indeed inefficient. Libs are available. If I followed that trend, I should complain that standard C does not include any regex support functions... but really everybody knows it's not a real problem as there are libraries available for that purpose. So when string manipulation efficiency is wanted, why not use a library like bstring? Or even C++ strings?

EDIT: I recently had a look at D strings. It is interesting to see that the solution chosen is neither a size prefix nor zero termination. As in C, literal strings enclosed in double quotes are just shorthand for immutable char arrays, and the language also has a string keyword meaning that (immutable char array).

But D arrays are much richer than C arrays. In the case of static arrays, the length is known at compile time, so there is no need to store it. In the case of dynamic arrays, the length is available, but the D documentation does not state where it is kept. For all we know, the compiler could choose to keep it in some register, or in some variable stored far away from the character data.

On normal char arrays or non-literal strings there is no final zero, hence the programmer has to put it there himself if he wants to call some C function from D. In the particular case of literal strings, however, the D compiler still puts a zero at the end of each string (to allow an easy cast to C strings, making it easier to call C functions?), but this zero is not part of the string (D does not count it in the string size).

The only thing that disappointed me somewhat is that strings are supposed to be utf-8, but length apparently still returns a number of bytes (at least it's true on my compiler gdc) even when using multi-byte chars. It is unclear to me if it's a compiler bug or on purpose. (OK, I have probably found out what happened. To tell the D compiler that your source uses utf-8 you have to put some stupid byte order mark at the beginning. I write stupid because I know of no editor doing that, especially for UTF-8, which is supposed to be ASCII compatible.)

Answer by khachik

I think it has historical reasons; I found this in Wikipedia:

At the time C (and the languages that it was derived from) were developed, memory was extremely limited, so using only one byte of overhead to store the length of a string was attractive. The only popular alternative at that time, usually called a "Pascal string" (though also used by early versions of BASIC), used a leading byte to store the length of the string. This allows the string to contain NUL and made finding the length need only one memory access (O(1) (constant) time). But one byte limits the length to 255. This length limitation was far more restrictive than the problems with the C string, so the C string in general won out.

Answer by Daniel C. Sobral

Calavera is right, but as people don't seem to get his point, I'll provide some code examples.

First, let's consider what C is: a simple language, where all code has a pretty direct translation into machine language. All types fit into registers and on the stack, and it doesn't require an operating system or a big run-time library to run, since it was meant to write these things (a task to which it is superbly well-suited, considering there isn't even a likely competitor to this day).

If C had a string type, like int or char, it would be a type which didn't fit in a register or on the stack, and would require memory allocation (with all its supporting infrastructure) to be handled in any way. All of which goes against the basic tenets of C.

So, a string in C is:

char *s;

So, let's assume then that this were length-prefixed. Let's write the code to concatenate two strings:

#include <stdlib.h>
#include <string.h>

char* concat(char* s1, char* s2)
{
    /* What? What is the type of the length of the string? */
    int l1 = *(int*) s1;
    /* How much? How much must I skip? */
    char *s1s = s1 + sizeof(int);
    int l2 = *(int*) s2;
    char *s2s = s2 + sizeof(int);
    int l3 = l1 + l2;
    char *s3 = (char*) malloc(l3 + sizeof(int));
    char *s3s;
    if (s3 == NULL)
        return NULL; /* allocation failed */
    s3s = s3 + sizeof(int);
    memcpy(s3s, s1s, l1);
    memcpy(s3s + l1, s2s, l2);
    *(int*) s3 = l3;
    return s3;
}

Another alternative would be using a struct to define a string:

struct string {
  int len; /* cannot be left implementation-defined */
  char* buf;
};

At this point, all string manipulation would require two allocations to be made, which, in practice, means you'd go through a library to do any handling of it.

The funny thing is... structs like that do exist in C! They are just not used for your day-to-day handling of displaying messages to the user.

So, here is the point Calavera is making: there is no string type in C. To do anything with it, you'd have to take a pointer and decode it as a pointer to two different types, and then it becomes very relevant what is the size of a string, and cannot just be left as "implementation defined".

Now, C can handle memory in any way, and the mem functions in the library (in <string.h>, even!) provide all the tooling you need to handle memory as a pair of pointer and size. The so-called "strings" in C were created for just one purpose: showing messages in the context of writing an operating system intended for text terminals. And, for that, null termination is enough.
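
As a small illustration of the pointer-and-size style, here is a sketch built only on memchr from <string.h>. count_byte is a hypothetical helper, not a library function.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Counts how many times byte c occurs in the n bytes starting at buf.
   Zeros are ordinary data here; nothing is treated as a terminator. */
size_t count_byte(const void *buf, size_t n, int c)
{
    const unsigned char *p = buf;
    const unsigned char *end;
    size_t count = 0;

    assert(buf != NULL);
    end = p + n;
    /* memchr returns the next match or NULL, which ends the loop. */
    while (p < end && (p = memchr(p, c, (size_t)(end - p))) != NULL) {
        count++;
        p++;   /* resume just past the match */
    }
    return count;
}
```

Because the size is passed explicitly, the same function works on text, on binary data, and on buffers that contain embedded zeros.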

Answer by R.. GitHub STOP HELPING ICE

Obviously for performance and safety, you'll want to keep the length of a string while you're working with it rather than repeatedly performing strlen or the equivalent on it. However, storing the length in a fixed location just before the string contents is an incredibly bad design. As Jörgen pointed out in the comments on Sanjit's answer, it precludes treating the tail of a string as a string, which for example makes a lot of common operations like path_to_filename or filename_to_extension impossible without allocating new memory (and incurring the possibility of failure and error handling). And then of course there's the issue that nobody can agree how many bytes the string length field should occupy (plenty of bad "Pascal string" languages used 16-bit fields or even 24-bit fields which preclude processing of long strings).
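
The answer names path_to_filename and filename_to_extension without showing them. The following is one possible sketch; the signatures, the '/' separator and the behavior when nothing is found are all assumptions made for illustration.

```c
#include <assert.h>
#include <string.h>

/* Returns a pointer to the part of path after the last '/', or path
   itself when there is no '/'. The result points into path; nothing is
   allocated, so nothing can fail. */
const char *path_to_filename(const char *path)
{
    const char *slash;

    assert(path != NULL);
    slash = strrchr(path, '/');
    return slash ? slash + 1 : path;
}

/* Returns a pointer to the last '.' in filename, or to its terminating
   NUL (an empty string) when there is no '.'. */
const char *filename_to_extension(const char *filename)
{
    const char *dot;

    assert(filename != NULL);
    dot = strrchr(filename, '.');
    return dot ? dot : filename + strlen(filename);
}
```

Both results are ordinary strings because the tail of a NUL-terminated string is itself NUL-terminated.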

C's design of letting the programmer choose if/where/how to store the length is much more flexible and powerful. But of course the programmer has to be smart. C punishes stupidity with programs that crash, grind to a halt, or give your enemies root.

Answer by dvhh

Laziness, register frugality and portability, considering the assembly guts of any language, especially C, which is one step above assembly (thus inheriting a lot of assembly legacy code). You would agree that a null char was otherwise useless in those ASCII days (and it was probably as good as an EOF control char).

Let's see it in pseudocode:

function readString(string) // 1 parameter: 1 register or 1 stack entry
    pointer=addressOf(string) 
    while(string[pointer]!=CONTROL_CHAR) do
        read(string[pointer])
        increment pointer

Total: 1 register used

case 2

 function readString(length,string) // 2 parameters: 2 register used or 2 stack entries
     pointer=addressOf(string) 
     while(length>0) do 
         read(string[pointer])
         increment pointer
         decrement length

Total: 2 registers used
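
The two cases might look like this in C. The function names are hypothetical, and whether the first form really saves a register depends on the compiler and the target, so treat this as a sketch of the loop shapes rather than a measurement.

```c
#include <stddef.h>

/* Case 1: terminator-driven. The only loop-control state is the pointer. */
size_t count_spaces_sz(const char *s)
{
    size_t spaces = 0;

    for (; *s != '\0'; s++)
        if (*s == ' ')
            spaces++;
    return spaces;
}

/* Case 2: length-driven. The pointer and the remaining count are both
   loop-control state. */
size_t count_spaces_len(const char *s, size_t len)
{
    size_t spaces = 0;

    for (; len > 0; s++, len--)
        if (*s == ' ')
            spaces++;
    return spaces;
}
```

The length-driven form can stop early or run over bytes that contain zeros, which the terminator-driven form cannot do.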

That might seem shortsighted, but consider the frugality in code and registers (which were at a PREMIUM at that time, the time when, you know, they used punch cards). Being faster (when processor speed could be counted in kHz), this "hack" was pretty darn good and easily portable to register-less processors.

For argument's sake I will implement 2 common string operations:

stringLength(string)
     pointer=addressOf(string)
     while(string[pointer]!=CONTROL_CHAR) do
         increment pointer
     return pointer-addressOf(string)

The complexity is O(n), whereas in most cases the PASCAL string length is O(1), because the length of the string is prepended to the string structure (which also means that the counting had to be carried out at an earlier stage).

concatString(string1,string2)
     length1=stringLength(string1)
     length2=stringLength(string2)
     string3=allocate(length1+length2+1) // room for both strings and the terminator
     pointer1=addressOf(string1)
     pointer3=addressOf(string3)
     while(string1[pointer1]!=CONTROL_CHAR) do
         string3[pointer3]=string1[pointer1]
         increment pointer3
         increment pointer1
     pointer2=addressOf(string2)
     while(string2[pointer2]!=CONTROL_CHAR) do
         string3[pointer3]=string2[pointer2]
         increment pointer3
         increment pointer2
     string3[pointer3]=CONTROL_CHAR
     return string3

The complexity is O(n), and prepending the string length wouldn't change the complexity of the operation, though I admit it would take about 3 times less time.

On the other hand, if you use PASCAL strings you would have to redesign your API to take register length and endianness into account. The PASCAL string has the well-known limitation of 255 chars (0xFF) because the length was stored in 1 byte (8 bits), and if you wanted a longer string (16 bits -> anything) you would have to take the architecture into account in one layer of your code, which in most cases would mean incompatible string APIs if you wanted longer strings.

Example:

Suppose a file was written with your length-prefixed string API on an 8-bit computer and then has to be read on, say, a 32-bit computer. What would the lazy program do? It would assume that the first 4 bytes are the length of the string, allocate that much memory, and then attempt to read that many bytes. Another case would be a string with a 32-bit length written on a PPC (big endian) and read on an x86 (little endian); if you don't know that one was written by the other, there will be trouble. A length of 1 (0x00000001) would become 16777216 (0x01000000), that is, 16 MB allocated for reading a 1-byte string. Of course you could say that people should agree on one standard, but even 16-bit Unicode comes in little- and big-endian variants.
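
The byte-order mix-up can be reproduced with a short sketch. read_len_be and read_len_le are hypothetical helpers that decode the same four length bytes in the two byte orders.

```c
#include <stdint.h>

/* Decode a 4-byte length field as big-endian (most significant byte first). */
uint32_t read_len_be(const unsigned char b[4])
{
    return ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16) |
           ((uint32_t)b[2] << 8)  |  (uint32_t)b[3];
}

/* Decode the same 4 bytes as little-endian (least significant byte first). */
uint32_t read_len_le(const unsigned char b[4])
{
    return ((uint32_t)b[3] << 24) | ((uint32_t)b[2] << 16) |
           ((uint32_t)b[1] << 8)  |  (uint32_t)b[0];
}
```

For the bytes 00 00 00 01, read_len_be gives 1 and read_len_le gives 16777216. Decoding byte by byte like this, instead of casting the buffer to an integer pointer, is one way for a file format to fix its byte order regardless of the host.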

Of course C would have its issues too, but it would be very little affected by the issues raised here.

Answer by Jonathan Wood

In many ways, C was primitive. And I loved it.

It was a step above assembly language, giving you nearly the same performance with a language that was much easier to write and maintain.

The null terminator is simple and requires no special support by the language.

Looking back, it doesn't seem that convenient. But I used assembly language back in the 80s and it seemed very convenient at the time. I just think software is continually evolving, and the platforms and tools continually get more and more sophisticated.

Answer by Cristian

Assuming for a moment that C implemented strings the Pascal way, by prefixing them with their length: is a 7-char string the same DATA TYPE as a 3-char string? If the answer is yes, then what kind of code should the compiler generate when I assign the former to the latter? Should the string be truncated, or automatically resized? If resized, should that operation be protected by a lock so as to make it thread safe? The C approach sidestepped all these issues, like it or not :)

Answer by Pyry Jahkola

Somehow I understood the question to imply there's no compiler support for length-prefixed strings in C. The following example shows that you can at least start your own C string library, where string lengths are counted at compile time, with a construct like this:

#include <stdio.h>

#define PREFIX_STR(s) ((prefix_str_t){ sizeof(s)-1, (s) })

typedef struct { int n; char * p; } prefix_str_t;

int main(void) {
    prefix_str_t string1, string2;

    string1 = PREFIX_STR("Hello!");
    string2 = PREFIX_STR("Allows \0 chars (even if printf directly doesn't)");

    printf("%d %s\n", string1.n, string1.p); /* prints: "6 Hello!" */
    printf("%d %s\n", string2.n, string2.p); /* prints: "48 Allows " */

    return 0;
}

This won't come without issues, however, as you need to be careful about when to free that string pointer and when it is statically allocated (a literal char array).

Edit: As a more direct answer to the question, my view is that this was the way C could support both having the string length available (as a compile time constant), should you need it, and still having no memory overhead if you want to use only pointers and zero termination.

Of course it seems like working with zero-terminated strings was the recommended practice, since the standard library in general doesn't take string lengths as arguments, and since extracting the length isn't as straightforward as char * s = "abc", as my example shows.