string: Does a string's length equal its byte size?

Disclaimer: this page is a translated copy of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/409765/

Date: 2020-09-09 00:21:21  Source: igfitidea

Does a string's length equal the byte size?

Tags: string, byte

Asked by Darryl Hein

Exactly that: does a string's length equal its byte size? Does it depend on the language?

I think it does, but I just want to make sure.

Additional Info: I'm just wondering in general. My specific situation was PHP with MySQL.

As the answer is no, that's all I need to know.

Answered by Toon Krijthe

Nope. A zero-terminated string has one extra byte. A Pascal string (the Delphi shortstring) has an extra byte for the length. And Unicode strings can have more than one byte per character.

With Unicode, it depends on the encoding. It could be 2 or 4 bytes per character, or even a mix of 1 to 4 bytes per character (as in UTF-8).
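
As a rough illustration of how the byte count depends on the encoding, the Python sketch below encodes a few sample strings three ways. The helper name `byte_sizes` and the sample strings are assumptions made for this example, not part of the original answer.

```python
def byte_sizes(text):
    """Return {encoding: number of bytes} for a few common Unicode encodings."""
    # The "-le" variants are used so that no byte-order mark is prepended.
    encodings = ("utf-8", "utf-16-le", "utf-32-le")
    return {name: len(text.encode(name)) for name in encodings}


samples = (
    "abc",                 # plain ASCII
    "h\u00e9llo",          # "héllo": one accented letter
    "\u65e5\u672c\u8a9e",  # three CJK characters
    "\U0001F600",          # one emoji above U+FFFF
)

for sample in samples:
    # len(sample) is the number of code points; the dict holds byte counts.
    print(len(sample), byte_sizes(sample))
```

Only the plain ASCII sample has a UTF-8 byte count equal to its character count; every other combination differs.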

Answered by Jon Skeet

It entirely depends on the platform and representation.

For example, in .NET a string takes two bytes in memory per UTF-16 code unit. However, a full Unicode character in the range U+10000 to U+10FFFF requires a surrogate pair, that is, two UTF-16 code units. The in-memory form also has an overhead for the length of the string and possibly some padding, as well as the normal object overhead of a type pointer etc.
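
The surrogate-pair point can be checked outside .NET too. The sketch below uses Python (not .NET) to count the 16-bit units a string needs when stored as UTF-16; the helper name `utf16_code_units` is made up for this example.

```python
def utf16_code_units(text):
    """Number of 16-bit code units needed to store text as UTF-16."""
    # Each UTF-16 code unit is exactly two bytes.
    return len(text.encode("utf-16-le")) // 2


bmp_char = "\u00e9"          # "é", below U+10000
astral_char = "\U0001F600"   # an emoji above U+FFFF, so it needs a surrogate pair

print(utf16_code_units(bmp_char))     # 1 code unit, 2 bytes
print(utf16_code_units(astral_char))  # 2 code units, 4 bytes
print(len(astral_char))               # 1, because Python counts code points
```

A platform that reports string length in UTF-16 code units, as .NET does, would report 2 for the emoji even though it is a single character.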

Now, when you write a string out to disk (or the network, etc) from .NET, you specify the encoding (with most classes defaulting to UTF-8). At that point, the size depends very much on the encoding. ASCII always takes a single byte per character, but is very limited (no accents etc); UTF-8 gives the full Unicode range with a variable encoding (all ASCII characters are represented in a single byte, but others take up more). UTF-32 always uses exactly 4 bytes for any Unicode character - the list goes on.
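
To make the "size depends on the encoding" point concrete, here is a minimal Python sketch (Python rather than .NET) that writes the same text to a temporary file under several encodings and reports the size on disk. The function name `encoded_file_size` is an assumption for this example.

```python
import os
import tempfile


def encoded_file_size(text, encoding):
    """Write text to a temporary file in the given encoding; return its size in bytes."""
    fd, path = tempfile.mkstemp()
    try:
        # newline="" stops newline translation from changing the byte count.
        with os.fdopen(fd, "w", encoding=encoding, newline="") as handle:
            handle.write(text)
        return os.path.getsize(path)
    finally:
        os.remove(path)


text = "caf\u00e9"  # "café": three ASCII letters plus one accented letter

for encoding in ("utf-8", "utf-16-le", "utf-32-le"):
    print(encoding, encoded_file_size(text, encoding))

# ASCII cannot represent the accented letter at all.
try:
    encoded_file_size(text, "ascii")
except UnicodeEncodeError as error:
    print("ascii cannot encode this text:", error.reason)
```

The four-character string occupies 5, 8, or 16 bytes depending on the encoding, and ASCII rejects it outright.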

As you can see, it's not a simple topic. To work out how much space a string is going to take up you'll need to specify exactly what the situation is: whether it's an object in memory on some platform (and if so, which platform, potentially even down to the implementation and operating system settings), or whether it's a raw encoded form such as a text file, and if so, using which encoding.
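
The gap between "encoded size" and "in-memory object size" can be seen in Python as well. This is only a sketch: `sys.getsizeof` reports what the running interpreter allocates for the object, and the exact figure varies by implementation and version, so no specific value is shown.

```python
import sys

text = "hello"
encoded = text.encode("utf-8")

print(len(text))            # 5 characters
print(len(encoded))         # 5 bytes in the raw UTF-8 form
print(sys.getsizeof(text))  # size of the in-memory object, header included
```

On CPython the last number is considerably larger than 5, because it includes the object header and bookkeeping in addition to the character data.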

Answered by Steven Robbins

It depends on what you mean by "length". If you mean "number of characters" then, no, many languages/encoding methods use more than one byte per character.
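
Even "number of characters" is ambiguous. The Python sketch below (sample text chosen for this example) shows one visible character that has two different code-point counts and two different byte counts, depending on how it is composed.

```python
import unicodedata

# "e" followed by COMBINING ACUTE ACCENT: displayed as one character.
decomposed = "e\u0301"
# NFC normalization merges the pair into the single code point U+00E9.
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed), len(decomposed.encode("utf-8")))  # 2 code points, 3 bytes
print(len(composed), len(composed.encode("utf-8")))      # 1 code point, 2 bytes
```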

Answered by Malfist

Not always, it depends on the encoding.

Answered by joel.neely

There's no single answer; it depends on language and implementation (remember that some languages have multiple implementations!)

Zero-terminated ASCII strings occupy at least one more byte than the "content" of the string. (More may be allocated, depending on how the string was created.)

Non-zero-terminated strings use a descriptor (or similar structure) to record length, which takes extra memory somewhere.

Unicode strings in various languages (those that store text as UTF-16, such as Java and C#) use two bytes per char, and some characters need two chars.

Strings in an object store may be referenced via handles, which adds a layer of indirection (and more data) in order to simplify memory management.
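
The zero-terminated and length-prefixed layouts mentioned in this answer can be sketched in Python. `ctypes.create_string_buffer` stands in for a C string here, and the Pascal-style layout is modelled by hand, so treat this as an illustration rather than how any particular compiler lays out its strings.

```python
import ctypes

content = b"hello"  # 5 bytes of ASCII content

# create_string_buffer appends a NUL terminator, as a C string would have.
c_buffer = ctypes.create_string_buffer(content)
print(ctypes.sizeof(c_buffer))  # 6: content plus the terminator

# Pascal-style short string, modelled by hand: one length byte, then the content.
pascal_style = bytes([len(content)]) + content
print(len(pascal_style))        # 6: length prefix plus content
```

Either way the storage is one byte larger than the five-byte content.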

Answered by theschmitzer

You are correct. If you encode as ASCII, there is one byte per character. Otherwise, it is one or more bytes per character.

In particular, it is important to know how this affects substring operations. If you don't have one byte per character, does s[n] get the nth byte or the nth char? With a variable-width encoding, getting the nth char means scanning from the start, so it takes time proportional to n rather than the constant time possible with one byte per character.
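
The byte-versus-character indexing question can be illustrated with UTF-8 bytes in Python. The helpers `sequence_length` and `nth_char_utf8` are hypothetical names written for this sketch, and they assume the input is valid UTF-8 and that n is in range.

```python
def sequence_length(lead_byte):
    """Length in bytes of the UTF-8 sequence that starts with lead_byte."""
    if lead_byte < 0x80:
        return 1
    if lead_byte >> 5 == 0b110:
        return 2
    if lead_byte >> 4 == 0b1110:
        return 3
    if lead_byte >> 3 == 0b11110:
        return 4
    raise ValueError("not a UTF-8 lead byte")


def nth_char_utf8(data, n):
    """Return the n-th character (0-based) of UTF-8 bytes by scanning from the start."""
    index = 0
    for _ in range(n):  # one step per skipped character: linear in n
        index += sequence_length(data[index])
    end = index + sequence_length(data[index])
    return data[index:end].decode("utf-8")


data = "h\u00e9llo".encode("utf-8")  # "héllo": the accented letter takes two bytes

print(data[1])                 # the byte at offset 1: the lead byte of "é"
print(nth_char_utf8(data, 1))  # the character at position 1: "é"
print(nth_char_utf8(data, 2))  # "l", which starts at byte offset 3
```

Indexing the raw bytes is constant time but can land in the middle of a character, while indexing by character has to walk over every earlier character first.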
