Python: Get size of string in bytes

Notice: this page is adapted from a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/30686701/

Time: 2020-08-19 08:49:19  Source: igfitidea

python

Asked by Iffat Fatima

I have a string that is to be sent over a network. I need to check the total number of bytes it is represented in.

sys.getsizeof(string_name) returns extra bytes. For example, sys.getsizeof("a") returns 22, while one character is only represented in 1 byte in Python. Is there some other method to find this?

Accepted answer by Kris

If you want the number of bytes in a string once it is encoded as UTF-8, this function should do it for you reliably.

def utf8len(s):
    return len(s.encode('utf-8'))
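
As a quick check of the function above, here is a minimal sketch showing that the byte count depends on which characters the string contains, not only on how many there are. The sample strings are illustrative choices, not part of the original answer.

```python
def utf8len(s):
    """Return the number of bytes in the UTF-8 encoding of s."""
    return len(s.encode('utf-8'))

# Illustrative samples (assumed): characters that need 1, 2, 3 and 4 bytes
# in UTF-8, written as escapes so the source file encoding does not matter.
samples = ['a', '\u00e9', '\u20ac', '\U0001F600', 'ciao']

for text in samples:
    # len(text) counts characters (code points); utf8len(text) counts bytes.
    print(repr(text), len(text), utf8len(text))
```

For plain ASCII text the two counts agree, which is why the difference is easy to miss until non-ASCII input shows up.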

The reason you got weird numbers is that sys.getsizeof measures the whole string object. Strings are actual objects in Python, so a bunch of other information is stored alongside the characters.

It's worth being precise about what that extra information is. The 'encode' method I call on the 's' object (which is a string) is not part of it: methods live on the str type and are shared by every string. What each string object does carry is bookkeeping such as a reference count, a pointer to its type, its length and a cached hash (the exact layout depends on the Python implementation and version). Hence the higher than expected byte count :).
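
To see the gap between the two measurements side by side, here is a small sketch. The variable names are mine, and the value reported by sys.getsizeof is deliberately not shown because it varies between Python implementations and versions.

```python
import sys

text = 'a'

# Memory footprint of the whole str object, including its per-object overhead.
object_size = sys.getsizeof(text)

# Number of bytes that would actually be sent if the string is encoded as UTF-8.
payload_size = len(text.encode('utf-8'))

print('object size :', object_size)
print('payload size:', payload_size)
print('overhead    :', object_size - payload_size)
```

Only the payload size matters for sending data over a network; the object size is about memory use inside the interpreter.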

Answer by sboby

There's a caveat to the accepted answer.

For some multi-byte encodings (e.g. utf-16), str.encode will add a Byte Order Mark (BOM) at the start, which is a sequence of special bytes that informs the reader of the byte order (endianness) used. So the length you get is actually len(BOM) + len(encoded_word).

If you don't want to count the BOM bytes, you can use either the little-endian version of the encoding (adding the suffix "-le") or the big-endian version (adding the suffix "-be").

>>> len('ciao'.encode('utf-16'))
10
>>> len('ciao'.encode('utf-16-le'))
8
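
The same effect can be checked programmatically. The sketch below uses the BOM constants from the standard codecs module; the variable names and the extra utf-32 and utf-8-sig cases are my additions for illustration.

```python
import codecs

word = 'ciao'

with_bom = word.encode('utf-16')        # native byte order, BOM prepended
little = word.encode('utf-16-le')       # explicit byte order, no BOM
big = word.encode('utf-16-be')          # explicit byte order, no BOM

# The plain 'utf-16' codec starts its output with the native-order BOM.
print(with_bom.startswith(codecs.BOM_UTF16))
print(len(with_bom), len(codecs.BOM_UTF16) + len(little))
print(len(little), len(big))

# Other codecs that prepend a BOM behave the same way.
print(len(word.encode('utf-32')), len(codecs.BOM_UTF32))
print(len(word.encode('utf-8-sig')), len(codecs.BOM_UTF8))
```

Whichever encoding is used, the number to compare against a network limit is the length of the bytes actually sent, so it helps to encode once and measure that exact result.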