Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/21018654/

Time: 2020-08-18 21:48:53  Source: igfitidea

Strings in a DataFrame, but dtype is object

Tags: python, pandas, numpy, types, series

Asked by Xiphias

Why does Pandas tell me that I have objects, although every item in the selected column is a string, even after explicit conversion?

This is my DataFrame:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 56992 entries, 0 to 56991
Data columns (total 7 columns):
id            56992  non-null values
attr1         56992  non-null values
attr2         56992  non-null values
attr3         56992  non-null values
attr4         56992  non-null values
attr5         56992  non-null values
attr6         56992  non-null values
dtypes: int64(2), object(5)

Five of them are dtype object. I explicitly convert those objects to strings:

for c in df.columns:
    # object columns may hold strings or any other Python objects
    if df[c].dtype == object:
        print("convert", df[c].name, "to string")
        df[c] = df[c].astype(str)

Then, df["attr2"] still has dtype object, although type(df["attr2"].iloc[0]) reveals str, which is correct.
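
The behaviour described in the question can be reproduced with a small frame. This is only a sketch: the column names and values are made up for illustration, and the dtype that gets printed depends on the pandas version (the releases discussed in this thread are expected to report object, while newer releases may report a dedicated string dtype).

```python
import pandas as pd

# Hypothetical stand-in for the asker's DataFrame (names and values are invented).
df = pd.DataFrame({"id": [1, 2, 3], "attr2": ["x", "yy", "zzz"]})

# Explicit conversion, as in the question.
df["attr2"] = df["attr2"].astype(str)

print(df["attr2"].dtype)           # column-level dtype reported by pandas
print(type(df["attr2"].iloc[0]))   # Python type of a single element
```

The second line printed is the type of one element, which is str regardless of what the column-level dtype says.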

Pandas distinguishes between int64, float64, and object. What is the logic behind there being no dtype str? Why is a str covered by object?

Accepted answer by HYRY

The dtype object comes from NumPy; it describes the type of the elements in an ndarray. Every element in an ndarray must have the same size in bytes. For int64 and float64, that is 8 bytes. The length of a string, however, is not fixed. So instead of saving the bytes of the strings in the ndarray directly, Pandas uses an object ndarray, which stores pointers to objects. That is why the dtype of this kind of ndarray is object.

Here is an example:

  • the int64 array contains 4 int64 values.
  • the object array contains 4 pointers to 3 string objects.

[Image: diagram of the int64 array and the object array]
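
The two arrays in the example can be approximated directly in NumPy. This is a minimal sketch with made-up values, not the exact data from the original figure; it builds an int64 array of 4 values and an object array whose 4 slots refer to only 3 distinct string objects.

```python
import numpy as np

# Four int64 values stored inline, 8 bytes each.
ints = np.array([10, 20, 30, 40], dtype=np.int64)
print(ints.dtype, ints.itemsize)   # int64 8

# Three string objects of different lengths (illustrative values).
first, second, third = "a", "a much longer string", "mid"

# Four slots, but the first and last slot refer to the same object.
objs = np.array([first, second, third, first], dtype=object)
print(objs.dtype)                  # object
print(objs.itemsize)               # size of one pointer, not of a string
print(objs[0] is objs[3])          # True: two pointers, one string object
```

Each slot of the object array has the same size no matter how long the string behind it is, which is what lets NumPy keep the fixed-size layout.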

Answer by The Red Pea

The accepted answer is good. Just wanted to provide an answer which referenced the documentation. The documentation says:

Pandas uses the object dtype for storing strings.

As the leading comment says, "Don't worry about it; it's supposed to be like this." (Although the accepted answer did a great job explaining the "why": strings are variable-length.)

But for strings, the length of the string is not fixed.

Answer by fuglede

As of version 1.0.0 (January 2020), pandas has introduced, as an experimental feature, first-class support for string types through pandas.StringDtype.

While you'll still be seeing object by default, the new type can be used by specifying a dtype of pd.StringDtype or simply 'string':

>>> pd.Series(['abc', None, 'def'])
0     abc
1    None
2     def
dtype: object
>>> pd.Series(['abc', None, 'def'], dtype=pd.StringDtype())
0     abc
1    <NA>
2     def
dtype: string
>>> pd.Series(['abc', None, 'def']).astype('string')
0     abc
1    <NA>
2     def
dtype: string

Answer by Ben

@HYRY's answer is great. I just want to provide a little more context.

Arrays store data as contiguous, fixed-size memory blocks. The combination of these properties is what makes arrays lightning fast for data access. For example, consider how your computer might store an array of 32-bit integers, [3,0,1].

[Image: memory layout of the 32-bit integer array [3,0,1]]

If you ask your computer to fetch the 3rd element in the array, it'll start at the beginning and then jump across 64 bits to get to the 3rd element. Knowing exactly how many bits to jump across is what makes arrays fast.
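
The jump described above is plain arithmetic on the element size. The sketch below assumes NumPy and uses int32 to match the 32-bit integers in the example; it computes the offset of the 3rd element and then reads that element straight out of the raw bytes.

```python
import numpy as np

arr = np.array([3, 0, 1], dtype=np.int32)

index = 2                                # the 3rd element (0-based index 2)
byte_offset = index * arr.itemsize       # 2 elements * 4 bytes each
print(byte_offset * 8)                   # 64 bits to jump across
print(arr.strides)                       # (4,) bytes from one element to the next

# Read one int32 directly at that offset in the underlying bytes.
value = np.frombuffer(arr.tobytes(), dtype=np.int32, count=1, offset=byte_offset)[0]
print(value)                             # 1
```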

Now consider the sequence of strings ['hello', 'i', 'am', 'a', 'banana']. Strings are objects that vary in size, so if you tried to store them in contiguous memory blocks, it'd end up looking like this.

[Image: the strings stored back to back in contiguous memory]

Now your computer doesn't have a fast way to access a randomly requested element. The key to overcoming this is to use pointers. Basically, store each string in some random memory location, and fill the array with the memory address of each string. (Memory addresses are just integers.) So now, things look like this:

[Image: array of pointers, each referring to a string stored elsewhere in memory]

Now, if you ask your computer to fetch the 3rd element, just as before, it can jump across 64 bits (assuming the memory addresses are 32-bit integers) and then make one extra step to go fetch the string.
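
For comparison, NumPy can also keep strings inline, but only by giving every element the same width, padded to the longest string. The sketch below contrasts that fixed-width layout with the pointer-based object layout, using the example words from above. It illustrates the trade-off and makes no claim about what pandas does internally beyond what the answers here state.

```python
import numpy as np

words = ['hello', 'i', 'am', 'a', 'banana']

# Fixed-width unicode: every slot is as wide as the longest word (6 characters).
fixed = np.array(words)
print(fixed.dtype)        # <U6 on little-endian machines
print(fixed.itemsize)     # 24 bytes per element (6 characters * 4 bytes)

# Object array: every slot is one pointer, whatever the string length.
boxed = np.array(words, dtype=object)
print(boxed.dtype)        # object
print(boxed.itemsize)     # size of one pointer
print(boxed[2])           # jump to the 3rd slot, then follow the pointer: am
```

With fixed-width storage, one very long string would widen every element, which is one reason the pointer-based layout is attractive for variable-length text.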

The challenge for NumPy is that there's no guarantee the pointers are actually pointing to strings. That's why it reports the dtype as 'object'.
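
Because the dtype alone cannot say what the pointers refer to, checking the contents means looking at the elements themselves. The sketch below shows two ways to do that. The helper name all_strings is hypothetical and introduced only for illustration, while pd.api.types.infer_dtype is an existing pandas function that inspects the values.

```python
import pandas as pd

# Both Series are forced to object dtype so the comparison is like for like.
only_strings = pd.Series(["hello", "i", "am"], dtype=object)
mixed = pd.Series(["hello", 42, None], dtype=object)

print(only_strings.dtype, mixed.dtype)   # object object


def all_strings(series):
    """Hypothetical helper: True if every element is a Python str."""
    return all(isinstance(value, str) for value in series)


print(all_strings(only_strings))         # True
print(all_strings(mixed))                # False

# infer_dtype looks at the values rather than the container dtype.
print(pd.api.types.infer_dtype(only_strings))   # string
print(pd.api.types.infer_dtype(mixed))
```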

Shamelessly gonna plug my own blog article where I originally discussed this.