pandas distinction between str and object types

Note: this page is a side-by-side translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/34881079/

Time: 2020-09-14 00:31:59  Source: igfitidea

Tags: python, numpy, pandas

Asked by Meitham

Numpy seems to make a distinction between str and object types. For instance, I can do ::

>>> import pandas as pd
>>> import numpy as np
>>> np.dtype(str)
dtype('S')
>>> np.dtype(object)
dtype('O')

Here, dtype('S') and dtype('O') correspond to str and object, respectively.

However, pandas seems to lack that distinction and coerces str to object. ::

>>> df = pd.DataFrame({'a': np.arange(5)})
>>> df.a.dtype
dtype('int64')
>>> df.a.astype(str).dtype
dtype('O')
>>> df.a.astype(object).dtype
dtype('O')

Forcing the type to dtype('S') does not help either. ::

>>> df.a.astype(np.dtype(str)).dtype
dtype('O')
>>> df.a.astype(np.dtype('S')).dtype
dtype('O')

Is there any explanation for this behavior?

Answered by Joe Kington

Numpy's string dtypes aren't python strings.

Therefore, pandas deliberately uses native python strings, which require an object dtype.

First off, let me demonstrate a bit of what I mean by numpy's strings being different:

In [1]: import numpy as np
In [2]: x = np.array(['Testing', 'a', 'string'], dtype='|S7')
In [3]: y = np.array(['Testing', 'a', 'string'], dtype=object)

Now, x is an array with a numpy string dtype (fixed-width, C-like strings) and y is an array of native python strings.
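
To make the difference concrete, here is a minimal sketch. It assumes Python 3 and a recent NumPy, where the |S dtype holds bytes rather than text, so the literals are written as bytes (unlike the session above). The names x and y mirror the ones used above; the other variable names are only illustrative.

```python
import numpy as np

# Fixed-width byte strings: every element occupies exactly 7 bytes.
x = np.array([b'Testing', b'a', b'string'], dtype='S7')
# Object array: every element is a reference to an ordinary Python str.
y = np.array(['Testing', 'a', 'string'], dtype=object)

x_elem_type = type(x[0])        # a NumPy scalar type (np.bytes_), not str
y_elem_type = type(y[0])        # the built-in str
x_itemsize = x.dtype.itemsize   # 7, the fixed width in bytes
print(x_elem_type, y_elem_type, x_itemsize)
```

Only the object array hands back real Python strings; the |S array stores raw bytes inline and wraps them in a NumPy scalar on access.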

If we try to go beyond 7 characters, we'll see an immediate difference. The string dtype versions will be truncated:

In [4]: x[1] = 'a really really really long'
In [5]: x
Out[5]:
array(['Testing', 'a reall', 'string'],
      dtype='|S7')

While the object dtype versions can be arbitrary length:

In [6]: y[1] = 'a really really really long'

In [7]: y
Out[7]: array(['Testing', 'a really really really long', 'string'], dtype=object)
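
The two assignments above can be reproduced side by side in a short script. This is a sketch assuming Python 3, where the text has to be encoded to bytes before it goes into the |S array; long_text, x_value and y_value are illustrative names.

```python
import numpy as np

x = np.array([b'Testing', b'a', b'string'], dtype='S7')
y = np.array(['Testing', 'a', 'string'], dtype=object)

long_text = 'a really really really long'
x[1] = long_text.encode('ascii')   # silently cut to the 7-byte width
y[1] = long_text                   # stored as-is, no length limit

x_value = x[1]   # b'a reall'
y_value = y[1]   # 'a really really really long'
print(x_value, y_value)
```

Note that the truncation raises no error or warning, which is one reason it can surprise users.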

Next, the |S dtype strings can't hold unicode properly, though there is a unicode fixed-length string dtype as well. I'll skip an example for the moment.
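
For readers who want to see this anyway, here is a small sketch, assuming Python 3 and a recent NumPy. It relies on NumPy encoding str to ASCII when filling an |S array; the variable names are illustrative.

```python
import numpy as np

# Storing non-ASCII text in an |S array fails, because the text
# cannot be encoded to ASCII bytes.
try:
    np.array(['café'], dtype='S5')
    s_dtype_failed = False
except UnicodeEncodeError:
    s_dtype_failed = True

# The fixed-width unicode dtype holds the same text, at 4 bytes per character.
u = np.array(['café'], dtype='U4')
u_itemsize = u.dtype.itemsize   # 16 bytes for 4 characters
u[0] = 'espresso'               # still fixed-width: cut to 4 characters
print(s_dtype_failed, u_itemsize, u[0])
```

So the unicode dtype fixes the encoding problem but keeps the fixed-width truncation behavior.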

Finally, numpy's strings are actually mutable, while Python strings are not. For example:

In [8]: z = x.view(np.uint8)
In [9]: z += 1
In [10]: x
Out[10]:
array(['Uftujoh', 'b!sfbmm', 'tusjoh\x01'],
      dtype='|S7')
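
The same experiment as a self-contained script, sketched for Python 3 (bytes literals, and x starts from the already truncated values). The contrast with a plain Python str is added for illustration; mutated and str_is_mutable are illustrative names.

```python
import numpy as np

x = np.array([b'Testing', b'a reall', b'string'], dtype='S7')

# View the same 21-byte buffer as unsigned integers and modify it in place.
z = x.view(np.uint8)
z += 1
mutated = x.tolist()   # every byte shifted by one, including the padding byte

# A Python str offers no such in-place modification.
s = 'Testing'
try:
    s[0] = 'U'
    str_is_mutable = True
except TypeError:
    str_is_mutable = False
print(mutated, str_is_mutable)
```

Because z is a view rather than a copy, the in-place addition changes the contents of x itself.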

For all of these reasons, pandas chose not to ever allow C-like, fixed-length strings as a datatype. As you noticed, attempting to coerce a python string into a fixed-width numpy string won't work in pandas. Instead, it always uses native python strings, which behave in a more intuitive way for most users.