Pandas distinction between str and object types

Question

Asked by Meitham

Numpy seems to make a distinction between str and object types. For instance I can do ::

>>> import pandas as pd
>>> import numpy as np
>>> np.dtype(str)
dtype('S')
>>> np.dtype(object)
dtype('O')

Where dtype('S') and dtype('O') correspond to str and object respectively.
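
A note for Python 3 readers: the session above appears to come from Python 2, where str is a byte string. The following is a minimal sketch, assuming Python 3 and a recent NumPy, of how the same names map to dtype kinds there; the variable names are only illustrative.

```python
import numpy as np

# On Python 3, str is a unicode type, so NumPy maps it to the 'U' kind;
# the byte-string 'S' kind corresponds to bytes instead.
str_kind = np.dtype(str).kind
bytes_kind = np.dtype(bytes).kind
object_kind = np.dtype(object).kind

print(str_kind, bytes_kind, object_kind)  # prints: U S O
```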

However, pandas seems to lack that distinction and coerces str to object. ::

>>> df = pd.DataFrame({'a': np.arange(5)})
>>> df.a.dtype
dtype('int64')
>>> df.a.astype(str).dtype
dtype('O')
>>> df.a.astype(object).dtype
dtype('O')

Forcing the type to dtype('S') does not help either. ::

>>> df.a.astype(np.dtype(str)).dtype
dtype('O')
>>> df.a.astype(np.dtype('S')).dtype
dtype('O')

Is there any explanation for this behavior?

Answer 1

Answered by Joe Kington

Numpy's string dtypes aren't Python strings.

Therefore, pandas deliberately uses native Python strings, which require an object dtype.
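
A minimal sketch of what this means in practice, assuming a reasonably recent pandas; the Series name `s` is only illustrative. With an explicit object dtype, each element is an ordinary Python str rather than a slot in a fixed-width buffer:

```python
import pandas as pd

# Ask for object dtype explicitly so the result does not depend on
# which string dtype a given pandas version infers by default.
s = pd.Series(['Testing', 'a', 'string'], dtype=object)

print(s.dtype)               # the column dtype is object
print(type(s.iloc[0]))       # each element is a plain Python str
print(s.str.len().tolist())  # element lengths are independent of each other
```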

First off, let me demonstrate a bit of what I mean by numpy's strings being different:

In [1]: import numpy as np
In [2]: x = np.array(['Testing', 'a', 'string'], dtype='|S7')
In [3]: y = np.array(['Testing', 'a', 'string'], dtype=object)

Now, x is a numpy string dtype (fixed-width, C-like string) and y is an array of native Python strings.

If we try to go beyond 7 characters, we'll see an immediate difference. The string dtype versions will be truncated:

In [4]: x[1] = 'a really really really long'
In [5]: x
Out[5]:
array(['Testing', 'a reall', 'string'],
      dtype='|S7')

While the object dtype versions can be arbitrary length:

In [6]: y[1] = 'a really really really long'

In [7]: y
Out[7]: array(['Testing', 'a really really really long', 'string'], dtype=object)
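
The same comparison can be sketched for Python 3, where the |S dtype holds bytes rather than str. This is an illustration under that assumption, not part of the original session; `long_text` is a name introduced here.

```python
import numpy as np

# Byte-string literals keep the example explicit on Python 3.
x = np.array([b'Testing', b'a', b'string'], dtype='S7')
y = np.array(['Testing', 'a', 'string'], dtype=object)

long_text = 'a really really really long'
x[1] = long_text.encode('ascii')  # silently cut down to the 7-byte slot
y[1] = long_text                  # stored as a reference to the full str

print(x.itemsize)  # bytes reserved for every element of x
print(x[1])        # the truncated value
print(len(y[1]))   # the object array keeps the whole string
```

The fixed itemsize is what forces the truncation: every element of x occupies the same number of bytes, while y only stores references to Python objects of any size.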

Next, the |S dtype strings can't hold unicode properly, though there is a unicode fixed-length string dtype as well. I'll skip an example for the moment.
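
For readers who want one, here is a hedged sketch of the unicode point, assuming Python 3 and a recent NumPy. Building an |S array from non-ASCII text is expected to fail with an encoding error, while the U dtype accepts the text but remains fixed-width. The names `s_failed` and `u` are introduced here for illustration.

```python
import numpy as np

# A byte-string ('S') array cannot take non-ASCII text directly.
try:
    np.array(['naïve'], dtype='S5')
    s_failed = False
except ValueError:  # UnicodeEncodeError is a subclass of ValueError
    s_failed = True

# The fixed-width unicode ('U') dtype accepts it, but still truncates.
u = np.array(['naïve', 'a'], dtype='U5')
u[1] = 'a really really really long'

print(s_failed)
print(u.tolist())
```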

Finally, numpy's strings are actually mutable, while Python strings are not. For example:

In [8]: z = x.view(np.uint8)
In [9]: z += 1
In [10]: x
Out[10]:
array(['Uftujoh', 'b!sfbmm', 'tusjoh\x01'],
      dtype='|S7')
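
The trailing \x01 in the output comes from the null byte that pads 'string' out to 7 bytes. A small sketch, assuming Python 3 (hence bytes literals) and using the illustrative names `raw` and `text`, makes the padding and the in-place change visible:

```python
import numpy as np

x = np.array([b'Testing', b'a', b'string'], dtype='S7')

# View the same memory as raw bytes: 3 elements x 7 bytes each.
raw = x.view(np.uint8).reshape(3, 7)
padding_before = int(raw[2, 6])  # 'string' has 6 bytes; the 7th is a null pad

raw += 1  # modifies x in place through the shared buffer
print(x)

# A Python str offers no such in-place modification.
text = 'string'
try:
    text[0] = 'S'
    str_is_mutable = True
except TypeError:
    str_is_mutable = False
print(str_is_mutable)
```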

For all of these reasons, pandas chose not to ever allow C-like, fixed-length strings as a datatype. As you noticed, attempting to coerce a Python string into a fixed-width numpy string won't work in pandas. Instead, it always uses native Python strings, which behave in a more intuitive way for most users.
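
As a rough check of this behavior, the sketch below hands pandas a fixed-width NumPy array and then assigns a longer value. It assumes a reasonably recent pandas; the exact dtype pandas reports for the column may differ between versions, so the example only checks that the fixed-width dtype is not kept. The names `fixed` and `kept_fixed_width` are illustrative.

```python
import numpy as np
import pandas as pd

fixed = np.array(['Testing', 'a', 'string'], dtype='U7')  # fixed-width input
s = pd.Series(fixed)

# pandas does not keep the fixed-width NumPy dtype for the column.
kept_fixed_width = bool(s.dtype == fixed.dtype)

s.iloc[1] = 'a really really really long'  # no truncation on assignment
print(s.dtype, kept_fixed_width)
print(s.iloc[1])
```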
