pandas distinction between str and object types
Original question: http://stackoverflow.com/questions/34881079/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Asked by Meitham
NumPy seems to make a distinction between str and object types. For instance, I can do::
>>> import pandas as pd
>>> import numpy as np
>>> np.dtype(str)
dtype('S')
>>> np.dtype(object)
dtype('O')
Here dtype('S') and dtype('O') correspond to str and object, respectively.
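A side note for readers on Python 3: the output above appears to come from Python 2, where str is a byte string. As a rough sketch (the variable names are mine, not from the question), the same check on Python 3 maps str to a unicode dtype, while bytes maps to the byte-string dtype:

```python
import numpy as np

# On Python 3, str is unicode text, so NumPy maps it to the 'U' kind.
str_kind = np.dtype(str).kind

# bytes is what maps to the fixed-width byte-string kind 'S'.
bytes_kind = np.dtype(bytes).kind

# object is the generic Python-object kind 'O'.
object_kind = np.dtype(object).kind

print(str_kind, bytes_kind, object_kind)
```

The str versus object distinction the question is about exists either way; only the letter used for str changes.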
However, pandas seems to lack that distinction and coerces str to object::
>>> df = pd.DataFrame({'a': np.arange(5)})
>>> df.a.dtype
dtype('int64')
>>> df.a.astype(str).dtype
dtype('O')
>>> df.a.astype(object).dtype
dtype('O')
Forcing the type to dtype('S') does not help either::
>>> df.a.astype(np.dtype(str)).dtype
dtype('O')
>>> df.a.astype(np.dtype('S')).dtype
dtype('O')
Is there any explanation for this behavior?
Answered by Joe Kington
NumPy's string dtypes aren't Python strings.
Therefore, pandas deliberately uses native Python strings, which require an object dtype.
First off, let me demonstrate a bit of what I mean by numpy's strings being different:
In [1]: import numpy as np
In [2]: x = np.array(['Testing', 'a', 'string'], dtype='|S7')
In [3]: y = np.array(['Testing', 'a', 'string'], dtype=object)
Now, x is an array with a NumPy string dtype (fixed-width, C-like strings) and y is an array of native Python strings.
If we try to go beyond 7 characters, we'll see an immediate difference. The string dtype versions will be truncated:
In [4]: x[1] = 'a really really really long'
In [5]: x
Out[5]:
array(['Testing', 'a reall', 'string'],
dtype='|S7')
While the object dtype versions can be arbitrary length:
In [6]: y[1] = 'a really really really long'
In [7]: y
Out[7]: array(['Testing', 'a really really really long', 'string'], dtype=object)
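To make the reason for the truncation a bit more concrete, here is a small sketch (written for Python 3, so the fixed-width values are bytes; the variable names are only illustrative) that looks at how much room each element of the fixed-width array actually has:

```python
import numpy as np

x = np.array([b'Testing', b'a', b'string'], dtype='S7')
y = np.array(['Testing', 'a', 'string'], dtype=object)

# Every element of x occupies exactly 7 bytes, however short the value is.
item_bytes = x.dtype.itemsize
total_bytes = x.nbytes

# Anything longer than the fixed width is silently cut off on assignment.
x[1] = b'a really really really long'
truncated = x[1]

# y stores references to ordinary Python str objects, so length is unrestricted.
y[1] = 'a really really really long'
untruncated = y[1]

print(item_bytes, total_bytes, truncated, untruncated)
```

The fixed slot size is what makes the array compact and C-like, and it is also why there is nowhere to put the extra characters.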
Next, the |S dtype strings can't hold unicode properly, though there is a fixed-length unicode string dtype as well. I'll skip an example for the moment.
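For anyone who wants to see the unicode point in action, a minimal sketch follows. It assumes Python 3 and is not part of the original answer; the sample words and variable names are made up for illustration.

```python
import numpy as np

# The byte-string dtype only accepts ASCII text when given Python 3 str values.
try:
    np.array(['café'], dtype='S5')
    rejected_non_ascii = False
except UnicodeError:
    rejected_non_ascii = True

# The fixed-length unicode dtype stores the same text without complaint.
u = np.array(['café', 'naïve'], dtype='U5')
first = u[0]

# It is still fixed-width, though: each element gets room for 5 characters
# and longer values are truncated just like with the |S dtype.
u[0] = 'a much longer string'
truncated = u[0]

print(rejected_non_ascii, first, truncated, u.dtype.itemsize)
```

So the unicode dtype fixes the encoding problem but keeps the fixed-width behavior.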
Finally, numpy's strings are actually mutable, while Python strings are not. For example:
In [8]: z = x.view(np.uint8)
In [9]: z += 1
In [10]: x
Out[10]:
array(['Uftujoh', 'b!sfbmm', 'tusjoh\x01'],
dtype='|S7')
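The same idea can be sketched on Python 3 next to a native string for contrast. This is only an illustration with made-up values, not code from the original answer:

```python
import numpy as np

x = np.array([b'abc', b'xyz'], dtype='S3')

# Reinterpret the same memory as raw bytes and overwrite the first one.
raw = x.view(np.uint8)
raw[0] = ord('A')
mutated = x[0]  # the string stored in x changed in place

# A native Python string refuses the equivalent operation.
s = 'abc'
try:
    s[0] = 'A'
    python_str_is_mutable = True
except TypeError:
    python_str_is_mutable = False

print(mutated, python_str_is_mutable)
```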
For all of these reasons, pandas chose not to ever allow C-like, fixed-length strings as a datatype. As you noticed, attempting to coerce a Python string into a fixed-width NumPy string won't work in pandas. Instead, it always uses native Python strings, which behave in a more intuitive way for most users.
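A quick way to check this from the pandas side is sketched below. The variable names are hypothetical, and the dtype is only printed rather than asserted, since newer pandas releases may report a dedicated string dtype instead of object:

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(3)).astype(str)

# Printed for inspection only; expected to be object on the pandas versions
# this answer describes (an assumption about your installed version).
print(s.dtype)

# Either way, the elements come back as native Python str objects...
all_python_str = all(isinstance(v, str) for v in s)

# ...and a long value is stored in full rather than truncated.
s.iloc[1] = 'a really really really long'
stored = s.iloc[1]

print(all_python_str, stored)
```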