pandas Numpy dtype - data type not understood

Notice: this page is a Chinese-English side-by-side translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverFlow. Original URL: http://stackoverflow.com/questions/46329365/

Time: 2020-09-14 04:29:52  Source: igfitidea

Numpy dtype - data type not understood

Tags: python, pandas, numpy

Asked by user1911092

I have a dataframe and I am looking at the data types associated with each column.

When I run:


In [23]: df.dtype.descr

Out [24]: [(u'date', '<i8'), (u'open', '<f8'), (u'high', '<f8'), (u'low', '<f8'), (u'close', '<f8'), (u'volume', '<f8'), (u'dividend', '<f8'), (u'adj_factor', '<f8'), (u'split_factor', '<f8'), (u'liq', '<f8'), (u'currency', '|O')]

I want to set the currency dtype to S7. I am doing:


In [25]: dtype_new[-1] = (u'currency', "|S7")
In [26]: print dtype_new
Out [27]: [(u'date', '<i8'), (u'open', '<f8'), (u'high', '<f8'), (u'low', '<f8'), (u'close', '<f8'), (u'volume', '<f8'), (u'dividend', '<f8'), (u'adj_factor', '<f8'), (u'split_factor', '<f8'), (u'liq', '<f8'), (u'currency', '|S7')]

It looks to be the correct format. So I try to put it back to my df:


In [28]: df = df.astype(np.dtype(dtype_new))

And I get the error:


TypeError('data type not understood',)

What should I change? This worked before I recently updated Anaconda, and I am not aware of what changed. Thanks.

ADJUSTMENT:


df.dtype is


In [23]: records.dtype
Out[23]: dtype((numpy.record, [(u'date', '<i8'), (u'open', '<f8'), (u'high', '<f8'), (u'low', '<f8'), (u'close', '<f8'), (u'volume', '<f8'), (u'dividend', '<f8'), (u'adj_factor', '<f8'), (u'split_factor', '<f8'), (u'liq', '<f8'), (u'currency', 'O')]))

How can I change the last dtype from 'O' to something else? Specifically, a string of fewer than 7 characters.

LASTLY - is this a unicode issue? With Unicode:


In [38]: np.dtype([(u'date', '<i8')])
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-38-8702f0c7681f> in <module>()
----> 1 np.dtype([(u'date', '<i8')])

TypeError: data type not understood

No Unicode:


In [39]: np.dtype([('date', '<i8')])
Out[39]: dtype([('date', '<i8')])

Accepted answer by fedepad

You seem to have hit the mark about unicode and, in fact, you have touched on a sore point.

Let's start from the latest numpy documentation.

The documentation on dtypes states that:

[(field_name, field_dtype, field_shape), ...]

obj should be a list of fields where each field is described by a tuple of length 2 or 3. (Equivalent to the descr item in the __array_interface__ attribute.)

The first element, field_name, is the field name (if this is '' then a standard field name, 'f#', is assigned). The field name may also be a 2-tuple of strings where the first string is either a "title" (which may be any string or unicode string) or meta-data for the field which can be any object, and the second string is the "name" which must be a valid Python identifier. The second element, field_dtype, can be anything that can be interpreted as a data-type. The optional third element field_shape contains the shape if this field represents an array of the data-type in the second element. Note that a 3-tuple with a third argument equal to 1 is equivalent to a 2-tuple. This style does not accept align in the dtype constructor as it is assumed that all of the memory is accounted for by the array interface description.


So the doc doesn't really specify whether the field name can be unicode. What we can be sure of from the doc is that if we define a tuple as the field name, e.g. ((u'date', 'date'), '<i8'), then using unicode as the "title" (notice, still not for the name!) leads to no errors.
Even in this tuple form, if the name itself is unicode, as in ((u'date', u'date'), '<i8'), you will get an error.
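As a minimal sketch of the (title, name) form quoted above: this assumes a current NumPy on Python 3, where every str is already unicode, so the Py2 error described here does not reproduce. The title "Trade date" is a made-up label for illustration only.

```python
import numpy as np

# A field described as ((title, name), dtype).
# "Trade date" is a hypothetical title; "date" is the actual field name.
dt = np.dtype([(("Trade date", "date"), "<i8")])

print(dt.names)           # only the name is listed as a field name
print(dt.fields["date"])  # the field entry also records the title
```

The title and the name are deliberately different here, so that the sketch does not depend on how NumPy treats a title that collides with a field name.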

Now, you can use unicode names in Py2 by calling encode("ascii") on them:

(u'date'.encode("ascii"))  

and this should work.
One big point is that in Py2, Numpy does not allow a dtype to be specified with unicode field names as a list of tuples, but it does allow it when a dictionary is used.
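The encode("ascii") trick can be wrapped so that the same code runs on both interpreters. The sketch below is only an illustration: `to_native_str` is a hypothetical helper, not part of NumPy, and it assumes the field names are plain ASCII.

```python
import sys
import numpy as np

def to_native_str(name):
    """Return name as the interpreter's native str type (hypothetical helper)."""
    if sys.version_info[0] == 2 and not isinstance(name, str):
        # Py2: turn a unicode name into an ASCII byte string, which is what
        # the list-of-tuples dtype specification accepts there.
        return name.encode("ascii")
    # Py3: str is already the accepted type, so nothing to do.
    return name

fields = [(u"date", "<i8"), (u"currency", "|S7")]
dt = np.dtype([(to_native_str(name), fmt) for name, fmt in fields])
print(dt)
```

On Python 3 the helper is a no-op; calling encode("ascii") unconditionally there would produce bytes names, which the list-of-tuples form does not accept.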

If I don't use unicode names in Py2, I can change the last field from |O to |S7; if you define the name as a unicode string, you have to use encode("ascii").
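One possible way to change the last field from |O to |S7 is sketched below. The two-field record array is invented to stand in for the real data, and the field-by-field copy is just one conservative option, not necessarily what the asker's code did before the upgrade.

```python
import numpy as np

# Hypothetical two-field record array standing in for the real data.
old_dtype = np.dtype([("date", "<i8"), ("currency", "O")])
records = np.array([(20170920, "USD"), (20170921, "EUR")], dtype=old_dtype)

# Rebuild the field list, swapping only the format of the last field.
new_descr = list(records.dtype.descr)
new_descr[-1] = (new_descr[-1][0], "|S7")
new_dtype = np.dtype(new_descr)

# Copy field by field so that each column is cast on its own.
converted = np.empty(records.shape, dtype=new_dtype)
for name in new_dtype.names:
    converted[name] = records[name]

print(converted.dtype)
```

Strings longer than 7 bytes would be truncated by the S7 field, so this only suits short codes such as currency symbols.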



And the bugs involved...


To understand why you see this behavior, it is useful to look at the bugs/issues reported in Numpy and Pandas and the related discussions.

Numpy
https://github.com/numpy/numpy/issues/2407
In the discussion (which I do not reproduce here), a few things stand out:

  • the "issue" has been going on for a while
  • one trick people used was to call encode("ascii") on the unicode string
  • remember that a 'whatever' string literal has different default types (bytes/unicode) in Py2/3
  • @hpaulj summed it up nicely in that issue report: "If the dtype specification is of the list of tuples type, it checks whether each name is a string (as defined by py2 or 3) But if the dtype specification is a dictionary {'names':[ alist], 'formats':[alist]...}, the py2 case also allows unicode names"
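The dictionary form mentioned in the last point can be sketched as follows. The field names and formats are taken from the question; per the quoted comment, this is the form that also tolerates unicode names on Py2, while on Python 3 both forms accept ordinary str names.

```python
import numpy as np

# Dictionary specification: parallel lists of names and formats.
dt = np.dtype({
    "names": [u"date", u"currency"],
    "formats": ["<i8", "|S7"],
})

print(dt.names)
print(dt.itemsize)  # 8 bytes for the int64 plus 7 bytes for the S7 field
```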

Pandas
On the pandas side, a related change has also been reported that ties in with the numpy issue: https://github.com/pandas-dev/pandas/pull/13462
It seems to have been fixed not that long ago.

Answer by Hagbard

I encountered this problem after upgrading numpy. Some previously working code suddenly stopped working after that. Reinstalling numpy solved the issue for me:


pip install --upgrade --force-reinstall numpy
