pandas: when to use category rather than object?

Question

Asked by user4640449

I have a CSV dataset with 40 features that I am handling with Pandas. 7 features are continuous (int32) and the rest of them are categorical.

My question is:

Should I use the Pandas dtype('category') for the categorical features, or can I keep the default dtype('object')?

Answer 1

Answered by chrisaycock

Use a category when there is lots of repetition that you expect to exploit.

For example, suppose I want the aggregate size per exchange for a large table of trades. Using the default object dtype is totally reasonable:

In [6]: %timeit trades.groupby('exch')['size'].sum()
1000 loops, best of 3: 1.25 ms per loop

But since the list of possible exchanges is pretty small, and because there is lots of repetition, I could make this faster by using a category:

In [7]: trades['exch'] = trades['exch'].astype('category')

In [8]: %timeit trades.groupby('exch')['size'].sum()
1000 loops, best of 3: 702 μs per loop
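
The comparison above can be reproduced with a rough, self-contained sketch. The `trades` table is not shown in the answer, so the one below is a hypothetical stand-in: the exchange names, the row count, and the variable names are all assumptions. Timings depend on the machine and the pandas version, so none are claimed here.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical stand-in for the `trades` table: many rows, only four exchanges.
rng = np.random.default_rng(0)
n = 200_000
trades = pd.DataFrame({
    "exch": rng.choice(["NYSE", "NASDAQ", "ARCA", "BATS"], size=n),
    "size": rng.integers(1, 1_000, size=n),
})

# Same data, but with the exchange column stored as a category.
cat_trades = trades.assign(exch=trades["exch"].astype("category"))

by_object = trades.groupby("exch")["size"].sum()
# observed=True restricts the result to categories that actually occur.
by_category = cat_trades.groupby("exch", observed=True)["size"].sum()

# Rough timing of each aggregation (results vary by machine and version).
t_object = timeit.timeit(
    lambda: trades.groupby("exch")["size"].sum(), number=10)
t_category = timeit.timeit(
    lambda: cat_trades.groupby("exch", observed=True)["size"].sum(), number=10)
print(f"default dtype: {t_object:.4f}s  category: {t_category:.4f}s")
```

Both aggregations return the same totals; only the storage of the grouping key differs. In IPython, `%timeit` as used in the answer gives steadier measurements than this one-off timing.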

Note that categories are really a form of dynamic enumeration. They are most useful if the range of possible values is fixed and finite.
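
A small sketch of what "dynamic enumeration" can look like in practice, using made-up exchange names: the distinct labels are stored once, each row holds an integer code, and a value outside the current set is rejected until the set is extended.

```python
import pandas as pd

# Hypothetical exchange labels, stored as a categorical Series.
exch = pd.Series(["NYSE", "ARCA", "NYSE", "BATS"], dtype="category")

# The distinct labels are stored once (sorted when inferred from the data)...
categories = list(exch.cat.categories)
# ...and each row holds a small integer code pointing at one of them.
codes = exch.cat.codes.tolist()

# Assigning a label that is not among the categories is rejected.
try:
    exch.iloc[0] = "IEX"
    rejected = False
except (TypeError, ValueError):  # the exception type has varied across pandas versions
    rejected = True

# The "dynamic" part: the set of labels can be extended explicitly.
exch = exch.cat.add_categories(["IEX"])
exch.iloc[0] = "IEX"
```

This is why the dtype fits columns whose set of values is known and small: new values have to be declared before they can be used.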

Answer 2

Answered by willk

The Pandas documentation has a concise section on when to use the categorical data type:

The categorical data type is useful in the following cases:
A string variable consisting of only a few different values. Converting such a string variable to a categorical variable will save some memory, see here.
The lexical order of a variable is not the same as the logical order (“one”, “two”, “three”). By converting to a categorical and specifying an order on the categories, sorting and min/max will use the logical order instead of the lexical order, see here.
As a signal to other Python libraries that this column should be treated as a categorical variable (e.g. to use suitable statistical methods or plot types).
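
The first two cases in the list above can be illustrated with a minimal sketch. The column contents and variable names are hypothetical, and the exact byte counts depend on the pandas version, so only the comparison is shown.

```python
import pandas as pd

# Hypothetical column: three distinct strings repeated many times.
sizes = pd.Series(["one", "two", "three"] * 100_000)

# Case 1: memory. deep=True counts the string payloads, not just pointers.
default_bytes = sizes.memory_usage(deep=True)
category_bytes = sizes.astype("category").memory_usage(deep=True)
print(f"default: {default_bytes:,} bytes  category: {category_bytes:,} bytes")

# Case 2: logical vs lexical order.
# As plain strings, comparisons are lexical, so the max is 'two'.
lexical_max = sizes.max()

# With an ordered categorical, min/max and sorting follow the declared order.
logical = pd.CategoricalDtype(["one", "two", "three"], ordered=True)
ordered = sizes.astype(logical)
logical_min = ordered.min()
logical_max = ordered.max()
sorted_labels = ordered.sort_values().drop_duplicates().tolist()
```

Declaring the order through `pd.CategoricalDtype(..., ordered=True)` is one way to do this; `Series.cat.set_categories` with `ordered=True` is another.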
