pandas: When to use Category rather than Object?
Source: http://stackoverflow.com/questions/30601830/ (Stack Overflow). Content is provided under the CC BY-SA 4.0 license: you are free to use and share it, but you must attribute it to the original authors.
Asked by user4640449
I have a CSV dataset with 40 features that I am handling with Pandas. 7 features are continuous (int32) and the rest of them are categorical.
My question is:
Should I use Pandas' dtype('category') for the categorical features, or can I leave the default dtype('object')?
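For readers who want to try the conversion being asked about, here is a minimal sketch. The CSV content and the column names (`age`, `income`, `city`, `plan`) are invented for illustration and are not from the asker's dataset; the only assumption carried over from the question is that a known list of continuous columns exists and everything else is categorical.

```python
import io
import pandas as pd

# Hypothetical miniature stand-in for the 40-feature CSV (column names are invented).
csv_text = """age,income,city,plan
34,52000,Paris,basic
45,61000,Lyon,premium
29,48000,Paris,basic
"""

continuous = ["age", "income"]  # assumed list of continuous columns

# By default, string columns are read with a generic (object/string) dtype.
df = pd.read_csv(io.StringIO(csv_text))

# Convert every non-continuous column to the category dtype.
categorical = [c for c in df.columns if c not in continuous]
df = df.astype({c: "category" for c in categorical})

print(df.dtypes)
```

The same mapping can also be passed to `read_csv` through its `dtype` parameter, which avoids materializing the string columns first.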
Answer by chrisaycock
Use a category when there is lots of repetition that you expect to exploit.
For example, suppose I want the aggregate size per exchange for a large table of trades. Using the default object is totally reasonable:
In [6]: %timeit trades.groupby('exch')['size'].sum()
1000 loops, best of 3: 1.25 ms per loop
But since the list of possible exchanges is pretty small, and because there is lots of repetition, I could make this faster by using a category:
In [7]: trades['exch'] = trades['exch'].astype('category')
In [8]: %timeit trades.groupby('exch')['size'].sum()
1000 loops, best of 3: 702 μs per loop
Note that categories are really a form of dynamic enumeration. They are most useful if the range of possible values is fixed and finite.
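The `trades` table in this answer is not shown, so the following sketch builds a synthetic one to make the comparison reproducible. The exchange names, row count, and random sizes are assumptions for illustration only. It checks that both dtypes give the same aggregate and compares memory use; timings depend on the machine and pandas version, so none are claimed here.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 100_000

# Synthetic stand-in for the answer's `trades` table (values are made up).
trades = pd.DataFrame({
    "exch": pd.Series(rng.choice(["NYSE", "NASDAQ", "ARCA", "BATS"], size=n), dtype=object),
    "size": rng.integers(1, 1000, size=n),
})

as_object = trades["exch"]
as_category = trades["exch"].astype("category")

# A categorical stores each distinct label once plus small integer codes per row.
print("object bytes:  ", as_object.memory_usage(deep=True))
print("category bytes:", as_category.memory_usage(deep=True))

# The aggregate is the same with either dtype.
by_object = trades.groupby("exch")["size"].sum()
by_category = (
    trades.assign(exch=as_category)
    .groupby("exch", observed=True)["size"]
    .sum()
)
print(by_category)
```

To compare speed on your own data, wrap each `groupby` call in `%timeit` as the answer does.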
Answer by willk
The Pandas documentation has a concise section on when to use the categorical data type:
The categorical data type is useful in the following cases:
- A string variable consisting of only a few different values. Converting such a string variable to a categorical variable will save some memory.
- The lexical order of a variable is not the same as the logical order (“one”, “two”, “three”). By converting to a categorical and specifying an order on the categories, sorting and min/max will use the logical order instead of the lexical order.
- As a signal to other Python libraries that this column should be treated as a categorical variable (e.g. to use suitable statistical methods or plot types).
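The second case above (logical order versus lexical order) can be sketched as follows. The sample values are taken from the quoted "one", "two", "three" example; the variable names are only illustrative.

```python
import pandas as pd

s = pd.Series(["two", "three", "one", "two"])

# Plain strings sort lexically: "three" comes before "two".
lexical = s.sort_values().tolist()
print(lexical)

# An ordered categorical carries the logical order explicitly.
order = pd.CategoricalDtype(categories=["one", "two", "three"], ordered=True)
s_cat = s.astype(order)

logical = s_cat.sort_values().tolist()
print(logical)

# min/max follow the declared category order as well.
print(s_cat.min(), s_cat.max())
```

Without `ordered=True`, the categories still save memory but `min` and `max` are not defined on the column.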

