pandas: When to use Category rather than Object?

Disclaimer: This page is a Chinese-English parallel translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original Stack Overflow URL: http://stackoverflow.com/questions/30601830/

Date: 2020-09-13 23:25:16  Source: igfitidea

When to use Category rather than Object?

Tags: python, csv, pandas, types, dataset

Asked by user4640449

I have a CSV dataset with 40 features that I am handling with Pandas. 7 features are continuous (int32) and the rest of them are categorical.

My question is:

Should I use Pandas' dtype('category') for the categorical features, or can I keep the default dtype('object')?

Answered by chrisaycock

Use a category when there is lots of repetition that you expect to exploit.

For example, suppose I want the aggregate size per exchange for a large table of trades. Using the default object dtype is totally reasonable:

In [6]: %timeit trades.groupby('exch')['size'].sum()
1000 loops, best of 3: 1.25 ms per loop

But since the list of possible exchanges is pretty small, and because there is lots of repetition, I could make this faster by using a category:

In [7]: trades['exch'] = trades['exch'].astype('category')

In [8]: %timeit trades.groupby('exch')['size'].sum()
1000 loops, best of 3: 702 μs per loop
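
The `trades` table itself is not shown in the answer. The following is a minimal sketch for reproducing the comparison, assuming a synthetic table with a handful of made-up exchange codes. The column names `exch` and `size` come from the snippet above; the data, the row count, and the helper `total_size_per_exchange` are hypothetical, and the timings will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical stand-in for the `trades` table: a few exchange codes, heavily repeated.
rng = np.random.default_rng(0)
exchanges = np.array(["NYSE", "NASDAQ", "ARCA", "BATS"], dtype=object)
n_rows = 200_000
trades = pd.DataFrame({
    # dtype=object is forced so the baseline really is an object column
    "exch": pd.Series(exchanges[rng.integers(0, len(exchanges), size=n_rows)], dtype=object),
    "size": rng.integers(1, 1_000, size=n_rows),
})

def total_size_per_exchange(df):
    # observed=True reports only the categories that actually occur in the data
    return df.groupby("exch", observed=True)["size"].sum()

as_object = trades.copy()
as_category = trades.assign(exch=trades["exch"].astype("category"))

runs = 20
t_object = timeit.timeit(lambda: total_size_per_exchange(as_object), number=runs)
t_category = timeit.timeit(lambda: total_size_per_exchange(as_category), number=runs)
print(f"object:   {t_object / runs * 1e3:.2f} ms per call")
print(f"category: {t_category / runs * 1e3:.2f} ms per call")

# Whatever the timings, both dtypes should give the same totals.
result_object = total_size_per_exchange(as_object)
result_category = total_size_per_exchange(as_category)
```

The conversion with `astype('category')` has its own one-time cost, so the gain mostly matters when the column is grouped or compared repeatedly.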


Note that categories are really a form of dynamic enumeration. They are most useful if the range of possible values is fixed and finite.
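
To make the "dynamic enumeration" idea concrete, here is a small sketch using made-up exchange codes (not data from the answer). It shows that a categorical column stores each label once plus an integer code per row, that a value outside the known set is rejected, and that the set can still be extended explicitly.

```python
import pandas as pd

# Illustrative values only; any small set of repeated labels would do.
exch = pd.Series(["NYSE", "NASDAQ", "NYSE", "ARCA", "NYSE"], dtype="category")

# The distinct labels are stored once (inferred categories are sorted)...
labels = list(exch.cat.categories)
# ...and each row holds only a small integer code that points at a label.
codes = exch.cat.codes.tolist()
print(labels, codes)

# Like an enum, a value outside the known set is rejected.
try:
    exch.iloc[0] = "BATS"
    rejected = False
except (TypeError, ValueError):  # the exception type has varied across pandas versions
    rejected = True

# "Dynamic": the set of allowed values can be extended explicitly.
exch = exch.cat.add_categories(["BATS"])
exch.iloc[0] = "BATS"
```

If new values show up all the time, this bookkeeping becomes a burden, which is why a fixed and finite set of values is the best fit.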

Answered by willk

The Pandas documentation has a concise section on when to use the categorical data type:

The categorical data type is useful in the following cases:

  • A string variable consisting of only a few different values. Converting such a string variable to a categorical variable will save some memory, see here.
  • The lexical order of a variable is not the same as the logical order (“one”, “two”, “three”). By converting to a categorical and specifying an order on the categories, sorting and min/max will use the logical order instead of the lexical order, see here.
  • As a signal to other Python libraries that this column should be treated as a categorical variable (e.g. to use suitable statistical methods or plot types).
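
The first two cases in the list above can be checked with a short sketch. The series below is made up for illustration, and the exact byte counts depend on the pandas version and platform, so only the direction of the difference should be relied on.

```python
import pandas as pd

# Case 1: a string variable with only a few distinct values, repeated many times.
s_object = pd.Series(["one", "two", "three"] * 10_000, dtype=object)
s_category = s_object.astype("category")

# deep=True also counts the Python string objects behind an object column.
bytes_object = s_object.memory_usage(deep=True)
bytes_category = s_category.memory_usage(deep=True)
print(f"object:   {bytes_object:,} bytes")
print(f"category: {bytes_category:,} bytes")

# Case 2: the logical order differs from the lexical order.
logical_order = pd.CategoricalDtype(categories=["one", "two", "three"], ordered=True)
s_ordered = s_object.astype(logical_order)

lexical = sorted(s_object.unique())                  # plain string sorting
logical = s_ordered.sort_values().unique().tolist()  # follows the declared order
smallest, largest = s_ordered.min(), s_ordered.max()
print(lexical, logical, smallest, largest)
```

Without `ordered=True`, `min` and `max` are not defined for a categorical, so the order has to be declared explicitly when it matters.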
