Categorical Variables in a Pandas DataFrame?

Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/23450735/

Time: 2020-09-13 22:00:00  Source: igfitidea

Tags: python, pandas, categorical-data

Asked by Anton

I am working my way through Wes's Python For Data Analysis, and I've run into a strange problem that is not addressed in the book.

In the code below, based on page 199 of his book, I create a dataframe and then use pd.cut() to create cat_obj. According to the book, cat_obj is

"a special Categorical object. You can treat it like an array of strings indicating the bin name; internally it contains a levels array indicating the distinct category names along with a labeling for the ages data in the labels attribute"

Awesome! However, if I use the exact same pd.cut() code (In [5] below) to create a new column of the dataframe (called df['cat']), that column is not treated as a special categorical variable but simply as a regular pandas Series.

How, then, do I create a column in a dataframe that is treated as a categorical variable?

In [4]:

import pandas as pd

raw_data = {'name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze', 'Jacon', 'Ryaner', 'Sone', 'Sloan', 'Piger', 'Riani', 'Ali'], 
        'score': [25, 94, 57, 62, 70, 25, 94, 57, 62, 70, 62, 70]}
df = pd.DataFrame(raw_data, columns = ['name', 'score'])

bins = [0, 25, 50, 75, 100]
group_names = ['Low', 'Okay', 'Good', 'Great']

In [5]:

cat_obj = pd.cut(df['score'], bins, labels=group_names)
df['cat'] = pd.cut(df['score'], bins, labels=group_names)

In [7]:

type(cat_obj)
Out[7]:
pandas.core.categorical.Categorical
In [8]:

type(df['cat'])
Out[8]:
pandas.core.series.Series

Answered by xrage

This might be happening because of how the setter behaves.

Sample getter and setter:

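class A:
    x = 1  # class-level default

    @property
    def p(self):
        # The getter always returns an int, whatever was stored
        return int(self.x)

    @p.setter
    def p(self, v):
        # The setter stores the value unchanged
        self.x = v

t = 1.32
obj = A()
obj.p = 1.32  # stored as a float by the setter

print(type(t))      # <class 'float'>
print(type(obj.p))  # <class 'int'>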

For now, df only accepts Series data, and its setter converts Categorical data into a Series. Categorical support in df is due in the next pandas release.

Answered by undershock

From http://pandas-docs.github.io/pandas-docs-travis/categorical.html, applicable from pandas 0.15 onwards:

Specify dtype="category" when constructing a Series:

In [1]: s = pd.Series(["a","b","c","a"], dtype="category")

In [2]: s
Out[2]: 
0    a
1    b
2    c
3    a
dtype: category
Categories (3, object): [a, b, c]
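
To see what the category dtype stores, the Series can be inspected through its .cat accessor. This is a minimal sketch, assuming pandas 0.15 or later; the variable names categories and codes are only illustrative.

```python
import pandas as pd

# Same Series as in the transcript above
s = pd.Series(["a", "b", "c", "a"], dtype="category")

# The distinct category names, in category order
categories = list(s.cat.categories)

# Integer codes that map each row to a position in the categories
codes = s.cat.codes.tolist()

print(categories)  # ['a', 'b', 'c']
print(codes)       # [0, 1, 2, 0]
```

Each row is stored as a small integer code pointing into the list of categories, which is the "levels plus labels" layout the book quote in the question describes.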

You can then add this to an existing series.

Or convert an existing Series or column to a category dtype:

In [3]: df = pd.DataFrame({"A":["a","b","c","a"]})

In [4]: df["B"] = df["A"].astype('category')

In [5]: df
Out[5]: 
   A  B
0  a  a
1  b  b
2  c  c
3  a  a
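
Applied to the data from the question, the same idea might look like the sketch below. It assumes pandas 0.15 or later, where pd.cut() on a Series returns a Series with category dtype that keeps that dtype when assigned as a column. The five-row frame is a shortened illustration, not the asker's full data.

```python
import pandas as pd

# Shortened version of the question's data (illustrative subset)
df = pd.DataFrame({
    "name": ["Miller", "Jacobson", "Ali", "Milner", "Cooze"],
    "score": [25, 94, 57, 62, 70],
})

bins = [0, 25, 50, 75, 100]
group_names = ["Low", "Okay", "Good", "Great"]

# Assign the binned result directly as a new column
df["cat"] = pd.cut(df["score"], bins, labels=group_names)

print(df["cat"].dtype)                  # category
print(list(df["cat"].cat.categories))   # ['Low', 'Okay', 'Good', 'Great']
print(df["cat"].cat.codes.tolist())     # [0, 3, 2, 2, 2]
```

Here type(df["cat"]) is still pandas Series, as in the question; the categorical nature shows up in the column's dtype and the .cat accessor rather than in the container type.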

Answered by jmxp

Right now, you can't have categorical data in a Series or DataFrame object, but this functionality will be implemented in pandas 0.15 (due in September).
