Disclaimer: This page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/29221894/

Time: 2020-09-13 23:05:27  Source: igfitidea

Pandas: get_dummies vs categorical

Tags: python, pandas, categorical-data, dummy-data

Asked by sapo_cosmico

I have a dataset which has a few columns with categorical data.

I've been using the Categorical function to replace categorical values with numerical ones.

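# Categorical.from_array was deprecated and later removed from pandas;
# calling the pd.Categorical constructor directly is the equivalent.
data[column] = pd.Categorical(data[column]).codes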

I recently ran across the pandas.get_dummies function. Are these interchangeable? Is there an advantage to using one over the other?

Answered by Alexander

Why are you converting the categorical data to integers? If the goal is to save memory, I don't believe you gain anything.

import pandas as pd

df = pd.DataFrame({'cat': pd.Categorical(['a', 'a', 'a', 'b', 'b', 'c'])})
df2 = pd.DataFrame({'cat': [1, 1, 1, 2, 2, 3]})

>>> df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6 entries, 0 to 5
Data columns (total 1 columns):
cat    6 non-null category
dtypes: category(1)
memory usage: 78.0 bytes

>>> df2.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6 entries, 0 to 5
Data columns (total 1 columns):
cat    6 non-null int64
dtypes: int64(1)
memory usage: 96.0 bytes
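
As a rough sketch of the same comparison on a larger column (the 6,000-row sample data and the variable names are made up for illustration), something like the following should show the category column taking fewer bytes than the same codes stored as int64:

```python
import pandas as pd

# Hypothetical sample: three distinct labels repeated many times.
labels = ['a', 'a', 'a', 'b', 'b', 'c'] * 1000

as_category = pd.Series(labels, dtype='category')
# The same information stored as plain 64-bit integers.
as_int64 = pd.Series(as_category.cat.codes, dtype='int64')

# index=False counts only the column data; deep=True also counts
# the label objects held in the categories.
category_bytes = as_category.memory_usage(index=False, deep=True)
int64_bytes = as_int64.memory_usage(index=False)

print('category:', category_bytes, 'bytes')
print('int64:   ', int64_bytes, 'bytes')
```

With only a few distinct labels, pandas can store the category codes in a small integer type, which is where the saving comes from.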

The categorical codes are just integer values for the unique items in the given category. By contrast, get_dummies returns a new column for each unique item. The value in each column indicates whether or not the record has that attribute.

>>> pd.get_dummies(df, dtype=int)
   cat_a  cat_b  cat_c
0      1      0      0
1      1      0      0
2      1      0      0
3      0      1      0
4      0      1      0
5      0      0      1
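
To see the two encodings side by side, here is a minimal sketch on the same six-row frame. The `prefix` and `dtype=int` arguments are my own choices to keep the output readable, not something the original answer used:

```python
import pandas as pd

df = pd.DataFrame({'cat': pd.Categorical(['a', 'a', 'a', 'b', 'b', 'c'])})

# One integer per row: a single column of category codes.
codes = df['cat'].cat.codes

# One indicator column per distinct label.
dummies = pd.get_dummies(df['cat'], prefix='cat', dtype=int)

# Every row has exactly one indicator set, so each row sums to 1.
row_sums = dummies.sum(axis=1)

print(codes.tolist())
print(dummies.columns.tolist())
print(row_sums.tolist())
```

So the two are not interchangeable: the codes give one column whose integers imply an ordering, while the dummies give several columns with no ordering between labels.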

To get the codes directly, you can use:

# 'cat' is the column name; the second .cat is the categorical accessor.
df['codes'] = df['cat'].cat.codes
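
If you later need to go from codes back to labels, a small sketch like this may help. The sample values (including the missing entry) and the `code_to_label` name are assumptions for illustration; pandas assigns the code -1 to missing values:

```python
import pandas as pd

# Hypothetical column with one missing value.
df = pd.DataFrame({'cat': pd.Categorical(['a', 'b', None, 'c'])})

# Integer code for each row; the missing entry gets -1.
df['codes'] = df['cat'].cat.codes

# Lookup table from code to label, built from the category index.
code_to_label = dict(enumerate(df['cat'].cat.categories))

print(df['codes'].tolist())
print(code_to_label)
```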