Python Pandas: convert categories to numbers

Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/38088652/

Time: 2020-08-19 20:20:13  Source: igfitidea

Pandas: convert categories to numbers

python pandas series categorical-data binning

Asked by sachinruk

Suppose I have a dataframe with countries that goes as:

cc | temp
US | 37.0
CA | 12.0
US | 35.0
AU | 20.0

I know that there is a pd.get_dummies function to convert the countries to 'one-hot encodings'. However, I wish to convert them to indices instead, so that I get cc_index = [1,2,1,3].

I'm assuming that there is a faster way than using the get_dummies along with a numpy where clause as shown below:

[np.where(x) for x in pd.get_dummies(df.cc).values]

This is somewhat easier to do in R using 'factors' so I'm hoping pandas has something similar.

Answered by John Zwinck

First, change the type of the column:

df.cc = pd.Categorical(df.cc)

Now the data look similar but are stored categorically. To capture the category codes:

df['code'] = df.cc.cat.codes

Now you have:

   cc  temp  code
0  US  37.0     2
1  CA  12.0     1
2  US  35.0     2
3  AU  20.0     0
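
As a minimal sketch of the two steps above on the question's sample data: the DataFrame construction and the `code_1based` column name are assumptions added here for illustration. Categories are sorted alphabetically, so AU, CA and US receive 0, 1 and 2; adding 1 gives 1-based codes.

```python
import pandas as pd

# Sample data from the question (reconstructed here for illustration)
df = pd.DataFrame({'cc': ['US', 'CA', 'US', 'AU'],
                   'temp': [37.0, 12.0, 35.0, 20.0]})

# Convert the column to a categorical dtype; categories are sorted: AU, CA, US
df['cc'] = pd.Categorical(df['cc'])

# Integer codes are positions in the sorted category list (0-based)
df['code'] = df['cc'].cat.codes

# Hypothetical extra column: shift by one to get 1-based codes
df['code_1based'] = df['code'] + 1

# The categories attribute maps a code back to its label
categories = list(df['cc'].cat.categories)
print(df)
print(categories)
```

Looking up `categories[code]` recovers the original label for any code.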

If you don't want to modify your DataFrame but simply get the codes:

df.cc.astype('category').cat.codes

Or use the categorical column as an index:

df2 = pd.DataFrame(df.temp)
df2.index = pd.CategoricalIndex(df.cc)
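
A short sketch of what the categorical index gives you, assuming the question's sample data (the variable names `index_codes` and `us_temps` are made up for this example): the index carries the same integer codes, and label-based selection still works.

```python
import pandas as pd

# Sample data from the question (reconstructed here for illustration)
df = pd.DataFrame({'cc': ['US', 'CA', 'US', 'AU'],
                   'temp': [37.0, 12.0, 35.0, 20.0]})

# Keep only the temperature column and index it by country category
df2 = pd.DataFrame(df['temp'])
df2.index = pd.CategoricalIndex(df['cc'])

# The index exposes the same integer codes as Series.cat.codes
index_codes = df2.index.codes.tolist()

# Label-based selection returns every row for that category
us_temps = df2.loc['US', 'temp'].tolist()
print(index_codes)
print(us_temps)
```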

Answered by jpp

If you wish only to transform your series into integer identifiers, you can use pd.factorize.

Note that this solution, unlike pd.Categorical, does not sort alphabetically: codes follow the order of first appearance, so the first country is assigned 0. If you wish to start from 1, you can add a constant:

df['code'] = pd.factorize(df['cc'])[0] + 1

print(df)

   cc  temp  code
0  US  37.0     1
1  CA  12.0     2
2  US  35.0     1
3  AU  20.0     3

If you wish to sort alphabetically, specify sort=True:

df['code'] = pd.factorize(df['cc'], sort=True)[0] + 1 
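
For comparison, here is a small sketch of both variants side by side, assuming the question's country column. `pd.factorize` returns a pair `(codes, uniques)`; the second element, which the snippets above discard with `[0]`, can be used to map codes back to labels. The variable names are illustrative only.

```python
import pandas as pd

# Country column from the question (reconstructed here for illustration)
df = pd.DataFrame({'cc': ['US', 'CA', 'US', 'AU']})

# Default: codes follow the order of first appearance
codes, uniques = pd.factorize(df['cc'])
first_seen = (codes + 1).tolist()

# sort=True: the unique labels are sorted before codes are assigned
sorted_codes, sorted_uniques = pd.factorize(df['cc'], sort=True)
alphabetical = (sorted_codes + 1).tolist()

# uniques maps the 0-based codes back to the original labels
restored = list(uniques.take(codes))
print(first_seen, alphabetical, restored)
```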

Answered by jpp

If you are using the sklearn library, you can use LabelEncoder. As with pd.Categorical, input strings are sorted alphabetically before encoding.

from sklearn.preprocessing import LabelEncoder

LE = LabelEncoder()
df['code'] = LE.fit_transform(df['cc'])

print(df)

   cc  temp  code
0  US  37.0     2
1  CA  12.0     1
2  US  35.0     2
3  AU  20.0     0

Answered by Palepalli Surendra Reddy

Try this to convert values to numbers based on frequency (the higher the frequency, the higher the number):

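# col is the name of the column to encode, e.g. 'cc'
# Labels ordered from least to most frequent
labels = df[col].value_counts(ascending=True).index.tolist()
codes = range(1, len(labels) + 1)
# Map each label to its rank instead of a chained inplace replace
df[col] = df[col].map(dict(zip(labels, codes)))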
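
A self-contained sketch of the same idea, using made-up data in which every country has a different count (the data and the `cc_freq_code` column name are assumptions for this example). When two labels have the same count, their relative order depends on how `value_counts` orders ties, so do not rely on it.

```python
import pandas as pd

# Made-up data with distinct counts: US x3, CA x2, AU x1
df = pd.DataFrame({'cc': ['US', 'CA', 'US', 'AU', 'US', 'CA']})

# Least frequent label first, most frequent last
counts = df['cc'].value_counts(ascending=True)

# Rank 1 for the rarest label, up to N for the most common
rank = {label: i for i, label in enumerate(counts.index, start=1)}

# Write the ranks to a new column so the original labels are kept
df['cc_freq_code'] = df['cc'].map(rank)
print(df)
```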

Answered by Denis Kalyan

This function converts any of the given columns to numbers. It does not create a new column; it replaces the existing values with numeric codes.

def characters_to_numb(*args):
    for arg in args:
        df[arg] = pd.Categorical(df[arg])
        df[arg] = df[arg].cat.codes
    return df
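
The function above relies on a global `df`. One possible variant, sketched here under the assumption that you would rather pass the DataFrame in and leave the original untouched, is shown below; the name `columns_to_codes` and the `city` column are hypothetical.

```python
import pandas as pd

def columns_to_codes(frame, *columns):
    """Return a copy of frame with the named columns replaced by category codes."""
    result = frame.copy()
    for column in columns:
        # Categories are sorted, so codes follow alphabetical order
        result[column] = pd.Categorical(result[column]).codes
    return result

# Made-up example data with two string columns
df = pd.DataFrame({'cc': ['US', 'CA', 'US', 'AU'],
                   'city': ['NYC', 'Toronto', 'LA', 'Sydney']})
encoded = columns_to_codes(df, 'cc', 'city')
print(encoded)
```

Because the function works on a copy, `df` keeps its original string values.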

Answered by Piotro

One-line code:

df[['cc']] = df[['cc']].apply(lambda col:pd.Categorical(col).codes)

This also works if you have a list_of_columns:

df[list_of_columns] = df[list_of_columns].apply(lambda col:pd.Categorical(col).codes)

Furthermore, if you want to keep your NaN values, you can apply a replace:

df[['cc']] = df[['cc']].apply(lambda col:pd.Categorical(col).codes).replace(-1,np.nan)
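
To illustrate why the replace is needed, here is a minimal sketch with made-up data containing one missing value. `pd.Categorical` encodes missing entries as -1; this example casts the codes to float before replacing -1 with NaN, which is an extra step added here so the result dtype is explicit.

```python
import numpy as np
import pandas as pd

# Made-up data with one missing country
df = pd.DataFrame({'cc': ['US', None, 'CA', 'US']})

# Missing values get the sentinel code -1
raw = pd.Categorical(df['cc']).codes

# Cast to float so the column can hold NaN, then restore the missing value
codes = pd.Series(raw, index=df.index).astype(float).replace(-1, np.nan)
print(raw.tolist())
print(codes)
```

Note that the resulting column is floating point, since NaN cannot be stored in a plain integer column.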