Python Pandas: Converting Categories to Numbers

Question

Asked by sachinruk

Suppose I have a DataFrame of countries that looks like this:

cc | temp
US | 37.0
CA | 12.0
US | 35.0
AU | 20.0

I know that the pd.get_dummies function converts the countries to one-hot encodings. However, I want to convert them to integer indices instead, so that I get cc_index = [1, 2, 1, 3].

I assume there is a faster way than combining get_dummies with a NumPy where clause, as shown below:

[np.where(x) for x in pd.get_dummies(df.cc).values]

This is somewhat easier in R using factors, so I'm hoping pandas has something similar.

Answer 1

Answered by John Zwinck

First, change the column's dtype to categorical:

df.cc = pd.Categorical(df.cc)

Now the data look the same but are stored as a categorical. To capture the category codes:

df['code'] = df.cc.cat.codes

Now you have:

   cc  temp  code
0  US  37.0     2
1  CA  12.0     1
2  US  35.0     2
3  AU  20.0     0
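
As a minimal sketch of the steps above, the following rebuilds the example DataFrame from the question and also shows how the codes relate to the sorted category labels. The code_from_1 column and the decoded list are illustrative additions, not part of the original answer.

```python
import pandas as pd

# Example data taken from the question.
df = pd.DataFrame({"cc": ["US", "CA", "US", "AU"],
                   "temp": [37.0, 12.0, 35.0, 20.0]})

# Store the column as a categorical, then read its integer codes.
df["cc"] = pd.Categorical(df["cc"])
df["code"] = df["cc"].cat.codes

# Codes are 0-based positions into the sorted categories;
# add 1 if 1-based indices are wanted (hypothetical column name).
df["code_from_1"] = df["code"] + 1

# The categories attribute maps each code back to its label.
categories = list(df["cc"].cat.categories)
decoded = [categories[i] for i in df["code"]]
```

Because the categories are sorted alphabetically, AU maps to 0, CA to 1, and US to 2, which matches the table above.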

If you don't want to modify your DataFrame and simply want the codes:

df.cc.astype('category').cat.codes

Or use the categorical column as an index:

df2 = pd.DataFrame(df.temp)
df2.index = pd.CategoricalIndex(df.cc)
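
A short sketch of what the categorical index gives you, assuming the example data from the question: the index exposes the same integer codes, and rows can still be selected by label. The variable names index_codes and us_rows are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"cc": ["US", "CA", "US", "AU"],
                   "temp": [37.0, 12.0, 35.0, 20.0]})

# Keep only the temperature column and index it by country.
df2 = pd.DataFrame(df["temp"])
df2.index = pd.CategoricalIndex(df["cc"])

# The CategoricalIndex carries the integer codes directly.
index_codes = df2.index.codes.tolist()

# Label-based selection returns every row for that country.
us_rows = df2.loc["US"]
```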

Answer 2

Answered by jpp

If you only want to transform your Series into integer identifiers, you can use pd.factorize.

Note that this solution, unlike pd.Categorical, does not sort alphabetically: codes are assigned in order of first appearance, so the first country is assigned 0. If you want to start from 1, add a constant:

df['code'] = pd.factorize(df['cc'])[0] + 1

print(df)

   cc  temp  code
0  US  37.0     1
1  CA  12.0     2
2  US  35.0     1
3  AU  20.0     3

If you want the codes to follow alphabetical order, specify sort=True:

df['code'] = pd.factorize(df['cc'], sort=True)[0] + 1
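
To make the difference concrete, here is a small sketch comparing the two modes on the question's data. pd.factorize returns a pair (codes, uniques); the second element, which the snippets above discard with [0], lets you map codes back to labels. The variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"cc": ["US", "CA", "US", "AU"],
                   "temp": [37.0, 12.0, 35.0, 20.0]})

# Default: codes follow order of first appearance.
codes, uniques = pd.factorize(df["cc"])

# sort=True: codes follow the sorted order of the unique labels.
sorted_codes, sorted_uniques = pd.factorize(df["cc"], sort=True)

# uniques maps codes back to the original labels.
roundtrip = list(uniques.take(codes))
```

With the default, US is seen first and gets 0; with sort=True, AU gets 0 because it sorts first.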

Answer 3

Answered by jpp

If you are using the sklearn library, you can use LabelEncoder. Like pd.Categorical, it sorts the input strings alphabetically before encoding.

from sklearn.preprocessing import LabelEncoder

LE = LabelEncoder()
df['code'] = LE.fit_transform(df['cc'])

print(df)

   cc  temp  code
0  US  37.0     2
1  CA  12.0     1
2  US  35.0     2
3  AU  20.0     0
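
One practical reason to prefer LabelEncoder is that the fitted encoder remembers the mapping. The sketch below, which assumes scikit-learn is installed, shows the learned classes and the round trip back to labels; the variable names are illustrative.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"cc": ["US", "CA", "US", "AU"],
                   "temp": [37.0, 12.0, 35.0, 20.0]})

le = LabelEncoder()
codes = le.fit_transform(df["cc"])

# classes_ holds the sorted labels; position i corresponds to code i.
classes = list(le.classes_)

# inverse_transform recovers the original labels from the codes.
decoded = list(le.inverse_transform(codes))
```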

Answer 4

Answered by Palepalli Surendra Reddy

Try this to convert to numbers based on frequency (the higher the frequency, the higher the number):

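# col is the name of the column to encode, e.g. col = 'cc'
labels = df[col].value_counts(ascending=True).index.tolist()
codes = range(1, len(labels) + 1)
# map each label to its frequency rank; assign back rather than using inplace=True on a column selection
df[col] = df[col].map(dict(zip(labels, codes)))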
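
A self-contained sketch of the frequency-based idea follows. It uses a slightly longer, made-up country list so that every label has a distinct count (with ties, the relative order of the tied labels is not something to rely on), and it builds an explicit dictionary; the names rank and freq_code are illustrative.

```python
import pandas as pd

# Hypothetical data with distinct frequencies: US x3, CA x2, AU x1.
df = pd.DataFrame({"cc": ["US", "CA", "US", "AU", "US", "CA"]})

# Labels ordered from least to most frequent.
labels = df["cc"].value_counts(ascending=True).index.tolist()

# Rank 1 goes to the rarest label, the highest rank to the most common.
rank = {label: i for i, label in enumerate(labels, start=1)}

df["freq_code"] = df["cc"].map(rank)
```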

Answer 5

Answered by Denis Kalyan

This converts any number of columns to numbers. It does not create new columns; it replaces the values in the existing columns with numeric codes.

def characters_to_numb(*args):
    for arg in args:
        df[arg] = pd.Categorical(df[arg])
        df[arg] = df[arg].cat.codes
    return df
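
The function above relies on a global df. A possible variant, sketched here under the assumption that you would rather pass the DataFrame in and leave the original untouched, is shown below; columns_to_codes and the hemisphere column are hypothetical names introduced for illustration.

```python
import pandas as pd


def columns_to_codes(frame, *columns):
    """Return a copy of frame with the named columns replaced by category codes."""
    out = frame.copy()
    for column in columns:
        out[column] = pd.Categorical(out[column]).codes
    return out


df = pd.DataFrame({"cc": ["US", "CA", "US", "AU"],
                   "hemisphere": ["N", "N", "N", "S"],
                   "temp": [37.0, 12.0, 35.0, 20.0]})

encoded = columns_to_codes(df, "cc", "hemisphere")
```

Working on a copy keeps the original labels available, which is handy if you later need to translate codes back.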

Answer 6

Answered by Piotro

A one-liner:

df[['cc']] = df[['cc']].apply(lambda col:pd.Categorical(col).codes)

This also works if you have a list_of_columns:

df[list_of_columns] = df[list_of_columns].apply(lambda col:pd.Categorical(col).codes)

Furthermore, if you want to keep your NaN values (missing values are encoded as -1), you can apply a replace:

df[['cc']] = df[['cc']].apply(lambda col:pd.Categorical(col).codes).replace(-1,np.nan)
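
To see why the replace is needed, here is a small sketch on a made-up Series with one missing entry. It uses where instead of replace, which has the same effect here; the variable names are illustrative.

```python
import pandas as pd

# Hypothetical data with one missing country.
s = pd.Series(["US", None, "CA", "US"])

# Missing values receive the sentinel code -1.
codes = pd.Series(pd.Categorical(s).codes, index=s.index)

# Keep codes where they are valid; everything else becomes NaN.
# The result is a float column, because NaN cannot live in an integer dtype.
with_nan = codes.where(codes != -1)
```

Leaving the -1 in place can silently distort downstream statistics or models, so converting it back to NaN is usually the safer choice.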
