pandas: Convert categorical variables from String to int representation

Disclaimer: this page is a Chinese-English bilingual translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverFlow. Original URL: http://stackoverflow.com/questions/41078003/

Time: 2020-09-14 02:35:52  Source: igfitidea

Convert categorical variables from String to int representation

pandas, numpy, scikit-learn

Asked by Abhi

I have a numpy array of text classifications in the form of a string array, i.e. y_train = ['A', 'B', 'A', 'C', ...]. I am trying to apply sklearn's multinomial Naive Bayes algorithm to predict classes for the entire dataset.

I want to convert the string classes into integers so they can be fed into the algorithm, i.e. convert ['A', 'B', 'A', 'C', ...] into ['1', '2', '1', '3', ...].

I can write a for loop to go through the array and create a new one with integer class labels, but is there a direct function to achieve this?

Accepted answer by Ted Petrou

If you are using sklearn, I would suggest sticking with the methods in that library that do these things for you. Sklearn has a number of ways of preprocessing data, such as encoding labels. One of them is the sklearn.preprocessing.LabelEncoder class.

from sklearn.preprocessing import LabelEncoder  

le = LabelEncoder()
le.fit_transform(y_train)

This outputs:

array([0, 1, 0, 2])

Use le.inverse_transform([0, 1, 2]) to map the integer codes back to the original labels.
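
For reference, here is a minimal round-trip sketch of the approach above. It assumes y_train is the four-element array implied by the output shown in this answer; the variable names codes and restored are illustrative and not part of the original post.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Assumed sample data, matching the output shown in the answer
y_train = np.array(['A', 'B', 'A', 'C'])

le = LabelEncoder()

# Learn the label set and encode in one step; codes start at 0
codes = le.fit_transform(y_train)
print(codes)        # [0 1 0 2]

# classes_ holds the sorted unique labels; position i corresponds to code i
print(le.classes_)  # ['A' 'B' 'C']

# Map integer codes back to the original string labels
restored = le.inverse_transform(codes)
print(restored)     # ['A' 'B' 'A' 'C']
```

Keeping the fitted encoder around is what makes it possible to turn the classifier's integer predictions back into the original class names later.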

Answer by MaxU

Try the pd.factorize method:

In [264]: y_train = pd.Series(['A', 'B', 'A', 'C'])

In [265]: y_train
Out[265]:
0    A
1    B
2    A
3    C
dtype: object

In [266]: pd.factorize(y_train)
Out[266]: (array([0, 1, 0, 2], dtype=int64), Index(['A', 'B', 'C'], dtype='object'))

Demo (adding 1 so that the codes start at 1, as in the question):

In [271]: fct = pd.factorize(y_train)[0]+1

In [272]: fct
Out[272]: array([1, 2, 1, 3], dtype=int64)
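
One detail that may matter when comparing the two answers: pd.factorize numbers labels in order of first appearance by default, while LabelEncoder sorts them. The sketch below uses a deliberately unsorted sample (an assumption, not data from the original post) to show the difference, the sort=True option, and how to map codes back using the returned uniques.

```python
import pandas as pd

# Hypothetical sample in which the first label seen is not the smallest
y_train = pd.Series(['B', 'A', 'B', 'C'])

# Default: codes follow order of first appearance
codes, uniques = pd.factorize(y_train)
print(codes)           # [0 1 0 2]
print(list(uniques))   # ['B', 'A', 'C']

# sort=True: codes follow the sorted order of the labels
sorted_codes, sorted_uniques = pd.factorize(y_train, sort=True)
print(sorted_codes)          # [1 0 1 2]
print(list(sorted_uniques))  # ['A', 'B', 'C']

# Shift to 1-based codes, as in the demo above
one_based = codes + 1
print(one_based)       # [1 2 1 3]

# Map 0-based codes back to labels by indexing into uniques
restored = uniques.take(codes)
print(list(restored))  # ['B', 'A', 'B', 'C']
```

Note that the mapping back needs the 0-based codes, so subtract 1 again first if the shifted version is what was stored.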