Notice: this page is a Chinese-English parallel translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license, cite the original URL and author information, and attribute it to the original authors (not me). Original Stack Overflow question: http://stackoverflow.com/questions/29528628/

Time: 2020-09-13 23:10:12  Source: igfitidea

How to specify a variable in pandas as ordinal/categorical?

python pandas scikit-learn categorical-data

Asked by Baktaawar

I am trying to run some machine learning algorithms on a dataset using scikit-learn. My dataset has some features that are categories. For example, one feature is A, which takes the values 1, 2, 3 specifying the quality of something (1: Upper, 2: Second, 3: Third class), so it's an ordinal variable.

Similarly, I re-coded a variable City, which has three values ('London', 'Zurich', 'New York'), into 1, 2, 3, but with no specific preference among the values. So this is a nominal categorical variable.

How do I tell the algorithm to treat these as categorical, ordinal, etc. in pandas? In R, a categorical variable is specified by factor(a) and hence is not treated as a continuous value. Is there anything like that in pandas/Python?

Answered by benjaminmgross

... years later (and because I think a good explanation of these issues is required not only for this question but to help remind myself in the future)

Ordinal vs. Nominal

In general, one would translate categorical variables into dummy variables (or a host of other methodologies) because they were nominal, i.e. they had no sense of a > b > c. In the OP's original question, this would only be performed on the cities, like London, Zurich, New York.
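
The nominal/ordinal distinction can also be declared directly on a pandas column, which is roughly what R's factor does. The following is a minimal sketch, assuming hypothetical sample rows and assuming that the quality levels rank as Third < Second < Upper; neither assumption comes from the original answer.

```python
import pandas as pd

# Hypothetical sample data modelled on the question's two features.
df = pd.DataFrame({
    "quality": ["Third", "Upper", "Second", "Upper"],
    "city": ["London", "Zurich", "New York", "London"],
})

# Nominal: the categories carry no order.
df["city"] = pd.Categorical(df["city"])

# Ordinal: list the categories from lowest to highest and set ordered=True.
# The ranking used here is an assumption made for illustration.
df["quality"] = pd.Categorical(
    df["quality"], categories=["Third", "Second", "Upper"], ordered=True
)

# Integer codes follow the declared category order.
quality_codes = df["quality"].cat.codes.tolist()
print(quality_codes)        # [0, 2, 1, 2]
print(df["quality"].min())  # Third
```

An ordered categorical supports comparisons such as min, max and sorting by rank, while an unordered one does not.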

Dummy Variables for Nominal

For this type of issue, pandas provides by far the easiest transformation: pandas.get_dummies. So:

import numpy
import pandas

# create a random sample of the OP's unique values
series = pandas.Series(
    numpy.random.randint(low=0, high=3, size=100))
mapper = {0: 'New York', 1: 'London', 2: 'Zurich'}
nomvar = series.replace(mapper)

# now let's use pandas.get_dummies
# (dtype=int gives 0/1 columns; pandas 2.0 and later default to booleans)
print(pandas.get_dummies(nomvar, dtype=int))

Out[57]:
    London  New York  Zurich
0        0         0       1
1        0         1       0
2        0         1       0
3        1         0       0
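
The sample above is random, so its rows change on every run. A small fixed sample makes the result reproducible; this is only a sketch, and the four city values are hypothetical, chosen to match the rows printed above.

```python
import pandas as pd

# Small, fixed sample (hypothetical values) so the result is reproducible.
cities = pd.Series(["Zurich", "New York", "New York", "London"], name="city")

# dtype=int requests 0/1 columns; recent pandas versions default to booleans.
dummies = pd.get_dummies(cities, dtype=int)
print(dummies)
```

The dummy columns come out in sorted label order (London, New York, Zurich), not in order of appearance.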

Ordinal Encoding for Categorical Variables

However in the case of ordinal variables, the user must be cautious in using pandas.factorize. The reason is that the engineer wants to preserve the relationship in the mapping such that a > b > c.

So if I want to take a set of categorical variables where large > medium > small, and preserve that, I need to make sure that pandas.factorize preserves that relationship.

# leveraging the variables already created above
mapper = {0: 'small', 1: 'medium', 2: 'large'}
ordvar = series.replace(mapper)

print(pandas.factorize(ordvar))

Out[58]:
(array([0, 1, 1, 2, 1,...  0, 0]),
Index(['large', 'small', 'medium'], dtype='object'))

In fact, the relationship that needs to be preserved in order to maintain the concept of ordinal has been lost using pandas.factorize: it numbers the labels in order of first appearance (here 'large' is 0, 'small' is 1, 'medium' is 2), not by rank. In an instance like this, I use my own mappings to ensure that the ordinal attributes are preserved.
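
The loss of order can be checked on a small fixed series. This is a sketch with hypothetical labels (low, medium, high) chosen so that neither numbering matches the intended rank low < medium < high.

```python
import pandas as pd

# Hypothetical ordinal labels; the intended rank is low < medium < high.
levels = pd.Series(["medium", "low", "high", "low"])

# Default: codes follow the order of first appearance.
codes, uniques = pd.factorize(levels)
print(codes.tolist(), list(uniques))  # [0, 1, 2, 1] ['medium', 'low', 'high']

# sort=True: codes follow the sorted (here alphabetical) order of the labels.
sorted_codes, sorted_uniques = pd.factorize(levels, sort=True)
print(sorted_codes.tolist(), list(sorted_uniques))  # [2, 1, 0, 1] ['high', 'low', 'medium']
```

Neither call knows the intended ranking, so any agreement with it would be a coincidence of the data or the spelling of the labels.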

# explicit mapping that keeps small < medium < large
preserved_mapper = {'large': 2, 'medium': 1, 'small': 0}
print(ordvar.replace(preserved_mapper))

Out[78]:
0     2
1     0
...
99    2
dtype: int64

In fact, creating your own dict to map the values not only preserves your desired ordinal relationship but also helps keep the contents and mappings of your prediction algorithm organized: you have not lost any ordinal information in the process, and you have a stored record of what each mapping for each variable is.
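
One possible way to keep such mappings organized is a single registry of dicts, one per ordinal column. This is a sketch under stated assumptions: the name ORDINAL_MAPPINGS, the column name size and the sample rows are all hypothetical, and Series.map is used here in place of replace.

```python
import pandas as pd

# Hypothetical registry: one explicit mapping per ordinal column.
ORDINAL_MAPPINGS = {
    "size": {"small": 0, "medium": 1, "large": 2},
}

df = pd.DataFrame({"size": ["large", "small", "medium", "large"]})

encoded = df.copy()
for column, mapping in ORDINAL_MAPPINGS.items():
    # Series.map returns NaN for labels missing from the mapping,
    # which makes unexpected values easy to spot.
    encoded[column] = df[column].map(mapping)

# Invert a mapping to decode integer codes back to labels.
inverse = {code: label for label, code in ORDINAL_MAPPINGS["size"].items()}
decoded = encoded["size"].map(inverse)

print(encoded["size"].tolist())  # [2, 0, 1, 2]
print(decoded.tolist())          # ['large', 'small', 'medium', 'large']
```

Keeping the dicts in one place means the same record can be reused to decode predictions or to encode new data consistently.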

ints into sklearn

Lastly, the OP spoke about passing the information into scikit-learn classifiers, which means that ints are required. For that case, if you have any NaNs in your data, make sure you're aware of the astype(int) gotcha (linked from the original answer).
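
The gotcha can be reproduced in a few lines. The following is a minimal sketch with hypothetical values; the two workarounds shown are common options, not necessarily the ones described in the linked post.

```python
import numpy as np
import pandas as pd

# Hypothetical encoded column with one missing value.
codes = pd.Series([2.0, np.nan, 0.0])

try:
    codes.astype(int)
    cast_failed = False
except ValueError:
    # NaN has no integer representation, so the cast is refused.
    cast_failed = True

# Option 1: drop (or impute) the missing rows before casting.
clean = codes.dropna().astype(int)

# Option 2: pandas' nullable integer dtype keeps the gap as <NA>.
nullable = codes.astype("Int64")

print(cast_failed)     # True
print(clean.tolist())  # [2, 0]
```

Many scikit-learn estimators reject missing values, so dropping or imputing them (option 1) is often needed before fitting anyway.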

Answered by dukebody

You should use the OneHotEncoder transformer with the categorical variables, and leave the ordinal variable untouched:

>>> import pandas as pd
>>> from sklearn.preprocessing import OneHotEncoder
>>> df = pd.DataFrame({'quality': [1, 2, 3], 'city': [3, 2, 1]}, columns=['quality', 'city'])
>>> # categorical_features is the pre-0.22 scikit-learn API; later releases removed it
>>> enc = OneHotEncoder(categorical_features=[False, True])
>>> X = df.values
>>> enc.fit(X)
>>> enc.transform(X).todense()
matrix([[ 0.,  0.,  1.,  1.],
        [ 0.,  1.,  0.,  2.],
        [ 1.,  0.,  0.,  3.]])
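
The categorical_features argument is no longer available in current scikit-learn releases. A rough equivalent, offered as a sketch and not as the answer author's method, selects the nominal column with ColumnTransformer and passes the ordinal column through unchanged.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({"quality": [1, 2, 3], "city": [3, 2, 1]})

# One-hot encode only the nominal column; pass the ordinal one through as-is.
# sparse_threshold=0 asks for a dense array instead of a sparse matrix.
transformer = ColumnTransformer(
    [("city", OneHotEncoder(), ["city"])],
    remainder="passthrough",
    sparse_threshold=0,
)
X = transformer.fit_transform(df)
print(X)
```

The encoded city columns come first and the untouched quality column last, the same layout as the matrix shown above.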