How do I specify a variable in Pandas as ordinal/categorical?

Question

Asked by Baktaawar

I am trying to run some machine learning algorithms on a dataset using scikit-learn. My dataset has some features that are categorical. For example, one feature, A, takes the values 1, 2, 3, specifying the quality of something (1: Upper, 2: Second, 3: Third class), so it is an ordinal variable.

Similarly, I recoded a variable City, which has three values ('London', 'Zurich', 'New York'), into 1, 2, 3, but with no specific preference among the values. So this is a nominal categorical variable.

How do I tell the algorithm to treat these as categorical, ordinal, etc. in pandas? In R, for example, a categorical variable is specified with factor(a) and is therefore not treated as a continuous value. Is there anything like that in pandas/Python?

Answer 1

Answered by benjaminmgross

... years later (and because I think a good explanation of these issues is required not only for this question but to help remind myself in the future)

Ordinal vs. Nominal

In general, one would translate categorical variables into dummy variables (or use a host of other methodologies) because they are nominal, i.e. they have no sense of a > b > c. In the OP's original question, this would only be performed on the cities: London, Zurich, New York.

Dummy Variables for Nominal

For this type of issue, pandas provides by far the easiest transformation: pandas.get_dummies. So:

import numpy
import pandas

# create a sample of the OP's unique values
series = pandas.Series(
    numpy.random.randint(low=0, high=3, size=100))
mapper = {0: 'New York', 1: 'London', 2: 'Zurich'}
nomvar = series.replace(mapper)

# now let's use pandas.get_dummies
# (dtype=int keeps the 0/1 output on pandas 2.x, where the default is bool)
print(pandas.get_dummies(nomvar, dtype=int))

Out[57]:
    London  New York  Zurich
0        0         0       1
1        0         1       0
2        0         1       0
3        1         0       0
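
When the nominal column lives in a DataFrame next to an ordinal one, the same function can be pointed at just that column. The following is a minimal sketch, not part of the original answer; the DataFrame `df` and its column names are made up for illustration, loosely following the OP's description.

```python
import pandas as pd

# Hypothetical frame: 'quality' is ordinal, 'city' is nominal.
df = pd.DataFrame({
    "quality": [1, 2, 3],
    "city": ["London", "Zurich", "New York"],
})

# Expand only the nominal column into dummy columns; 'quality' is left as-is.
# dtype=int requests 0/1 integers instead of the boolean default in pandas 2.x.
encoded = pd.get_dummies(df, columns=["city"], dtype=int)
print(encoded)
```

The dummy columns are named with the original column name as a prefix (for example `city_London`), which keeps the mapping readable.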

Ordinal Encoding for Categorical Variables

However, in the case of ordinal variables, the user must be cautious when using pandas.factorize. The reason is that the engineer wants the mapping to preserve the relationship a > b > c.

So if I want to take a set of categorical variables where large > medium > small, and preserve that ordering, I need to make sure that pandas.factorize preserves that relationship.

# leveraging the variables already created above
mapper = {0: 'small', 1: 'medium', 2: 'large'}
ordvar = series.replace(mapper)

print(pandas.factorize(ordvar))

Out[58]:
(array([0, 1, 1, 2, 1,...  0, 0]),
Index(['large', 'small', 'medium'], dtype='object'))

In fact, the relationship that needs to be preserved in order to maintain the concept of ordinal has been lost using pandas.factorize: the codes follow the order of first appearance, so here large maps to 0, small to 1, and medium to 2. In an instance like this, I use my own mappings to ensure that the ordinal attributes are preserved.

preserved_mapper = {'large': 2, 'medium': 1, 'small': 0}
print(ordvar.replace(preserved_mapper))

Out[78]:
0     2
1     0
...
99    2
dtype: int64

Creating your own dict to map the values not only preserves the desired ordinal relationship but also helps keep the contents and mappings of your prediction algorithm organized: you lose no ordinal information in the process, and you have a stored record of what the mapping for each variable is.
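
Since the OP asked for something like R's factor, it may also help to see pandas' own ordered categorical dtype, which stores the ordering alongside the data. This is a sketch of an alternative to the hand-written dict, not the approach used in the answer above; the sample values and the names `sizes` and `size_type` are made up for illustration.

```python
import pandas as pd

# Hypothetical sample of ordinal labels.
sizes = pd.Series(["small", "large", "medium", "small"])

# Declare the categories in ascending order and mark them as ordered.
size_type = pd.CategoricalDtype(
    categories=["small", "medium", "large"], ordered=True)
ordered = sizes.astype(size_type)

# Integer codes follow the declared order: small=0, medium=1, large=2.
codes = ordered.cat.codes
print(codes.tolist())

# Ordered categoricals support comparisons against a category label.
print((ordered > "small").tolist())
```

The integer codes can then be handed to an estimator, while the dtype itself keeps a record of which label each code stands for.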

`int`s into `sklearn`

Lastly, the OP spoke about passing the information into scikit-learn classifiers, which means that ints are required. For that case, if you have any NaNs in your data, make sure you're aware of the astype(int) gotcha that is detailed here.
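
To make the gotcha concrete: a float column containing NaN cannot be cast to a plain integer dtype, so the missing values have to be handled first. The sketch below shows the failure and a few possible workarounds; the sample data and the -1 sentinel are assumptions for illustration, and whether a sentinel, dropping rows, or imputation is appropriate depends on the model.

```python
import numpy as np
import pandas as pd

# Hypothetical encoded column with one missing value.
codes = pd.Series([2.0, np.nan, 0.0, 1.0])

# Casting directly to int fails because NaN has no integer representation.
try:
    codes.astype(int)
    failed = False
except (ValueError, TypeError):
    failed = True
print("direct cast failed:", failed)

# Option 1: replace NaN with a sentinel value (assumed here to be -1).
filled = codes.fillna(-1).astype(int)

# Option 2: drop the rows with missing values before casting.
dropped = codes.dropna().astype(int)

# Option 3: pandas' nullable integer dtype keeps the gap as <NA>.
nullable = codes.astype("Int64")

print(filled.tolist())
print(dropped.tolist())
print(nullable.isna().sum())
```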

Answer 2

Answered by dukebody

You should use the OneHotEncoder transformer with the categorical variables, and leave the ordinal variable untouched:

>>> import pandas as pd
>>> from sklearn.preprocessing import OneHotEncoder
>>> df = pd.DataFrame({'quality': [1, 2, 3], 'city': [3, 2, 1]}, columns=['quality', 'city'])
>>> # categorical_features was removed in scikit-learn 0.22, so this call needs an older release
>>> enc = OneHotEncoder(categorical_features=[False, True])
>>> X = df.values
>>> enc.fit(X)
>>> enc.transform(X).todense()
matrix([[ 0.,  0.,  1.,  1.],
        [ 0.,  1.,  0.,  2.],
        [ 1.,  0.,  0.,  3.]])
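
On recent scikit-learn releases, the same idea (one-hot encode the nominal column, pass the ordinal one through) is usually expressed with ColumnTransformer. The following is a sketch under that assumption rather than the answer's original code; the transformer name "city" is an arbitrary label.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({"quality": [1, 2, 3], "city": [3, 2, 1]})

# One-hot encode 'city' and pass 'quality' through unchanged.
# sparse_threshold=0 asks for a dense array regardless of how sparse the result is.
ct = ColumnTransformer(
    transformers=[("city", OneHotEncoder(), ["city"])],
    remainder="passthrough",
    sparse_threshold=0,
)
X = ct.fit_transform(df)
print(X)
```

The one-hot columns come first and the passed-through column last, which matches the layout of the matrix shown above.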

Answer 3

Answered by dartdog

See https://pandas.pydata.org/pandas-docs/stable/generated/pandas.factorize.html and see this question: How to reformat categorical Pandas variables for Sci-kit Learn
