Notice: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original: http://stackoverflow.com/questions/43554821/

Time: 2020-09-14 03:27:04  Source: igfitidea

Feature preprocessing of both continuous and categorical variables (of integer type) with scikit-learn

python pandas machine-learning scikit-learn categorical-data

Asked by James Wong

The main goals are as follows:

1) Apply StandardScaler to continuous variables

2) Apply LabelEncoder and OneHotEncoder to categorical variables

The continuous variables need to be scaled, but a couple of the categorical variables are also of integer type, so applying StandardScaler to every numeric column would have undesired effects.

Specifically, StandardScaler would also scale the integer-based categorical variables, which is not what we want.
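
To make the concern concrete, here is a minimal sketch of what happens when StandardScaler is applied to integer category codes. The `season` values below are hypothetical (codes 1 to 4 chosen for illustration, not taken from the Kaggle data).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical integer codes for a categorical column such as `season`.
# The values are illustrative only.
season = np.array([[1], [2], [3], [4]])

scaled = StandardScaler().fit_transform(season)

# The codes are treated as magnitudes: each category becomes a point on
# a numeric axis, with an ordering and spacing the labels never had.
print(scaled.ravel())
```

The scaled column has zero mean and unit variance, but the numbers no longer identify categories; a model would read them as evenly spaced quantities.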

Since continuous and categorical variables are mixed in a single Pandas DataFrame, what is the recommended workflow for this kind of problem?

The best example to illustrate my point is the Kaggle Bike Sharing Demand dataset, where season and weather are integer categorical variables.

Answered by user1808924

Check out the sklearn_pandas.DataFrameMapper meta-transformer. Use it as the first step in your pipeline to perform column-wise data engineering operations:

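from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelBinarizer, StandardScaler
from sklearn_pandas import DataFrameMapper

# continuous_cols, categorical_cols, estimator and df are defined by the caller.
mapper = DataFrameMapper(
  # A list selector ([col]) passes a 2-D column, which StandardScaler requires.
  [([continuous_col], StandardScaler()) for continuous_col in continuous_cols] +
  # A string selector passes a 1-D column, which LabelBinarizer expects.
  [(categorical_col, LabelBinarizer()) for categorical_col in categorical_cols]
)
pipeline = Pipeline(
  [("mapper", mapper),
  ("estimator", estimator)]
)
# Use fit rather than fit_transform: a final estimator may not implement transform.
pipeline.fit(df, df["y"])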
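
For readers without sklearn_pandas installed, the same column-wise idea can be sketched with scikit-learn's built-in ColumnTransformer. This is a substituted technique, not the one the answer uses, and the toy DataFrame, its column names, and the choice of LinearRegression are all assumptions made for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data; the columns only loosely echo the Bike Sharing example.
df = pd.DataFrame({
    "temp": [9.8, 14.2, 22.1, 30.5, 18.0, 12.3],
    "humidity": [81, 75, 60, 45, 66, 70],
    "season": [1, 2, 3, 4, 2, 1],
    "weather": [1, 2, 1, 1, 3, 2],
    "y": [16, 40, 120, 210, 95, 33],
})
continuous_cols = ["temp", "humidity"]
categorical_cols = ["season", "weather"]

# Scale only the continuous columns and one-hot encode only the categorical
# ones. Unlisted columns (here "y") are dropped, since remainder="drop" is
# the default, so the target does not leak into the features.
preprocess = ColumnTransformer([
    ("scale", StandardScaler(), continuous_cols),
    ("onehot", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])
pipeline = Pipeline([
    ("preprocess", preprocess),
    ("estimator", LinearRegression()),  # assumed estimator, for illustration
])
pipeline.fit(df, df["y"])

# Inspect the engineered feature matrix produced by the first step.
features = pipeline.named_steps["preprocess"].transform(df)
print(features.shape)
```

Either approach keeps the scaler away from the integer-coded categories, which is the point of the column-wise mapping.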

Also, you should be using sklearn.preprocessing.LabelBinarizer instead of a list of [LabelEncoder(), OneHotEncoder()].
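
As a rough check of that advice, the sketch below compares the one-step LabelBinarizer with the two-step LabelEncoder plus OneHotEncoder route on a small array. The `weather` codes are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer, LabelEncoder, OneHotEncoder

# Hypothetical integer category codes, e.g. a `weather` column.
weather = np.array([1, 3, 2, 1, 4])

# One step: LabelBinarizer maps a 1-D label array straight to indicator columns.
one_step = LabelBinarizer().fit_transform(weather)

# Two steps: integer-encode, then one-hot encode the resulting 2-D column.
codes = LabelEncoder().fit_transform(weather).reshape(-1, 1)
two_step = OneHotEncoder().fit_transform(codes).toarray()

# Both routes yield the same indicator matrix for this input.
print(np.array_equal(one_step, two_step))
```

One caveat: when a column has exactly two classes, LabelBinarizer returns a single 0/1 column rather than two indicator columns, so the two routes differ in shape in that case.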