pandas: Feature preprocessing of both continuous and categorical variables (of integer type) with scikit-learn
Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/43554821/
Asked by James Wong
The main goals are as follows:
1) Apply StandardScaler to continuous variables
2) Apply LabelEncoder and OneHotEncoder to categorical variables
The continuous variables need to be scaled, but at the same time, a couple of categorical variables are also of integer type. Applying StandardScaler would result in undesired effects.
On the flip side, the StandardScaler would scale the integer-based categorical variables, which is also not what we want.
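To make the undesired effect concrete, here is a minimal sketch of what StandardScaler does to an integer-coded categorical column. The season codes 1 to 4 are made up for illustration and are not taken from any real dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical integer codes for a categorical column such as season (1..4).
season = np.array([[1], [2], [3], [4]])

# StandardScaler treats the codes as magnitudes and z-scores them,
# so the result implies an order and a distance between categories.
scaled = StandardScaler().fit_transform(season)
print(scaled.ravel())
```

The scaled values are evenly spaced around zero, which suggests to a downstream model that category 4 is "further" from category 1 than category 2 is. For a nominal variable that relationship is an artifact of the encoding.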
Since continuous variables and categorical ones are mixed in a single Pandas DataFrame, what's the recommended workflow to approach this kind of problem?
The best example to illustrate my point is the Kaggle Bike Sharing Demand dataset, where season and weather are integer categorical variables.
Answered by user1808924
Check out the sklearn_pandas.DataFrameMapper meta-transformer. Use it as the first step in your pipeline to perform column-wise data engineering operations:
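from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelBinarizer, StandardScaler
from sklearn_pandas import DataFrameMapper

# continuous_cols / categorical_cols: lists of column names; estimator: any scikit-learn model
mapper = DataFrameMapper(
    # A list selector ([col]) hands StandardScaler the 2-D input it expects
    [([continuous_col], StandardScaler()) for continuous_col in continuous_cols] +
    # A plain string selector hands LabelBinarizer a 1-D column
    [(categorical_col, LabelBinarizer()) for categorical_col in categorical_cols]
)
pipeline = Pipeline(
    [("mapper", mapper),
     ("estimator", estimator)]
)
# fit_transform only works when the last step is a transformer; with an estimator, call fit
pipeline.fit(df, df["y"])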
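For readers without sklearn_pandas installed, the same column-wise idea can be sketched with scikit-learn's own ColumnTransformer. This is a different tool from the DataFrameMapper recommended in this answer, not the answerer's method. The toy DataFrame, its values, the temp column, and the choice of LinearRegression are all assumptions made for illustration; only season and weather come from the question.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy stand-in for the mixed DataFrame; all values are made up.
df = pd.DataFrame({
    "temp":    [9.8, 14.2, 22.5, 30.1, 18.3, 25.7],  # continuous
    "season":  [1, 2, 3, 4, 2, 3],                    # integer-coded categorical
    "weather": [1, 1, 2, 3, 2, 1],                    # integer-coded categorical
    "y":       [16, 40, 110, 95, 70, 130],            # target
})
continuous_cols = ["temp"]
categorical_cols = ["season", "weather"]
X = df[continuous_cols + categorical_cols]

preprocess = ColumnTransformer(
    [
        # Scale only the continuous columns.
        ("scale", StandardScaler(), continuous_cols),
        # OneHotEncoder accepts integer categories directly.
        ("onehot", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ],
    sparse_threshold=0,  # always return a dense array
)
pipeline = Pipeline([("preprocess", preprocess), ("estimator", LinearRegression())])
pipeline.fit(X, df["y"])

# Inspect the engineered feature matrix produced by the fitted first step.
features = pipeline.named_steps["preprocess"].transform(X)
predictions = pipeline.predict(X)
print(features.shape)
```

Each transformer only sees the columns listed for it, so the integer-coded categories are never scaled and the continuous column is never one-hot encoded.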
Also, you should be using sklearn.preprocessing.LabelBinarizer instead of a list of [LabelEncoder(), OneHotEncoder()].
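As a rough illustration of why the single transformer is enough, the sketch below compares LabelBinarizer with the two-step LabelEncoder plus OneHotEncoder route on a made-up column of integer codes (the weather values are hypothetical).

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer, LabelEncoder, OneHotEncoder

# Hypothetical integer codes for a categorical column such as weather.
weather = np.array([1, 3, 2, 1, 3])

# One step: LabelBinarizer one-hot encodes the 1-D column directly.
one_step = LabelBinarizer().fit_transform(weather)

# Two steps: LabelEncoder maps codes to 0..n-1, then OneHotEncoder expands them.
codes = LabelEncoder().fit_transform(weather)
two_step = OneHotEncoder().fit_transform(codes.reshape(-1, 1)).toarray()

print(one_step)
print(np.array_equal(one_step, two_step))
```

One caveat worth knowing: when a column has only two distinct values, LabelBinarizer returns a single 0/1 column rather than two columns, so the two routes differ in that case.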