pandas: feature preprocessing of both continuous and categorical variables (of integer type) with scikit-learn

Question

Asked by James Wong

The main goals are as follows:

1) Apply StandardScaler to continuous variables

2) Apply LabelEncoder and OneHotEncoder to categorical variables

The continuous variables need to be scaled, but a couple of the categorical variables are also of integer type. Applying StandardScaler to the whole DataFrame would therefore have undesired effects.

On the flip side, StandardScaler would scale the integer-based categorical variables, which is also not what we want.
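
To make the concern concrete, here is a minimal sketch with made-up numbers. The column names `temp` and `season` follow the Bike Sharing dataset, but the values are invented for illustration. Scaling the `season` codes turns category labels into fractional magnitudes that no longer read as categories:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy data: the values are invented; only the column names mirror the dataset.
df = pd.DataFrame({
    "temp": [9.8, 14.2, 22.5, 30.1],  # continuous
    "season": [1, 2, 3, 4],           # categorical, stored as integers
})

# Scaling every column treats the season codes as if they were quantities.
scaled = StandardScaler().fit_transform(df[["temp", "season"]])
scaled_season = scaled[:, 1]

# The codes 1..4 become roughly -1.34, -0.45, 0.45, 1.34.
print(scaled_season)
```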

Since continuous and categorical variables are mixed in a single pandas DataFrame, what is the recommended workflow for this kind of problem?

The best example to illustrate my point is the Kaggle Bike Sharing Demand dataset, where season and weather are integer categorical variables.

Answer 1

Answered by user1808924

Check out the sklearn_pandas.DataFrameMapper meta-transformer. Use it as the first step in your pipeline to perform column-wise data engineering operations:

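from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelBinarizer, StandardScaler
from sklearn_pandas import DataFrameMapper

# df, continuous_cols, categorical_cols and estimator are assumed to be defined already.
# StandardScaler expects 2-D input, so each continuous column is selected as a list;
# LabelBinarizer expects 1-D input, so each categorical column is selected as a string.
mapper = DataFrameMapper(
  [([continuous_col], StandardScaler()) for continuous_col in continuous_cols] +
  [(categorical_col, LabelBinarizer()) for categorical_col in categorical_cols]
)
pipeline = Pipeline(
  [("mapper", mapper),
  ("estimator", estimator)]
)
# Use fit() rather than fit_transform() when the final step is a predictor.
pipeline.fit(df, df["y"])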
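
For readers without sklearn_pandas installed, the same column-wise idea can be sketched with scikit-learn's own ColumnTransformer. This is a substitute for the DataFrameMapper shown above, not the answer's method. The toy data, the `count` target column, and the choice of LinearRegression are assumptions made only so the example runs on its own:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Invented toy data shaped like the Bike Sharing example.
df = pd.DataFrame({
    "temp": [9.8, 14.2, 22.5, 30.1, 25.0, 12.3],
    "humidity": [81, 70, 55, 40, 60, 75],
    "season": [1, 2, 3, 3, 4, 1],
    "weather": [1, 2, 1, 1, 3, 2],
    "count": [16, 40, 120, 210, 150, 35],
})
continuous_cols = ["temp", "humidity"]
categorical_cols = ["season", "weather"]
X = df[continuous_cols + categorical_cols]
y = df["count"]

# Each transformer only sees the columns it is given.
preprocess = ColumnTransformer([
    ("scale", StandardScaler(), continuous_cols),
    ("onehot", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])
pipeline = Pipeline([
    ("preprocess", preprocess),
    ("estimator", LinearRegression()),
])

pipeline.fit(X, y)
predictions = pipeline.predict(X)

# 2 scaled columns + 4 season indicators + 3 weather indicators
features = pipeline.named_steps["preprocess"].transform(X)
print(features.shape)
```

The integer-coded columns are routed only to the encoder, so the scaler never touches them, which is the separation the question asks for.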

Also, you should use sklearn.preprocessing.LabelBinarizer instead of a list of [LabelEncoder(), OneHotEncoder()].
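
As a rough check of that claim, the sketch below compares the two routes on an invented integer column (the `season` values are made up for illustration). With more than two classes, both produce the same indicator matrix:

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer, LabelEncoder, OneHotEncoder

season = np.array([1, 2, 3, 4, 2, 1])  # invented integer category codes

# One step: LabelBinarizer takes the 1-D column directly.
one_step = LabelBinarizer().fit_transform(season)

# Two steps: encode to 0..n-1, then one-hot encode a 2-D view of the codes.
codes = LabelEncoder().fit_transform(season)
two_step = OneHotEncoder().fit_transform(codes.reshape(-1, 1)).toarray()

print(one_step.shape)                      # (6, 4)
print(np.array_equal(one_step, two_step))  # True
```

One caveat worth knowing: when a column has exactly two classes, LabelBinarizer returns a single 0/1 column rather than two indicator columns.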