Python: How to retain the column headers of a data frame after pre-processing in scikit-learn

Note: this page reproduces a popular Stack Overflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license and attribute it to the original authors (not me), including the original URL and author information. Original: http://stackoverflow.com/questions/29586323/

Time: 2020-08-19 04:45:08  Source: igfitidea

How to retain column headers of data frame after Pre-processing in scikit-learn

Tags: python, numpy, pandas, scikit-learn

Asked by Baktaawar

I have a pandas data frame with some rows and columns, and each column has a header. As long as I keep doing data manipulation in pandas, the column headers are retained. But if I use one of scikit-learn's data pre-processing features, I lose all my headers and the frame is converted to just a matrix of numbers.

I understand why this happens: scikit-learn returns a NumPy ndarray as output, and an ndarray, being just a matrix, has no column names.

But here is the thing. When I build a model on my dataset, even after the initial pre-processing and a first model, I might have to do more data manipulation to run another model for a better fit. Not being able to access the column headers makes that difficult, because I might not know the index of a particular variable, whereas it is easy to remember a variable name or look it up with df.columns.

How can I overcome this?

EDIT1: Editing with sample data snapshot.

    Pclass  Sex  Age  SibSp  Parch     Fare  Embarked
 0       3    0   22      1      0   7.2500         1
 1       1    1   38      1      0  71.2833         2
 2       3    1   26      0      0   7.9250         1
 3       1    1   35      1      0  53.1000         1
 4       3    0   35      0      0   8.0500         1
 5       3    0  NaN      0      0   8.4583         3
 6       1    0   54      0      0  51.8625         1
 7       3    0    2      3      1  21.0750         1
 8       3    1   27      0      2  11.1333         1
 9       2    1   14      1      0  30.0708         2
10       3    1    4      1      1  16.7000         1
11       1    1   58      0      0  26.5500         1
12       3    0   20      0      0   8.0500         1
13       3    0   39      1      5  31.2750         1
14       3    1   14      0      0   7.8542         1
15       2    1   55      0      0  16.0000         1

The above is the pandas data frame. When I run the following on it, the column headers are stripped:

from sklearn import preprocessing

# Imputer was removed in scikit-learn 0.22; sklearn.impute.SimpleImputer replaces it
X_imputed = preprocessing.Imputer().fit_transform(X_train)
X_imputed

The result is a NumPy array, so the column names are gone:

array([[  3.        ,   0.        ,  22.        , ...,   0.        ,
          7.25      ,   1.        ],
       [  1.        ,   1.        ,  38.        , ...,   0.        ,
         71.2833    ,   2.        ],
       [  3.        ,   1.        ,  26.        , ...,   0.        ,
          7.925     ,   1.        ],
       ..., 
       [  3.        ,   1.        ,  29.69911765, ...,   2.        ,
         23.45      ,   1.        ],
       [  1.        ,   0.        ,  26.        , ...,   0.        ,
         30.        ,   2.        ],
       [  3.        ,   0.        ,  32.        , ...,   0.        ,
          7.75      ,   3.        ]])

So I want to retain the column names when I do some data manipulation on my pandas data frame.

Answered by selwyth

scikit-learn does strip the column headers in most cases, so just add them back afterward. In your example, with X_imputed as the sklearn.preprocessing output and X_train as the original data frame, you can put the column headers back with:

import pandas as pd
X_imputed_df = pd.DataFrame(X_imputed, columns=X_train.columns, index=X_train.index)
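
To make this concrete, here is a minimal end-to-end sketch. It is not part of the original answer: the small X_train frame is a hypothetical stand-in built from a few rows of the question's data, and SimpleImputer is used in place of the older Imputer class because the latter is no longer available in current scikit-learn releases.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical stand-in for X_train, using a few rows from the question
X_train = pd.DataFrame(
    {
        "Pclass": [3, 1, 3, 3],
        "Age": [22.0, 38.0, 26.0, np.nan],
        "Fare": [7.25, 71.2833, 7.925, 8.4583],
    },
    index=[0, 1, 2, 5],
)

# fit_transform returns a plain ndarray here, so the labels are lost
X_imputed = SimpleImputer(strategy="mean").fit_transform(X_train)

# Re-attach the column names (and the original row index)
X_imputed_df = pd.DataFrame(
    X_imputed, columns=X_train.columns, index=X_train.index
)
print(X_imputed_df)
```

Passing index=X_train.index as well keeps the row labels aligned with the original frame, which matters if rows were filtered or shuffled before imputation.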

Answered by AChervony

According to a reply by Ami Tavory, and per the documentation, Imputer omits empty columns or rows (depending on the axis you run it along).
Thus, before running the Imputer and setting the column names as described above, run something like this (for columns):

X_train = X_train.dropna(axis=1, how='all')

df.dropna is described in the pandas documentation.
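
The sketch below illustrates why this step matters. It is an illustration under assumptions rather than the answerer's code: the frame and its all-missing "Cabin" column are hypothetical, and SimpleImputer with default settings stands in for the older Imputer.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical frame in which one column ("Cabin") is entirely missing
X_train = pd.DataFrame(
    {
        "Age": [22.0, np.nan, 26.0],
        "Cabin": [np.nan, np.nan, np.nan],
        "Fare": [7.25, 71.2833, 7.925],
    }
)

# With default settings the imputer discards the all-NaN column, so the
# output has fewer columns than X_train.columns has names
raw = SimpleImputer().fit_transform(X_train)

# Dropping all-NaN columns first keeps names and data in step
X_clean = X_train.dropna(axis=1, how="all")
X_imputed = SimpleImputer().fit_transform(X_clean)
X_imputed_df = pd.DataFrame(
    X_imputed, columns=X_clean.columns, index=X_clean.index
)
print(X_imputed_df)
```

Without the dropna call, passing X_train.columns to pd.DataFrame would fail because the number of names would not match the number of columns in the imputed array.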

Answered by Anya Linley

Adapted from part of the intermediate machine learning course on Kaggle:

import pandas as pd
from sklearn.impute import SimpleImputer

# Imputation
my_imputer = SimpleImputer()
imputed_X = pd.DataFrame(my_imputer.fit_transform(X))

# Imputation removed column names; put them back
imputed_X.columns = X.columns