pandas: Data imputation with fancyimpute and pandas

Notice: this page is derived from a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/45239256/

Time: 2020-09-14 04:03:49  Source: igfitidea

Data imputation with fancyimpute and pandas

python python-3.x pandas imputation fancyimpute

Asked by Rachel

I have a large pandas data frame df. It has quite a few missing values. Dropping rows or columns is not an option. Imputing medians, means, or the most frequent values is not an option either (hence imputation with pandas and/or scikit unfortunately doesn't do the trick).

I came across what seems to be a neat package called fancyimpute (you can find it here). But I have some problems with it.

Here is what I do:

# the necessary imports
import pandas as pd
import numpy as np
from fancyimpute import KNN

# df is my data frame with the missing values. I keep only floats.
# (np.float was an alias for the builtin float and was removed in NumPy 1.24.)
df_numeric = df.select_dtypes(include=[float])

# I now run fancyimpute KNN,
# it returns a np.array which I store as a pandas dataframe
df_filled = pd.DataFrame(KNN(3).complete(df_numeric))

However, df_filled is somehow a single vector instead of the filled data frame. How do I get hold of the data frame with the imputed values?

Update

I realized that fancyimpute needs a numpy array. I hence converted df_numeric to an array using as_matrix().

# df is my data frame with the missing values. I keep only floats.
# (as_matrix() was removed in pandas 1.0; .to_numpy() is the current equivalent.)
df_numeric = df.select_dtypes(include=[float]).as_matrix()

# I now run fancyimpute KNN,
# it returns a np.array which I store as a pandas dataframe
df_filled = pd.DataFrame(KNN(3).complete(df_numeric))

The output is a dataframe with the column labels gone missing. Any way to retrieve the labels?

Accepted answer by NicolasWoloszko

df = pd.DataFrame(data=mice.complete(d), columns=d.columns, index=d.index)

The np.array returned by the .complete() method of the fancyimpute object (be it mice or KNN) is passed as the content (argument data=) of a pandas dataframe whose columns and index are the same as those of the original data frame.
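The relabeling step can be tried without fancyimpute installed. The sketch below is only an illustration: mean_fill is a hypothetical stand-in for the imputer call (complete in the line above, fit_transform in the later answers), and the frame d with its column and index names is made up for the example.

```python
import numpy as np
import pandas as pd


def mean_fill(arr):
    """Hypothetical stand-in for an imputer: returns a plain array with NaN replaced by column means."""
    arr = np.asarray(arr, dtype=float)
    col_means = np.nanmean(arr, axis=0)
    return np.where(np.isnan(arr), col_means, arr)


# Made-up frame with missing values and a non-default index
d = pd.DataFrame(
    {"height": [1.0, np.nan, 3.0], "weight": [10.0, 20.0, np.nan]},
    index=["a", "b", "c"],
)

# The imputer only sees (and returns) a bare array, so the labels are reattached here
filled = pd.DataFrame(data=mean_fill(d.to_numpy()), columns=d.columns, index=d.index)
print(filled)
```

Because the array keeps the row and column order of the input, reusing d.columns and d.index lines the labels up with the right values.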

Answer by Miriam Farber

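Add the following lines after your code (this assumes df_numeric is still a DataFrame, i.e. before the as_matrix() conversion from the update):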

df_filled.columns = df_numeric.columns
df_filled.index = df_numeric.index

Answer by jander081

I see the frustration with fancyimpute and pandas. Here is a fairly basic wrapper that subclasses SoftImpute and overrides fit_transform, delegating to the parent implementation. It takes in and outputs a dataframe, with column names intact. These sorts of wrappers work well with pipelines.

from fancyimpute import SoftImpute
import pandas as pd

class SoftImputeDf(SoftImpute):
    """DataFrame Wrapper around SoftImpute"""

    def __init__(self, shrinkage_value=None, convergence_threshold=0.001,
                 max_iters=100,max_rank=None,n_power_iterations=1,init_fill_method="zero",
                 min_value=None,max_value=None,normalizer=None,verbose=True):

        super(SoftImputeDf, self).__init__(shrinkage_value=shrinkage_value, 
                                           convergence_threshold=convergence_threshold,
                                           max_iters=max_iters,max_rank=max_rank,
                                           n_power_iterations=n_power_iterations,
                                           init_fill_method=init_fill_method,
                                           min_value=min_value,max_value=max_value,
                                           normalizer=normalizer,verbose=False)



    def fit_transform(self, X, y=None):

        assert isinstance(X, pd.DataFrame), "Must be pandas dframe"

        # Work on a copy so the caller's frame is not modified
        X = X.copy()

        # Columns with fewer than 10 missing values are zero-filled up front
        for col in X.columns:
            if X[col].isnull().sum() < 10:
                X[col] = X[col].fillna(0.0)

        z = super(SoftImputeDf, self).fit_transform(X.values)
        return pd.DataFrame(z, index=X.index, columns=X.columns)
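The same idea can also be expressed with composition instead of inheritance, so one wrapper serves any imputer that exposes fit_transform. This is a rough sketch, not part of fancyimpute: DataFrameImputer and ColumnMeanImputer are hypothetical names, and the mean-filling class merely stands in for a real solver such as SoftImpute.

```python
import numpy as np
import pandas as pd


class ColumnMeanImputer:
    """Hypothetical stand-in for a fancyimpute solver: fills NaN with column means."""

    def fit_transform(self, X):
        X = np.asarray(X, dtype=float)
        return np.where(np.isnan(X), np.nanmean(X, axis=0), X)


class DataFrameImputer:
    """Hypothetical helper wrapping any object with fit_transform(array) -> array."""

    def __init__(self, imputer):
        self.imputer = imputer

    def fit_transform(self, X, y=None):
        if not isinstance(X, pd.DataFrame):
            raise TypeError("X must be a pandas DataFrame")
        # Hand the solver a bare array, then reattach the original labels
        filled = self.imputer.fit_transform(X.to_numpy())
        return pd.DataFrame(filled, index=X.index, columns=X.columns)


frame = pd.DataFrame({"a": [1.0, np.nan], "b": [np.nan, 4.0]}, index=[10, 20])
result = DataFrameImputer(ColumnMeanImputer()).fit_transform(frame)
print(result)
```

Swapping in a different solver then only means passing another object to the constructor; the input frame itself is left untouched.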

Answer by Beau Hilton

I really appreciate @jander081's approach, and expanded on it a tiny bit to deal with categorical columns. I had a problem where the categorical columns would lose their category dtype and cause errors during training, so I modified the code as follows:

from fancyimpute import SoftImpute
import pandas as pd

class SoftImputeDf(SoftImpute):
    """DataFrame Wrapper around SoftImpute"""

    def __init__(self, shrinkage_value=None, convergence_threshold=0.001,
                 max_iters=100,max_rank=None,n_power_iterations=1,init_fill_method="zero",
                 min_value=None,max_value=None,normalizer=None,verbose=True):

        super(SoftImputeDf, self).__init__(shrinkage_value=shrinkage_value, 
                                           convergence_threshold=convergence_threshold,
                                           max_iters=max_iters,max_rank=max_rank,
                                           n_power_iterations=n_power_iterations,
                                           init_fill_method=init_fill_method,
                                           min_value=min_value,max_value=max_value,
                                           normalizer=normalizer,verbose=False)



    def fit_transform(self, X, y=None):

        assert isinstance(X, pd.DataFrame), "Must be pandas dframe"

        # Work on a copy so the caller's frame is not modified
        X = X.copy()

        # Columns with fewer than 10 missing values are zero-filled up front
        for col in X.columns:
            if X[col].isnull().sum() < 10:
                X[col] = X[col].fillna(0.0)

        z = super(SoftImputeDf, self).fit_transform(X.values)
        df = pd.DataFrame(z, index=X.index, columns=X.columns)

        # Restore the categorical dtype that is lost in the round trip through NumPy
        cats = list(X.select_dtypes(include='category'))
        df[cats] = df[cats].astype('category')

        return df
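The dtype loss that this version works around can be reproduced on a small frame. The following sketch makes several assumptions: mean_fill is a hypothetical stand-in for the solver, the column names are invented, and the categorical column holds numeric codes so it can be cast to float before imputation.

```python
import numpy as np
import pandas as pd


def mean_fill(arr):
    """Hypothetical stand-in for the solver: fills NaN with column means."""
    arr = np.asarray(arr, dtype=float)
    return np.where(np.isnan(arr), np.nanmean(arr, axis=0), arr)


X = pd.DataFrame(
    {
        "score": [0.5, np.nan, 1.5],
        "grade": pd.Categorical([1.0, 2.0, 1.0]),  # numeric codes stored as a category
    }
)

# Remember which columns were categorical before they pass through NumPy
cats = list(X.select_dtypes(include="category").columns)

# A float array carries no per-column dtype information
arr = X.astype(float).to_numpy()
filled = pd.DataFrame(mean_fill(arr), index=X.index, columns=X.columns)
print(filled.dtypes)  # every column is float64 at this point

# Reapply the category dtype to the remembered columns
filled = filled.astype({col: "category" for col in cats})
print(filled.dtypes)
```

Note that the restored categories are the float values present after imputation, so an imputed value can introduce a category the original column did not have.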