pandas: data imputation with fancyimpute and pandas

Question

Asked by Rachel

I have a large pandas data frame df with quite a few missing values. Dropping rows or columns is not an option. Imputing medians, means or the most frequent values is not an option either (so imputation with pandas and/or scikit unfortunately doesn't do the trick).

I came across what seems to be a neat package called fancyimpute (you can find it here). But I have some problems with it.

Here is what I do:

# the necessary imports
import pandas as pd
import numpy as np
from fancyimpute import KNN

# df is my data frame with the missings. I keep only floats
df_numeric = df.select_dtypes(include=[np.float])

# I now run fancyimpute KNN, 
# it returns a np.array which I store as a pandas dataframe
df_filled = pd.DataFrame(KNN(3).complete(df_numeric))

However, df_filled somehow ends up as a single vector instead of the filled data frame. How do I get hold of the data frame with the imputed values?

Update

I realized that fancyimpute needs a numpy array. I therefore converted df_numeric to an array using as_matrix().

# df is my data frame with the missings. I keep only floats
df_numeric = df.select_dtypes(include=[np.float]).as_matrix()

# I now run fancyimpute KNN, 
# it returns a np.array which I store as a pandas dataframe
df_filled = pd.DataFrame(KNN(3).complete(df_numeric))

The output is a data frame whose column labels have gone missing. Is there any way to retrieve the labels?

Answer 1

Accepted answer by NicolasWoloszko

df=pd.DataFrame(data=mice.complete(d), columns=d.columns, index=d.index)

The np.array returned by the .complete() method of the fancyimpute object (be it mice or KNN) is passed as the content (argument data=) of a pandas data frame whose columns and index are the same as those of the original data frame d.
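To make the accepted answer concrete, here is a minimal runnable sketch. It is not the answerer's code: `impute_array` is a hypothetical stand-in for a call such as `KNN(3).complete(...)`, and it fills with column means only so the example runs without fancyimpute installed. (As far as I know, newer fancyimpute releases expose `fit_transform` rather than `complete`, as the wrapper answers in this thread do.) The part that matters is the last step, where the original labels are passed back in.

```python
import numpy as np
import pandas as pd


def impute_array(values: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for an imputer call such as KNN(3).complete(...).

    It fills each NaN with its column mean purely so this sketch is runnable;
    like the real imputers, it takes and returns a plain NumPy array.
    """
    filled = values.copy()
    col_means = np.nanmean(filled, axis=0)
    rows, cols = np.where(np.isnan(filled))
    filled[rows, cols] = col_means[cols]
    return filled


d = pd.DataFrame(
    {"height": [1.0, np.nan, 3.0], "weight": [10.0, 20.0, np.nan]},
    index=["a", "b", "c"],
)

# The array that comes back carries no labels, so pass the originals explicitly.
df = pd.DataFrame(data=impute_array(d.to_numpy()), columns=d.columns, index=d.index)
print(df)
```

Because `columns=` and `index=` are taken from `d`, the result lines up row for row with the original frame and can be joined back to any non-numeric columns that were set aside.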

Answer 2

Answered by Miriam Farber

Add the following lines after your code:

df_filled.columns = df_numeric.columns
df_filled.index = df_numeric.index
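
Note that these two lines only work while df_numeric is still a data frame. In the question's update it has already been turned into an array, which has no .columns or .index. A small sketch of one way to arrange this, under some assumptions: the frame is kept and converted only at the call site, `to_numpy()` and `"float64"` are used in place of `as_matrix()` and `np.float` (both removed in later pandas and NumPy releases), and the zero fill is a placeholder for the real imputer call, not an imputation method.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"x": [1.0, np.nan, 3.0], "y": [4.0, 5.0, np.nan], "name": ["a", "b", "c"]},
    index=[10, 20, 30],
)

# Keep the float columns as a DataFrame so the labels stay available.
df_numeric = df.select_dtypes(include=["float64"])

# Placeholder for KNN(3).complete(...): anything that returns a bare ndarray.
# Zero-filling is used only to make this sketch runnable.
raw = df_numeric.to_numpy()
filled_array = np.where(np.isnan(raw), 0.0, raw)

df_filled = pd.DataFrame(filled_array)   # default labels: 0, 1 and 0, 1, 2
df_filled.columns = df_numeric.columns   # restore column labels
df_filled.index = df_numeric.index       # restore row labels
print(df_filled)
```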

Answer 3

Answered by jander081

I see the frustration with fancyimpute and pandas. Here is a fairly basic wrapper using the recursive override method: it overrides fit_transform and calls the parent's version. It takes in and returns a data frame, with the column names intact. Wrappers of this sort work well with pipelines.

import pandas as pd
from fancyimpute import SoftImpute

class SoftImputeDf(SoftImpute):
    """DataFrame Wrapper around SoftImpute"""

    def __init__(self, shrinkage_value=None, convergence_threshold=0.001,
                 max_iters=100,max_rank=None,n_power_iterations=1,init_fill_method="zero",
                 min_value=None,max_value=None,normalizer=None,verbose=True):

        super(SoftImputeDf, self).__init__(shrinkage_value=shrinkage_value, 
                                           convergence_threshold=convergence_threshold,
                                           max_iters=max_iters,max_rank=max_rank,
                                           n_power_iterations=n_power_iterations,
                                           init_fill_method=init_fill_method,
                                           min_value=min_value,max_value=max_value,
                                           normalizer=normalizer,verbose=False)



    def fit_transform(self, X, y=None):

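        assert isinstance(X, pd.DataFrame), "Must be pandas dframe"

        # Work on a copy so the caller's frame is not modified
        X = X.copy()
        for col in X.columns:
            # Columns with fewer than 10 missing values are zero-filled up front
            if X[col].isnull().sum() < 10:
                X[col] = X[col].fillna(0.0)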

        z = super(SoftImputeDf, self).fit_transform(X.values)
        return pd.DataFrame(z, index=X.index, columns=X.columns)
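
The same override pattern can be tried out without fancyimpute. In the sketch below, `MeanFillArray` is a hypothetical array-only imputer standing in for SoftImpute (it fills with column means and is not the SoftImpute algorithm), and `MeanFillDf` is the data frame wrapper around it. Both class names are made up for illustration.

```python
import numpy as np
import pandas as pd


class MeanFillArray:
    """Hypothetical array-only imputer standing in for fancyimpute's SoftImpute."""

    def fit_transform(self, X, y=None):
        X = np.asarray(X, dtype=float)
        col_means = np.nanmean(X, axis=0)
        # col_means broadcasts across rows, so each NaN gets its column's mean
        return np.where(np.isnan(X), col_means, X)


class MeanFillDf(MeanFillArray):
    """DataFrame in, DataFrame out: the same override pattern as SoftImputeDf."""

    def fit_transform(self, X, y=None):
        if not isinstance(X, pd.DataFrame):
            raise TypeError("X must be a pandas DataFrame")
        # Delegate the numeric work to the parent, then re-attach the labels
        z = super().fit_transform(X.to_numpy())
        return pd.DataFrame(z, index=X.index, columns=X.columns)


X = pd.DataFrame(
    {"a": [1.0, np.nan, 3.0], "b": [np.nan, 2.0, 4.0]},
    index=["r1", "r2", "r3"],
)
out = MeanFillDf().fit_transform(X)
print(out)
```

Since the subclass keeps the parent's fit_transform signature, it can be dropped in wherever the array version was used, and the input frame is left untouched.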

Answer 4

Answered by Beau Hilton

I really appreciate @jander081's approach, and expanded on it a tiny bit to deal with categorical columns. I had a problem where the categorical columns would lose their category dtype and cause errors during training, so I modified the code as follows:

from fancyimpute import SoftImpute
import pandas as pd

class SoftImputeDf(SoftImpute):
    """DataFrame Wrapper around SoftImpute"""

    def __init__(self, shrinkage_value=None, convergence_threshold=0.001,
                 max_iters=100,max_rank=None,n_power_iterations=1,init_fill_method="zero",
                 min_value=None,max_value=None,normalizer=None,verbose=True):

        super(SoftImputeDf, self).__init__(shrinkage_value=shrinkage_value, 
                                           convergence_threshold=convergence_threshold,
                                           max_iters=max_iters,max_rank=max_rank,
                                           n_power_iterations=n_power_iterations,
                                           init_fill_method=init_fill_method,
                                           min_value=min_value,max_value=max_value,
                                           normalizer=normalizer,verbose=False)



    def fit_transform(self, X, y=None):

        assert isinstance(X, pd.DataFrame), "Must be pandas dframe"

        # Work on a copy so the caller's frame is not modified
        X = X.copy()
        for col in X.columns:
            # Columns with fewer than 10 missing values are zero-filled up front
            if X[col].isnull().sum() < 10:
                X[col] = X[col].fillna(0.0)

        z = super(SoftImputeDf, self).fit_transform(X.values)
        df = pd.DataFrame(z, index=X.index, columns=X.columns)

        # Re-apply the category dtype to columns that were categorical in X
        cats = list(X.select_dtypes(include='category'))
        df[cats] = df[cats].astype('category')

        return df
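
The dtype round trip can be seen in isolation with a small sketch. The assumptions here: the categorical column holds numeric values, the median fill is a placeholder for SoftImpute's fit_transform, and `astype` with a dict is used for the re-cast instead of the column assignment above.

```python
import numpy as np
import pandas as pd

X = pd.DataFrame(
    {
        "grade": pd.Categorical([1.0, 2.0, np.nan, 2.0]),
        "score": [0.5, np.nan, 0.7, 0.9],
    }
)

# Remember which columns were categorical before they pass through NumPy.
cats = list(X.select_dtypes(include="category").columns)

# Placeholder for SoftImpute.fit_transform: fill NaNs with the column median.
values = X.astype(float).to_numpy()
medians = np.nanmedian(values, axis=0)
z = np.where(np.isnan(values), medians, values)

df = pd.DataFrame(z, index=X.index, columns=X.columns)  # every column is float64 here
df = df.astype({col: "category" for col in cats})       # re-apply the categorical dtype
print(df.dtypes)
```

One caveat: an imputer such as SoftImpute may return values that were never among the original categories (for example 1.4), and the re-cast will then create new categories for them. Rounding or snapping to the nearest original category before casting may be needed, depending on the model.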
