Python pandas DataFrame: replace nan values with the column average

Notice: this page is a Chinese-English side-by-side translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverFlow. Original URL: http://stackoverflow.com/questions/18689823/

Date: 2020-08-19 11:29:22  Source: igfitidea

pandas DataFrame: replace nan values with average of columns

python pandas nan

Asked by piokuc

I've got a pandas DataFrame filled mostly with real numbers, but there are a few nan values in it as well.

How can I replace the nans with the averages of the columns they are in?

This question is very similar to this one: "numpy array: replace nan values with average of columns", but unfortunately the solution given there doesn't work for a pandas DataFrame.

Accepted answer by bmu

You can simply use DataFrame.fillna to fill the nans directly:

In [27]: df 
Out[27]: 
          A         B         C
0 -0.166919  0.979728 -0.632955
1 -0.297953 -0.912674 -1.365463
2 -0.120211 -0.540679 -0.680481
3       NaN -2.027325  1.533582
4       NaN       NaN  0.461821
5 -0.788073       NaN       NaN
6 -0.916080 -0.612343       NaN
7 -0.887858  1.033826       NaN
8  1.948430  1.025011 -2.982224
9  0.019698 -0.795876 -0.046431

In [28]: df.mean()
Out[28]: 
A   -0.151121
B   -0.231291
C   -0.530307
dtype: float64

In [29]: df.fillna(df.mean())
Out[29]: 
          A         B         C
0 -0.166919  0.979728 -0.632955
1 -0.297953 -0.912674 -1.365463
2 -0.120211 -0.540679 -0.680481
3 -0.151121 -2.027325  1.533582
4 -0.151121 -0.231291  0.461821
5 -0.788073 -0.231291 -0.530307
6 -0.916080 -0.612343 -0.530307
7 -0.887858  1.033826 -0.530307
8  1.948430  1.025011 -2.982224
9  0.019698 -0.795876 -0.046431

The docstring of fillna says that value should be a scalar or a dict; however, it seems to work with a Series as well. If you want to pass a dict, you could use df.mean().to_dict().
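
As a minimal self-contained sketch of the approach above, the following uses a small made-up frame (the column names and values are illustrative assumptions, not data from the answer). It shows both the Series form and the dict form; passing numeric_only=True to mean() is one way to keep a non-numeric column out of the averages.

```python
import numpy as np
import pandas as pd

# Hypothetical example data; values are made up for illustration.
df = pd.DataFrame({
    "A": [1.0, np.nan, 3.0],
    "B": [np.nan, 4.0, 8.0],
    "label": ["x", "y", "z"],  # non-numeric column, left untouched
})

# Column means over the numeric columns only (NaN is skipped by default).
col_means = df.mean(numeric_only=True)

# Fill with a Series of means; each NaN gets the mean of its own column.
filled_from_series = df.fillna(col_means)

# Equivalent call using a plain dict of {column: mean}.
filled_from_dict = df.fillna(col_means.to_dict())
```

Because the fill values are matched to columns by label, columns that have no entry in the Series or dict are simply not filled.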

Answer by Jeff

In [16]: df = DataFrame(np.random.randn(10,3))

In [17]: df.iloc[3:5,0] = np.nan

In [18]: df.iloc[4:6,1] = np.nan

In [19]: df.iloc[5:8,2] = np.nan

In [20]: df
Out[20]: 
          0         1         2
0  1.148272  0.227366 -2.368136
1 -0.820823  1.071471 -0.784713
2  0.157913  0.602857  0.665034
3       NaN -0.985188 -0.324136
4       NaN       NaN  0.238512
5  0.769657       NaN       NaN
6  0.141951  0.326064       NaN
7 -1.694475 -0.523440       NaN
8  0.352556 -0.551487 -1.639298
9 -2.067324 -0.492617 -1.675794

In [22]: df.mean()
Out[22]: 
0   -0.251534
1   -0.040622
2   -0.841219
dtype: float64

Apply the fill per column, using each column's own mean:

In [23]: df.apply(lambda x: x.fillna(x.mean()),axis=0)
Out[23]: 
          0         1         2
0  1.148272  0.227366 -2.368136
1 -0.820823  1.071471 -0.784713
2  0.157913  0.602857  0.665034
3 -0.251534 -0.985188 -0.324136
4 -0.251534 -0.040622  0.238512
5  0.769657 -0.040622 -0.841219
6  0.141951  0.326064 -0.841219
7 -1.694475 -0.523440 -0.841219
8  0.352556 -0.551487 -1.639298
9 -2.067324 -0.492617 -1.675794
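
The session above can be reproduced as a standalone script along these lines. This is only a sketch: the fixed seed and the use of np.random.default_rng are assumptions added for reproducibility, so the numbers will differ from the output shown. It also checks that the per-column apply gives the same result as a single df.fillna(df.mean()) call.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, an assumption for reproducibility
df = pd.DataFrame(rng.standard_normal((10, 3)))

# Blank out the same cells as in the session above.
df.iloc[3:5, 0] = np.nan
df.iloc[4:6, 1] = np.nan
df.iloc[5:8, 2] = np.nan

# Fill each column with its own mean, one column at a time.
filled_apply = df.apply(lambda col: col.fillna(col.mean()), axis=0)

# Same result in a single call.
filled_direct = df.fillna(df.mean())

print(np.allclose(filled_apply.to_numpy(), filled_direct.to_numpy()))
```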

Answer by Ammar Shigri

Try:

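# Assign the result back: fillna(..., inplace=True) on a selected column may not update sub2 in recent pandas versions
sub2['income'] = sub2['income'].fillna(sub2['income'].mean())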

Answer by guibor

Another option besides those above is:

df = df.groupby(df.columns, axis = 1).transform(lambda x: x.fillna(x.mean()))

It is less elegant than the previous answers for the mean, but it can be shorter if you want to replace nulls using some other column function.
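
To illustrate swapping in another column function, here is a minimal sketch that uses a per-column apply rather than the groupby call (to my knowledge, groupby(..., axis=1) is deprecated in recent pandas releases). The helper name fill_with and the sample data are hypothetical, introduced only for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one outlier in column "a".
df = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 100.0],
    "b": [np.nan, 10.0, 20.0, 30.0],
})

def fill_with(frame, stat):
    """Fill NaNs in each column with stat(column), e.g. a mean or a median."""
    return frame.apply(lambda col: col.fillna(stat(col)))

# Any function that maps a column to a scalar can be plugged in.
filled_median = fill_with(df, lambda col: col.median())
filled_max = fill_with(df, lambda col: col.max())
```

Only the statistic changes between calls, so the fill logic itself is written once.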

Answer by Pranay Aryal

If you want to impute missing values with the mean and work column by column, this imputes a column using only that column's mean. It might be a little more readable.

sub2['income'] = sub2['income'].fillna((sub2['income'].mean()))

Answer by Roshan jha

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Read the data from a CSV file
Dataset = pd.read_csv('Data.csv')

# Feature matrix: every column except the last
X = Dataset.iloc[:, :-1].values

# Replace NaNs in columns 1 and 2 with the column means
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer = imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])
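
For readers without a Data.csv at hand, the same idea can be sketched on an in-memory frame. The column names and values below are assumptions made up for illustration. Since fit_transform returns a NumPy array by default, the sketch wraps the result back into a DataFrame to keep the column labels.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical in-memory data standing in for Data.csv.
df = pd.DataFrame({
    "age": [20.0, np.nan, 40.0],
    "salary": [1000.0, 3000.0, np.nan],
})

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")

# fit_transform returns a NumPy array, so rebuild the DataFrame around it.
imputed = pd.DataFrame(
    imputer.fit_transform(df),
    columns=df.columns,
    index=df.index,
)
```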

Answer by Sunny Barnwal

Use df.fillna(df.mean()) directly to fill every null value with the mean of its column.

To fill the null values of a single column with that column's mean, use the following. Here Item_Weight is the column name; the right-hand side fills the nulls in df['Item_Weight'] with its mean, and the result is assigned back to the same column.

df['Item_Weight'] = df['Item_Weight'].fillna((df['Item_Weight'].mean()))

If you want to fill null values with a string instead, use the following, where Outlet_Size is the column name:

df.Outlet_Size = df.Outlet_Size.fillna('Missing')
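
The two fills above can also be combined into one call by passing fillna a dict with one fill value per column. This is a sketch on made-up rows; only the column names come from the answer.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; the column names follow the answer above.
df = pd.DataFrame({
    "Item_Weight": [9.0, np.nan, 15.0],
    "Outlet_Size": ["Small", None, "High"],
})

# One fill value per column: the mean for the numeric column,
# a placeholder string for the text column.
df = df.fillna({
    "Item_Weight": df["Item_Weight"].mean(),
    "Outlet_Size": "Missing",
})
```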

Answer by pink.slash

Pandas: How to replace NaN (nan) values with the average (mean), median or other statistics of one column

Say your DataFrame is df and it has one column called nr_items, that is, df['nr_items'].

If you want to replace the NaN values of your column df['nr_items'] with the mean of the column:

Use method .fillna():

mean_value = df['nr_items'].mean()
df['nr_item_ave'] = df['nr_items'].fillna(mean_value)

I created a new column in df called nr_item_ave to store the values with the NaNs replaced by the column's mean.

Be careful when using the mean: if the column has outliers, the median is usually the better choice.
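
A small sketch of why outliers matter here, using made-up values and a hypothetical extra column name nr_item_median: a single large value drags the mean far from the typical entries, while the median stays close to them.

```python
import numpy as np
import pandas as pd

# Hypothetical data: three typical values, one outlier, one missing value.
df = pd.DataFrame({"nr_items": [1.0, 2.0, 3.0, 1000.0, np.nan]})

mean_value = df["nr_items"].mean()      # pulled up by the outlier 1000.0
median_value = df["nr_items"].median()  # stays near the typical values

df["nr_item_ave"] = df["nr_items"].fillna(mean_value)
df["nr_item_median"] = df["nr_items"].fillna(median_value)

print(mean_value, median_value)  # 251.5 2.5
```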

Answer by Shrikant Chaudhari

Using the sklearn library's imputer class:

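import numpy as np
from sklearn.impute import SimpleImputer

# x is assumed to be a 2-D NumPy array; SimpleImputer works column-wise and takes no axis argument
missingvalues = SimpleImputer(missing_values=np.nan, strategy='mean')
missingvalues = missingvalues.fit(x[:, 1:3])
x[:, 1:3] = missingvalues.transform(x[:, 1:3])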

Note: in recent versions, the value of the missing_values parameter changed from NaN to np.nan.