Pandas: Imputing NaNs

Question

Asked by Zhubarb

I have an incomplete dataframe, incomplete_df, as below. I want to impute the missing amounts with the average amount of the corresponding id. If the average for that specific id is itself NaN (see id=4), I want to use the overall average.

Below are the example data and my highly inefficient solution:

import pandas as pd
import numpy as np

incomplete_df = pd.DataFrame({'id': [1,2,3,2,2,3,1,1,1,2,4],
                              'type': ['one', 'one', 'two', 'three', 'two', 'three', 'one', 'two', 'one', 'three','one'],
                              'amount': [345,928,np.nan,645,113,942,np.nan,539,np.nan,814,np.nan]
                              }, columns=['id','type','amount'])

# `means` is not defined in the post; it is assumed here to be the per-id mean
# amount, keeping only ids that have at least one observed amount.
means = incomplete_df.groupby('id')[['amount']].mean().dropna()

# Forrest Gump Solution
for idx in incomplete_df.index[np.isnan(incomplete_df.amount)]: # loop through all rows with amount = NaN
    cur_id = incomplete_df.loc[idx, 'id']
    if cur_id in means.index:
        incomplete_df.loc[idx, 'amount'] = means.loc[cur_id, 'amount'] # average amount of that specific id
    else:
        incomplete_df.loc[idx, 'amount'] = np.mean(means.amount) # average of the per-id averages

What is the fastest and the most pythonic/pandonic way to achieve this?

Answer 1

Answered by DSM

Disclaimer: I'm not really interested in the fastest solution but the most pandorable.

Here, I think that would be something like the following (df being the question's incomplete_df):

>>> df["amount"] = df["amount"].fillna(df.groupby("id")["amount"].transform("mean"))
>>> df["amount"] = df["amount"].fillna(df["amount"].mean())

which produces

>>> df
    id   type  amount
0    1    one   345.0
1    2    one   928.0
2    3    two   942.0
3    2  three   645.0
4    2    two   113.0
5    3  three   942.0
6    1    one   442.0
7    1    two   539.0
8    1    one   442.0
9    2  three   814.0
10   4    one   615.2

[11 rows x 3 columns]
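
To see where the 442.0 and 615.2 in the table come from, it may help to look at the intermediate values. The following is only a sketch built on the question's data; the names group_means, step1, fallback and result are mine, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Data copied from the question.
df = pd.DataFrame({
    "id": [1, 2, 3, 2, 2, 3, 1, 1, 1, 2, 4],
    "type": ["one", "one", "two", "three", "two", "three",
             "one", "two", "one", "three", "one"],
    "amount": [345, 928, np.nan, 645, 113, 942,
               np.nan, 539, np.nan, 814, np.nan],
})

# Per-id means. NaNs are skipped, so id 1 -> 442.0, id 2 -> 625.0,
# id 3 -> 942.0, and id 4 (no observed amount) -> NaN.
group_means = df.groupby("id")["amount"].mean()

# Step 1: transform("mean") broadcasts each id's mean back onto its rows,
# so fillna can align on the row index.
step1 = df["amount"].fillna(df.groupby("id")["amount"].transform("mean"))

# Step 2: anything still NaN (only the id=4 row here) gets the mean of the
# column as it stands after step 1, which is 615.2 for this data.
fallback = step1.mean()
result = step1.fillna(fallback)
```

Note that the fallback is taken over the already-filled column, so the imputed 442.0 and 942.0 values count towards it.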

There are lots of obvious tweaks depending upon exactly how you want the chained imputation process to go.
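
As one example of such a tweak, here is a rough sketch that computes the fallback from the originally observed amounts only, and returns a new frame instead of modifying the input. The helper name impute_by_group and its parameters are hypothetical, chosen just for illustration.

```python
import numpy as np
import pandas as pd


def impute_by_group(frame, key, value):
    """Return a copy of `frame` with NaNs in `value` filled by the per-`key`
    mean, falling back to the mean of the originally observed values."""
    out = frame.copy()
    observed_mean = out[value].mean()  # taken before any filling happens
    group_mean = out.groupby(key)[value].transform("mean")
    out[value] = out[value].fillna(group_mean).fillna(observed_mean)
    return out


# Data copied from the question.
df = pd.DataFrame({
    "id": [1, 2, 3, 2, 2, 3, 1, 1, 1, 2, 4],
    "type": ["one", "one", "two", "three", "two", "three",
             "one", "two", "one", "three", "one"],
    "amount": [345, 928, np.nan, 645, 113, 942,
               np.nan, 539, np.nan, 814, np.nan],
})

filled = impute_by_group(df, "id", "amount")
```

With this variant the id=4 row gets 618.0 (the mean of the seven observed amounts) rather than 615.2, and df itself is left untouched. Which fallback is preferable depends on what "overall average" should mean for your data.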