Python pandas DataFrame: interpolating missing data

Disclaimer: this page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original: http://stackoverflow.com/questions/34693079/

Time: 2020-09-14 00:29:13  Source: igfitidea

Python pandas dataframe interpolate missing data

Tags: python, pandas, interpolation

Asked by Unnikrishnan

I have a data set like the following. We only have data for the last day of each month, and I am trying to interpolate the rest. Is this the right way to do it?

Date  Australia China
2011-01-01  NaN   NaN
2011-01-02  NaN   NaN
-           -     -
-           -     -
2011-01-31  4.75  5.81
2011-02-01  NaN   NaN
2011-02-02  NaN   NaN
-           -     -
-           -     -
2011-02-28  4.75  5.81
2011-03-01  NaN   NaN
2011-03-02  NaN   NaN
-           -     -
-           -     -
2011-03-31  4.75  6.06
2011-04-01  NaN   NaN
2011-04-02  NaN   NaN
-           -     -
-           -     -
2011-04-30  4.75  6.06

To interpolate this DataFrame and fill in the missing NaN values, I am using the following code:

import pandas as pd
df = pd.read_csv("data.csv", index_col="Date")
df.index = pd.DatetimeIndex(df.index)
df.interpolate(method='linear', axis=0).ffill().bfill()

But I am getting the error "TypeError: Cannot interpolate with all NaNs."

What might be wrong here, and how can I fix it?

Thanks.

Answered by jezrael

You can try converting the DataFrame to float with astype:

import pandas as pd

df = pd.read_csv("data.csv", index_col=['Date'], parse_dates=['Date'])

print(df)

            Australia  China
Date                        
2011-01-31       4.75   5.81
2011-02-28       4.75   5.81
2011-03-31       4.75   6.06
2011-04-30       4.75   6.06

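# reindex to daily frequency; days without data get the string "NaN"
df = df.reindex(pd.date_range("2011-01-01", "2011-10-31"), fill_value="NaN")

# convert to float so that interpolate() works on numeric NaN values
df = df.astype(float)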

df = df.interpolate(method='linear', axis=0).ffill().bfill()
print(df)

            Australia  China
2011-01-01       4.75   5.81
2011-01-02       4.75   5.81
2011-01-03       4.75   5.81
2011-01-04       4.75   5.81
2011-01-05       4.75   5.81
2011-01-06       4.75   5.81
2011-01-07       4.75   5.81
2011-01-08       4.75   5.81
2011-01-09       4.75   5.81
2011-01-10       4.75   5.81
2011-01-11       4.75   5.81
2011-01-12       4.75   5.81
2011-01-13       4.75   5.81
2011-01-14       4.75   5.81
2011-01-15       4.75   5.81
2011-01-16       4.75   5.81
2011-01-17       4.75   5.81
2011-01-18       4.75   5.81
2011-01-19       4.75   5.81
2011-01-20       4.75   5.81
2011-01-21       4.75   5.81
2011-01-22       4.75   5.81
2011-01-23       4.75   5.81
2011-01-24       4.75   5.81
2011-01-25       4.75   5.81
2011-01-26       4.75   5.81
2011-01-27       4.75   5.81
2011-01-28       4.75   5.81
2011-01-29       4.75   5.81
2011-01-30       4.75   5.81
...               ...    ...
2011-10-02       4.75   6.06
2011-10-03       4.75   6.06
2011-10-04       4.75   6.06
2011-10-05       4.75   6.06
2011-10-06       4.75   6.06
2011-10-07       4.75   6.06
2011-10-08       4.75   6.06
2011-10-09       4.75   6.06
2011-10-10       4.75   6.06
2011-10-11       4.75   6.06
2011-10-12       4.75   6.06
2011-10-13       4.75   6.06
2011-10-14       4.75   6.06
2011-10-15       4.75   6.06
2011-10-16       4.75   6.06
2011-10-17       4.75   6.06
2011-10-18       4.75   6.06
2011-10-19       4.75   6.06
2011-10-20       4.75   6.06
2011-10-21       4.75   6.06
2011-10-22       4.75   6.06
2011-10-23       4.75   6.06
2011-10-24       4.75   6.06
2011-10-25       4.75   6.06
2011-10-26       4.75   6.06
2011-10-27       4.75   6.06
2011-10-28       4.75   6.06
2011-10-29       4.75   6.06
2011-10-30       4.75   6.06
2011-10-31       4.75   6.06

[304 rows x 2 columns]
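
The conversion step above can be looked at in isolation. The following is a minimal sketch, assuming the columns hold a mix of numbers and the string "NaN" (the tiny frame and the names raw and converted are hypothetical, made up for illustration): astype(float) turns such columns into float64, with real NaN values in place of the strings.

```python
import pandas as pd

# Hypothetical miniature of the problem: the string "NaN" mixed with numbers.
raw = pd.DataFrame({
    "Australia": ["NaN", 4.75],
    "China": ["NaN", 5.81],
})
print(raw.dtypes)  # inspect the column dtypes before converting

# astype(float) parses the string "NaN" into a real floating-point NaN.
converted = raw.astype(float)
print(converted.dtypes)
print(converted.isna().sum())
```

Checking df.dtypes before calling interpolate() is a quick way to see whether a conversion like this is needed for your own file.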

You can also omit ffill(), because after interpolation NaN values remain only in the first rows of the DataFrame:

df = df.interpolate(method='linear', axis=0).ffill().bfill()

to:

df = df.interpolate(method='linear', axis=0).bfill()
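
To see why bfill() alone is enough, here is a small self-contained sketch. It assumes the four month-end rows shown earlier and builds them in memory instead of reading data.csv; the date range and the names daily, interpolated, leading_nans and filled are illustrative choices, not part of the original answer.

```python
import pandas as pd

# Month-end observations, built in memory as a stand-in for data.csv.
df = pd.DataFrame(
    {"Australia": [4.75, 4.75, 4.75, 4.75], "China": [5.81, 5.81, 6.06, 6.06]},
    index=pd.to_datetime(["2011-01-31", "2011-02-28", "2011-03-31", "2011-04-30"]),
)

# One row per calendar day; days without data become NaN.
daily = df.reindex(pd.date_range("2011-01-01", "2011-05-31")).astype(float)

# Linear interpolation fills the gaps between observations and, by default,
# carries the last observation forward over the trailing NaN rows.
interpolated = daily.interpolate(method="linear", axis=0)

# Only the rows before the first observation are still NaN at this point.
leading_nans = int(interpolated["China"].isna().sum())
print(leading_nans)

# bfill() copies the first observation backwards over those leading rows.
filled = interpolated.bfill()
print(filled.isna().sum().sum())
```

The rows after the last observation are already filled by interpolate(), which is why only the leading rows need bfill().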

Answered by station

You can try dropping the NaN rows from the data set before interpolating.

import pandas as pd
df = pd.read_csv("data.csv", index_col="Date")
df = df.dropna()
df.index = pd.DatetimeIndex(df.index)
df.interpolate(method='linear', axis=0).ffill().bfill()
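
Note that dropna() leaves only the rows that already have values, so the daily rows have to be put back before interpolate() has anything to fill. The sketch below is one possible way to do that, not part of the answer above: it assumes a small in-memory frame in place of data.csv and uses asfreq("D") to restore the daily index; the names observed, daily and interpolated are hypothetical.

```python
import numpy as np
import pandas as pd

# Daily index with values only on the first and last day (stand-in for data.csv).
index = pd.date_range("2011-02-28", "2011-03-31")
df = pd.DataFrame({"Australia": np.nan, "China": np.nan}, index=index)
df.loc["2011-02-28"] = [4.75, 5.81]
df.loc["2011-03-31"] = [4.75, 6.06]

# dropna() keeps only the rows that contain data.
observed = df.dropna()
print(len(observed))

# Assumption: restore one row per day between the first and last observation.
daily = observed.asfreq("D")

# Linear interpolation between the two remaining observations.
interpolated = daily.interpolate(method="linear", axis=0)
print(interpolated.loc["2011-03-15"])
```

Without a step like asfreq("D"), the result would contain only the month-end rows that survived dropna().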