Python Pandas 数据框插入缺失数据
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/34693079/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
Python pandas dataframe interpolate missing data
提问by Unnikrishnan
I have a data set like the following. We only have data for the last day of a month I am trying to interpolate rest of it, is it the right way of doing it?
我有一个如下所示的数据集。我们只有一个月最后一天的数据,我正在尝试插入其余的数据,这是正确的做法吗?
Date Australia China
2011-01-01 NaN NaN
2011-01-02 NaN NaN
- - -
- - -
2011-01-31 4.75 5.81
2011-02-01 NaN NaN
2011-02-02 NaN NaN
- - -
- - -
2011-02-28 4.75 5.81
2011-03-01 NaN NaN
2011-03-02 NaN NaN
- - -
- - -
2011-03-31 4.75 6.06
2011-04-01 NaN NaN
2011-04-02 NaN NaN
- - -
- - -
2011-04-30 4.75 6.06
For interpolate this dataframe to find missing NaN values I am using the following code
为了插入此数据框以查找丢失的 NaN 值,我使用以下代码
import pandas as pd
df = pd.read_csv("data.csv", index_col="Date")
df.index = pd.DatetimeIndex(df.index)
df.interpolate(method='linear', axis=0).ffill().bfill()
But I am getting an error "TypeError: Cannot interpolate with all NaNs."
但我收到一个错误“类型错误:无法插入所有 NaN”。
What might be wrong here, how I can fix this?
这里可能有什么问题,我该如何解决?
Thanks.
谢谢。
回答by jezrael
You can try convert dataframe
to float
by astype
:
您可以尝试转换dataframe
到float
通过astype
:
import pandas as pd
df = pd.read_csv("data.csv", index_col=['Date'], parse_dates=['Date'])
print df
Australia China
Date
2011-01-31 4.75 5.81
2011-02-28 4.75 5.81
2011-03-31 4.75 6.06
2011-04-30 4.75 6.06
df = df.reindex(pd.date_range("2011-01-01", "2011-10-31"), fill_value="NaN")
#convert to float
df = df.astype(float)
df = df.interpolate(method='linear', axis=0).ffill().bfill()
print df
Australia China
2011-01-01 4.75 5.81
2011-01-02 4.75 5.81
2011-01-03 4.75 5.81
2011-01-04 4.75 5.81
2011-01-05 4.75 5.81
2011-01-06 4.75 5.81
2011-01-07 4.75 5.81
2011-01-08 4.75 5.81
2011-01-09 4.75 5.81
2011-01-10 4.75 5.81
2011-01-11 4.75 5.81
2011-01-12 4.75 5.81
2011-01-13 4.75 5.81
2011-01-14 4.75 5.81
2011-01-15 4.75 5.81
2011-01-16 4.75 5.81
2011-01-17 4.75 5.81
2011-01-18 4.75 5.81
2011-01-19 4.75 5.81
2011-01-20 4.75 5.81
2011-01-21 4.75 5.81
2011-01-22 4.75 5.81
2011-01-23 4.75 5.81
2011-01-24 4.75 5.81
2011-01-25 4.75 5.81
2011-01-26 4.75 5.81
2011-01-27 4.75 5.81
2011-01-28 4.75 5.81
2011-01-29 4.75 5.81
2011-01-30 4.75 5.81
... ... ...
2011-10-02 4.75 6.06
2011-10-03 4.75 6.06
2011-10-04 4.75 6.06
2011-10-05 4.75 6.06
2011-10-06 4.75 6.06
2011-10-07 4.75 6.06
2011-10-08 4.75 6.06
2011-10-09 4.75 6.06
2011-10-10 4.75 6.06
2011-10-11 4.75 6.06
2011-10-12 4.75 6.06
2011-10-13 4.75 6.06
2011-10-14 4.75 6.06
2011-10-15 4.75 6.06
2011-10-16 4.75 6.06
2011-10-17 4.75 6.06
2011-10-18 4.75 6.06
2011-10-19 4.75 6.06
2011-10-20 4.75 6.06
2011-10-21 4.75 6.06
2011-10-22 4.75 6.06
2011-10-23 4.75 6.06
2011-10-24 4.75 6.06
2011-10-25 4.75 6.06
2011-10-26 4.75 6.06
2011-10-27 4.75 6.06
2011-10-28 4.75 6.06
2011-10-29 4.75 6.06
2011-10-30 4.75 6.06
2011-10-31 4.75 6.06
[304 rows x 2 columns]
And you can omit ffill()
, because NaN
are only in first rows of dataframe
:
你可以省略ffill()
,因为NaN
只在第一行dataframe
:
df = df.interpolate(method='linear', axis=0).ffill().bfill()
to:
到:
df = df.interpolate(method='linear', axis=0).bfill()
回答by station
You can try dropping NaN from the dataset before interpolating.
您可以尝试在插值之前从数据集中删除 NaN。
import pandas as pd
df = pd.read_csv("data.csv", index_col="Date")
df = df.dropna()
df.index = pd.DatetimeIndex(df.index)
df.interpolate(method='linear', axis=0).ffill().bfill()