Notice: this page is a Chinese-English side-by-side translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/32241692/

Time: 2020-09-13 23:49:18  Source: igfitidea

Fill missing timeseries data using pandas or numpy

python list numpy dictionary pandas

Asked by Amistad

I have a list of dictionaries that looks like this:

L=[
{
"timeline": "2014-10", 
"total_prescriptions": 17
}, 
{
"timeline": "2014-11", 
"total_prescriptions": 14
}, 
{
"timeline": "2014-12", 
"total_prescriptions": 8
},
{
"timeline": "2015-1", 
"total_prescriptions": 4
}, 
{
"timeline": "2015-3", 
"total_prescriptions": 10
}, 
{
"timeline": "2015-4", 
"total_prescriptions": 3
} 
]

This is essentially the result of a SQL query which, given a start date and an end date, returns the total prescription count for each month from the start date through the end month. However, it completely skips any month where the prescription count is 0 (Feb 2015 here). Is it possible to use pandas or numpy to alter this list so that it adds an entry for the missing month with 0 as the total prescriptions, as follows:

[
{
"timeline": "2014-10", 
"total_prescriptions": 17
}, 
{
"timeline": "2014-11", 
"total_prescriptions": 14
}, 
{
"timeline": "2014-12", 
"total_prescriptions": 8
}, 
{
"timeline": "2015-1", 
"total_prescriptions": 4
}, 
{
"timeline": "2015-2",   # 2015-2 to be inserted for missing month
"total_prescriptions": 0 # 0 to be inserted for total prescription
}, 
{
"timeline": "2015-3", 
"total_prescriptions": 10
}, 
{
"timeline": "2015-4", 
"total_prescriptions": 3
} 
]

Answered by maxymoo

What you are describing is called "resampling" in pandas. First convert your time strings to datetime values and set them as your index:

import pandas as pd

df = pd.DataFrame(L)
df.index = pd.to_datetime(df.timeline, format='%Y-%m')
df
           timeline  total_prescriptions
timeline                                
2014-10-01  2014-10                   17
2014-11-01  2014-11                   14
2014-12-01  2014-12                    8
2015-01-01   2015-1                    4
2015-03-01   2015-3                   10
2015-04-01   2015-4                    3
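
Before resampling, it can help to confirm which months are actually absent. The following is a minimal sketch, not part of the original answer: it uses a shortened copy of the question's list, parses the strings with the standard library's datetime.strptime (an assumption made here for illustration), and compares the observed months against a full month-start range. The names observed, expected, and missing are hypothetical.

```python
from datetime import datetime

import pandas as pd

# Shortened copy of the question's list; note the unpadded months.
L = [
    {"timeline": "2014-12", "total_prescriptions": 8},
    {"timeline": "2015-1", "total_prescriptions": 4},
    {"timeline": "2015-3", "total_prescriptions": 10},
    {"timeline": "2015-4", "total_prescriptions": 3},
]

# datetime.strptime accepts both "2014-12" and "2015-1" with "%Y-%m".
observed = pd.DatetimeIndex(
    [datetime.strptime(d["timeline"], "%Y-%m") for d in L]
)

# Every month start between the first and last observed month.
expected = pd.date_range(observed.min(), observed.max(), freq="MS")

# Months that are expected but absent from the data.
missing = expected.difference(observed)
print(list(missing.strftime("%Y-%m")))
```

If the printed list is empty, the data already has one row per month and no filling is needed.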

Then you can add the missing months with resample('MS') (MS is the "month start" frequency alias), and use fillna(0) to convert the null values to zero, as you require.

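# resample() returns a lazy Resampler in current pandas; asfreq() materializes the monthly rows
df = df[['total_prescriptions']].resample('MS').asfreq().fillna(0)
df
            total_prescriptions
timeline                       
2014-10-01                 17.0
2014-11-01                 14.0
2014-12-01                  8.0
2015-01-01                  4.0
2015-02-01                  0.0
2015-03-01                 10.0
2015-04-01                  3.0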
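
As an alternative to resampling, the same gap can be filled by reindexing against an explicit monthly range. This is a sketch of a swapped-in technique, not the answerer's method; the names counts, full_range, and filled are made up for illustration, and the data is typed in with month-start dates directly. One possible advantage is that fill_value=0 keeps the counts as integers instead of passing through NaN.

```python
import pandas as pd

# Same counts as in the question, indexed by the first day of each month.
counts = pd.Series(
    [17, 14, 8, 4, 10, 3],
    index=pd.to_datetime(
        ["2014-10-01", "2014-11-01", "2014-12-01",
         "2015-01-01", "2015-03-01", "2015-04-01"]
    ),
    name="total_prescriptions",
)

# Every month start between the first and last observed month.
full_range = pd.date_range(counts.index.min(), counts.index.max(), freq="MS")

# reindex adds the missing month; fill_value=0 sets its count directly.
filled = counts.reindex(full_range, fill_value=0)
print(filled)
```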

To convert back to your original format, convert the datetime index back to strings with strftime (older pandas versions used to_native_types for this, but it has since been deprecated), and then export using to_dict('records'):

df['timeline'] = df.index.strftime('%Y-%m-%d')
df.to_dict('records')
[{'timeline': '2014-10-01', 'total_prescriptions': 17.0},
 {'timeline': '2014-11-01', 'total_prescriptions': 14.0},
 {'timeline': '2014-12-01', 'total_prescriptions': 8.0},
 {'timeline': '2015-01-01', 'total_prescriptions': 4.0},
 {'timeline': '2015-02-01', 'total_prescriptions': 0.0},
 {'timeline': '2015-03-01', 'total_prescriptions': 10.0},
 {'timeline': '2015-04-01', 'total_prescriptions': 3.0}]
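
The records above use zero-padded dates and float counts, whereas the question's list uses strings like "2015-2" and integer counts. Below is a hedged sketch of one way to wrap the steps into a single round trip that restores that format. The function name fill_missing_months is hypothetical, and the sketch assumes at most one entry per month, listed in chronological order.

```python
from datetime import datetime

import pandas as pd


def fill_missing_months(records):
    """Return records with a zero-count entry for every skipped month."""
    # datetime.strptime accepts unpadded months such as "2015-1".
    index = pd.DatetimeIndex(
        [datetime.strptime(r["timeline"], "%Y-%m") for r in records]
    )
    counts = pd.Series([r["total_prescriptions"] for r in records], index=index)

    # Insert the missing month starts as NaN, then replace NaN with 0.
    counts = counts.resample("MS").asfreq().fillna(0).astype(int)

    # Rebuild the "YYYY-M" strings used in the question (no zero padding).
    return [
        {"timeline": f"{ts.year}-{ts.month}", "total_prescriptions": int(n)}
        for ts, n in counts.items()
    ]


sample = [
    {"timeline": "2015-1", "total_prescriptions": 4},
    {"timeline": "2015-3", "total_prescriptions": 10},
]
result = fill_missing_months(sample)
print(result)
```

The explicit int(n) conversion keeps the output free of numpy integer types, which matters if the list is later serialized to JSON.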