Original question: http://stackoverflow.com/questions/38597253/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Pandas: groupby forward fill with datetime index
Asked by sapo_cosmico
I have a dataset that has two columns: company, and value.
It has a datetime index, which contains duplicates (on the same day, different companies have different values). The values have missing data, so I want to forward fill the missing data with the previous datapoint from the same company.
However, I can't seem to find a good way to do this without running into odd groupby errors, suggesting that I'm doing something wrong.
Toy data:
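import pandas as pd

# Wide frame: one column per company, one row per year
a = pd.DataFrame({'a': [1, 2, None], 'b': [12, None, 14]})
a.index = pd.DatetimeIndex(['2010', '2011', '2012'])
# Reshape to long form: a date index with duplicates, plus company and value columns
a = a.unstack()
a = a.reset_index().set_index('level_1')
a.columns = ['company', 'value']
a.sort_index(inplace=True)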
Attempted solutions (these didn't work; they raise ValueError: cannot reindex from a duplicate axis):
a.groupby('company').ffill()
a.groupby('company')['value'].ffill()
a.groupby('company').fillna(method='ffill')
Hacky solution (that delivers the desired result, but is obviously just an ugly workaround):
a['value'] = a.reset_index().groupby(
    'company').fillna(method='ffill')['value'].values
There is probably a simple and elegant way to do this. How is it done in pandas?
Answer by Psidom
One way is to use the transform function to fill the value column after grouping by company:
import pandas as pd
a['value'] = a.groupby('company')['value'].transform(lambda v: v.ffill())
a
#            company  value
# level_1
# 2010-01-01       a    1.0
# 2010-01-01       b   12.0
# 2011-01-01       a    2.0
# 2011-01-01       b   12.0
# 2012-01-01       a    2.0
# 2012-01-01       b   14.0
To compare, the original data frame looks like:
#            company  value
# level_1
# 2010-01-01       a    1.0
# 2010-01-01       b   12.0
# 2011-01-01       a    2.0
# 2011-01-01       b    NaN
# 2012-01-01       a    NaN
# 2012-01-01       b   14.0
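For anyone who wants to check the transform approach end to end, here is a minimal self-contained sketch. It builds the long-format toy data directly instead of going through unstack; the names df, filled and the index name 'date' are illustrative choices, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Long-format toy data: each date appears once per company.
idx = pd.DatetimeIndex(
    ['2010-01-01', '2010-01-01', '2011-01-01',
     '2011-01-01', '2012-01-01', '2012-01-01'], name='date')
df = pd.DataFrame(
    {'company': list('ababab'),
     'value': [1.0, 12.0, 2.0, np.nan, np.nan, 14.0]},
    index=idx)

# transform returns a Series with the same index as df, in the same row
# order, so it can be assigned straight back to the column.
filled = df.copy()
filled['value'] = df.groupby('company')['value'].transform(lambda v: v.ffill())

print(filled)
```

Each missing value is filled from the most recent earlier row of the same company, so b's gap in 2011 takes b's 2010 value rather than a's.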
Answer by root
You can add 'company' to the index, making it unique, and do a simple ffill via groupby:
a = a.set_index('company', append=True)
a = a.groupby(level=1).ffill()
From here, you can use reset_index to revert the index back to just the date, if necessary. I'd recommend keeping 'company' as part of the index (or just adding it to the index to begin with), so your index remains unique:
a = a.reset_index(level=1)
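The index-based approach can be sketched as a standalone script along the following lines. It rebuilds the toy data directly and refers to the index level by name rather than by position; df, indexed, filled, flat and the index name 'date' are assumptions made for this illustration.

```python
import numpy as np
import pandas as pd

idx = pd.DatetimeIndex(
    ['2010-01-01', '2010-01-01', '2011-01-01',
     '2011-01-01', '2012-01-01', '2012-01-01'], name='date')
df = pd.DataFrame(
    {'company': list('ababab'),
     'value': [1.0, 12.0, 2.0, np.nan, np.nan, 14.0]},
    index=idx)

# Appending 'company' makes every (date, company) pair unique.
indexed = df.set_index('company', append=True)
print(indexed.index.is_unique)

# Forward fill within each company, grouping on the index level by name.
filled = indexed.groupby(level='company').ffill()

# Optional: move 'company' back to a column, leaving only the date index.
flat = filled.reset_index(level='company')
print(flat)
```

Grouping with level='company' is equivalent to level=1 here; the name is only a little more robust if the index levels are ever reordered.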
Answer by piRSquared
I like to use stacking and unstacking. In this case, it requires that I append the index with 'company'.
a.set_index('company', append=True).unstack().ffill() \
.stack().reset_index('company')
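Broken into steps, the unstack/ffill/stack chain looks roughly like this. It is a sketch on rebuilt toy data; the intermediate names wide, wide_filled and result, and the index name 'date', are illustrative only.

```python
import numpy as np
import pandas as pd

idx = pd.DatetimeIndex(
    ['2010-01-01', '2010-01-01', '2011-01-01',
     '2011-01-01', '2012-01-01', '2012-01-01'], name='date')
df = pd.DataFrame(
    {'company': list('ababab'),
     'value': [1.0, 12.0, 2.0, np.nan, np.nan, 14.0]},
    index=idx)

# Pivot to wide form: one row per date, one column per company.
wide = df.set_index('company', append=True).unstack()

# In wide form a plain ffill runs down each company's own column.
wide_filled = wide.ffill()

# Back to long form, with 'company' as a regular column again.
result = wide_filled.stack().reset_index('company')
print(result)
```

unstack needs each (date, company) pair to be unique, which holds here. If a company starts with missing values, those rows stay NaN after the fill and may be dropped by stack, depending on the pandas version.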
Timing
Conclusion: @Psidom's solution works best under both scenarios.
toy data
bigger toy
import numpy as np
import pandas as pd

np.random.seed([3,1415])
n = 10000
a = pd.DataFrame(np.random.randn(n, 10),
                 pd.date_range('2014-01-01', periods=n, freq='H', name='Time'),
                 pd.Index(list('abcdefghij'), name='company'))
# Blank out roughly 40% of the values at random
a *= np.random.choice((1, np.nan), (n, 10), p=(.6, .4))
# Reshape to long form, keeping the NaN rows
a = a.stack(dropna=False).rename('value').reset_index('company')
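One rough way to rerun the comparison yourself is a small timeit harness like the one below. It is only a sketch: the data is built directly in long form with a smaller n, daily rather than hourly timestamps, and an arbitrary seed, and the function names are made up for this example. Absolute numbers will depend on your machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)   # arbitrary seed, chosen for this sketch
n = 1000                         # smaller than the n = 10000 used above
companies = list('abcdefghij')

# Random values with roughly 40% of the entries set to NaN.
values = rng.standard_normal((n, len(companies)))
values[rng.random(values.shape) < 0.4] = np.nan

# Long form: every timestamp repeats once per company.
times = pd.date_range('2014-01-01', periods=n, freq='D', name='Time')
long_df = pd.DataFrame(
    {'company': np.tile(companies, n), 'value': values.ravel()},
    index=times.repeat(len(companies)))


def via_transform(df):
    return df.groupby('company')['value'].transform(lambda v: v.ffill())


def via_index_level(df):
    return df.set_index('company', append=True).groupby(level='company').ffill()


def via_unstack(df):
    return (df.set_index('company', append=True).unstack().ffill()
              .stack().reset_index('company'))


candidates = {'transform': via_transform,
              'index_level': via_index_level,
              'unstack': via_unstack}

# Total wall-clock time for three calls of each approach.
timings = {name: timeit.timeit(lambda f=func: f(long_df), number=3)
           for name, func in candidates.items()}

for name, seconds in timings.items():
    print(f'{name}: {seconds:.4f} s for 3 runs')
```

Raising n and number gives steadier figures at the cost of a longer run.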