Pandas: groupby forward fill with a datetime index

Question

Asked by sapo_cosmico

I have a dataset that has two columns: company, and value.
It has a datetime index, which contains duplicates (on the same day, different companies have different values). The values have missing data, so I want to forward fill the missing data with the previous datapoint from the same company.

However, I can't seem to find a good way to do this without running into odd groupby errors, suggesting that I'm doing something wrong.

Toy data:

import pandas as pd

a = pd.DataFrame({'a': [1, 2, None], 'b': [12, None, 14]})
a.index = pd.DatetimeIndex(['2010', '2011', '2012'])
a = a.unstack()  # long format: one row per (company, date)
a = a.reset_index().set_index('level_1')
a.columns = ['company', 'value']
a.sort_index(inplace=True)
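
To make the problem concrete, here is a small sketch that rebuilds the same toy frame and checks the two properties described above: the date index repeats once per company, and some values are missing. The names `wide`, `toy`, and the three summary variables are only illustrative.

```python
import pandas as pd

# Rebuild the toy frame from the question in long format.
wide = pd.DataFrame({'a': [1, 2, None], 'b': [12, None, 14]})
wide.index = pd.DatetimeIndex(['2010', '2011', '2012'])
toy = wide.unstack().reset_index().set_index('level_1')
toy.columns = ['company', 'value']
toy = toy.sort_index()

# Each date appears once per company, so the index is not unique.
index_is_unique = toy.index.is_unique
duplicate_count = int(toy.index.duplicated().sum())
missing_count = int(toy['value'].isna().sum())
print(index_is_unique, duplicate_count, missing_count)
```

Any operation that has to align results back onto this index by label can trip over the repeated dates, which is consistent with the "duplicate axis" error quoted below.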

Attempted solutions (didn't work: ValueError: cannot reindex from a duplicate axis):

a.groupby('company').ffill() 
a.groupby('company')['value'].ffill() 
a.groupby('company').fillna(method='ffill')

Hacky solution (that delivers the desired result, but is obviously just an ugly workaround):

a['value'] = a.reset_index().groupby(
    'company').fillna(method='ffill')['value'].values

There is probably a simple and elegant way to do this. How is it done in Pandas?

Answer 1

Answered by Psidom

One way is to use the transform function to fill the value column after grouping by company:

import pandas as pd
a['value'] = a.groupby('company')['value'].transform(lambda v: v.ffill())

a
#          company  value
#level_1        
#2010-01-01      a    1.0
#2010-01-01      b   12.0
#2011-01-01      a    2.0
#2011-01-01      b   12.0
#2012-01-01      a    2.0
#2012-01-01      b   14.0

To compare, the original data frame looks like:

#            company    value
#level_1        
#2010-01-01        a      1.0
#2010-01-01        b     12.0
#2011-01-01        a      2.0
#2011-01-01        b      NaN
#2012-01-01        a      NaN
#2012-01-01        b     14.0
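
For anyone who wants to run this end to end, here is a self-contained sketch of the same transform approach. The frame is built directly in long format rather than through unstack, and the names `toy`, `filled`, and `filled_values` are just for illustration.

```python
import pandas as pd

# Long-format toy data: the date index repeats once per company.
dates = pd.to_datetime(['2010-01-01', '2010-01-01', '2011-01-01',
                        '2011-01-01', '2012-01-01', '2012-01-01'])
toy = pd.DataFrame({'company': ['a', 'b', 'a', 'b', 'a', 'b'],
                    'value': [1.0, 12.0, 2.0, None, None, 14.0]},
                   index=dates)

# Forward fill within each company; transform returns a result
# in the same row order as the input.
filled = toy.copy()
filled['value'] = toy.groupby('company')['value'].transform(lambda v: v.ffill())
filled_values = filled['value'].tolist()
print(filled)
```

Working on a copy keeps the original frame around for comparison; assigning back into `a['value']` as in the answer works the same way.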

Answer 2

Answered by root

You can add 'company' to the index, making it unique, and do a simple ffill via groupby:

a = a.set_index('company', append=True)
a = a.groupby(level=1).ffill()

From here, you can use reset_index to revert the index back to just the date, if necessary. I'd recommend keeping 'company' as part of the index (or just adding it to the index to begin with), so your index remains unique:

a = a.reset_index(level=1)
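
Here is a runnable sketch of those steps on a hand-built copy of the toy data. The intermediate names `indexed`, `filled`, and `restored` are only there to make each stage visible; they are not part of the answer above.

```python
import pandas as pd

dates = pd.to_datetime(['2010-01-01', '2010-01-01', '2011-01-01',
                        '2011-01-01', '2012-01-01', '2012-01-01'])
toy = pd.DataFrame({'company': ['a', 'b', 'a', 'b', 'a', 'b'],
                    'value': [1.0, 12.0, 2.0, None, None, 14.0]},
                   index=dates)

# Step 1: append 'company' to the index so (date, company) is unique.
indexed = toy.set_index('company', append=True)

# Step 2: forward fill within each company (index level 1).
filled = indexed.groupby(level=1).ffill()

# Step 3 (optional): move 'company' back out to a regular column.
restored = filled.reset_index(level=1)
print(restored)
```

Grouping with `level='company'` instead of `level=1` should be equivalent here and reads a little more clearly.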

Answer 3

Answered by piRSquared

I like to use stacking and unstacking. In this case, it requires that I append 'company' to the index.

a.set_index('company', append=True).unstack().ffill() \
                                   .stack().reset_index('company')
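
A self-contained sketch of the same chain, split into steps so the wide intermediate is visible. The names `wide`, `wide_filled`, and `restored` are illustrative only. One thing to watch, stated as a caveat rather than a guarantee: how stack treats rows that are still NaN (for example a company whose first value is missing) has differed between pandas versions, so check the row count on real data.

```python
import pandas as pd

dates = pd.to_datetime(['2010-01-01', '2010-01-01', '2011-01-01',
                        '2011-01-01', '2012-01-01', '2012-01-01'])
toy = pd.DataFrame({'company': ['a', 'b', 'a', 'b', 'a', 'b'],
                    'value': [1.0, 12.0, 2.0, None, None, 14.0]},
                   index=dates)

# Pivot companies into columns: one row per date, one column per company.
wide = toy.set_index('company', append=True).unstack()

# In wide form a plain ffill runs down each company's column.
wide_filled = wide.ffill()

# Back to long form, with 'company' as a regular column again.
restored = wide_filled.stack().reset_index('company')
print(restored)
```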

Timing

Conclusion: @Psidom's solution works best under both scenarios.

toy data

bigger toy

import numpy as np
import pandas as pd

np.random.seed([3,1415])
n = 10000
a = pd.DataFrame(np.random.randn(n, 10),
                 pd.date_range('2014-01-01', periods=n, freq='H', name='Time'),
                 pd.Index(list('abcdefghij'), name='company'))

# Blank out roughly 40% of the values.
a *= np.random.choice((1, np.nan), (n, 10), p=(.6, .4))

# Long format: one row per (Time, company), keeping the NaN rows.
a = a.stack(dropna=False).rename('value').reset_index('company')
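
If you want to time the approaches yourself, here is a rough harness. It is a sketch, not a reproduction of the timings above: the seed, the smaller size, and the helper names `fill_transform` and `fill_multiindex` are my own assumptions, and the data is built directly in long format to avoid version-specific stack and frequency-alias behaviour. Results will vary with machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, not the one used above
n, companies = 1000, list('abcdefghij')
times = pd.date_range('2014-01-01', periods=n,
                      freq=pd.Timedelta(hours=1), name='Time')
values = rng.standard_normal((n, len(companies)))
values[rng.random((n, len(companies))) < 0.4] = np.nan  # about 40% missing

# One row per (Time, company); ravel() is row-major, matching repeat/tile.
big = pd.DataFrame({'company': np.tile(companies, n), 'value': values.ravel()},
                   index=times.repeat(len(companies)))


def fill_transform(df):
    """Forward fill per company with groupby + transform."""
    out = df.copy()
    out['value'] = df.groupby('company')['value'].transform(lambda v: v.ffill())
    return out


def fill_multiindex(df):
    """Forward fill per company after adding 'company' to the index."""
    indexed = df.set_index('company', append=True)
    return indexed.groupby(level='company').ffill().reset_index('company')


# Sanity check: both approaches should produce the same values.
same = np.allclose(fill_transform(big)['value'].to_numpy(),
                   fill_multiindex(big)['value'].to_numpy(), equal_nan=True)

for name, fn in [('transform', fill_transform), ('multiindex', fill_multiindex)]:
    seconds = min(timeit.repeat(lambda: fn(big), number=3, repeat=3))
    print(f'{name}: {seconds / 3:.4f} s per call')
```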
