Pandas: filling missing dates and values within groups

Question

Asked by broccoli

I have a data frame that looks like the following:

x = pd.DataFrame({'user': ['a','a','b','b'], 'dt': ['2016-01-01','2016-01-02', '2016-01-05','2016-01-06'], 'val': [1,33,2,1]})

What I would like to do is find the minimum and maximum date in the dt column and expand that column so that every user has a row for every date in that range, while filling in 0 for the val column wherever a row is added. The desired output is:

            dt user  val
0   2016-01-01    a    1
1   2016-01-02    a   33
2   2016-01-03    a    0
3   2016-01-04    a    0
4   2016-01-05    a    0
5   2016-01-06    a    0
6   2016-01-01    b    0
7   2016-01-02    b    0
8   2016-01-03    b    0
9   2016-01-04    b    0
10  2016-01-05    b    2
11  2016-01-06    b    1

I've tried the solutions mentioned here and here, but they aren't what I'm after. Any pointers are much appreciated.

Answer 1

Answered by ayhan

Initial DataFrame:

            dt  user    val
0   2016-01-01     a      1
1   2016-01-02     a     33
2   2016-01-05     b      2
3   2016-01-06     b      1

First, convert the dates to datetime:

x['dt'] = pd.to_datetime(x['dt'])

Then, generate the dates and unique users:

dates = x.set_index('dt').resample('D').asfreq().index

>> DatetimeIndex(['2016-01-01', '2016-01-02', '2016-01-03', '2016-01-04',
               '2016-01-05', '2016-01-06'],
              dtype='datetime64[ns]', name='dt', freq='D')

users = x['user'].unique()

>> array(['a', 'b'], dtype=object)
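The date range can also be built directly from the minimum and maximum of the column. The following is a minimal sketch, not part of the original answer, that assumes `pd.date_range` is an acceptable substitute for the `resample` call and checks that both produce the same index on the sample data. The names `via_resample` and `via_range` are illustrative only.

```python
import pandas as pd

# Sample data from the question, with 'dt' already converted to datetime.
x = pd.DataFrame({
    'user': ['a', 'a', 'b', 'b'],
    'dt': pd.to_datetime(['2016-01-01', '2016-01-02', '2016-01-05', '2016-01-06']),
    'val': [1, 33, 2, 1],
})

# Approach used in the answer: resample to daily frequency and keep the index.
via_resample = x.set_index('dt').resample('D').asfreq().index

# Alternative (an assumption, not the answer's method): span min..max directly.
via_range = pd.date_range(x['dt'].min(), x['dt'].max(), freq='D', name='dt')

# Both indexes hold the same daily dates for this data.
print(via_resample.equals(via_range))
```

Building the range from the minimum and maximum does not rely on the dates being unique, which may matter when several users share a date.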

This will allow you to create a MultiIndex:

idx = pd.MultiIndex.from_product((dates, users), names=['dt', 'user'])

>> MultiIndex(levels=[[2016-01-01 00:00:00, 2016-01-02 00:00:00, 2016-01-03 00:00:00, 2016-01-04 00:00:00, 2016-01-05 00:00:00, 2016-01-06 00:00:00], ['a', 'b']],
           labels=[[0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5], [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1]],
           names=['dt', 'user'])

You can use that to reindex your DataFrame:

x.set_index(['dt', 'user']).reindex(idx, fill_value=0).reset_index()
Out: 
           dt user  val
0  2016-01-01    a    1
1  2016-01-01    b    0
2  2016-01-02    a   33
3  2016-01-02    b    0
4  2016-01-03    a    0
5  2016-01-03    b    0
6  2016-01-04    a    0
7  2016-01-04    b    0
8  2016-01-05    a    0
9  2016-01-05    b    2
10 2016-01-06    a    0
11 2016-01-06    b    1

The result can then be sorted by user:

x.set_index(['dt', 'user']).reindex(idx, fill_value=0).reset_index().sort_values(by='user')
Out: 
           dt user  val
0  2016-01-01    a    1
2  2016-01-02    a   33
4  2016-01-03    a    0
6  2016-01-04    a    0
8  2016-01-05    a    0
10 2016-01-06    a    0
1  2016-01-01    b    0
3  2016-01-02    b    0
5  2016-01-03    b    0
7  2016-01-04    b    0
9  2016-01-05    b    2
11 2016-01-06    b    1
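The steps above can be collected into one helper. This is a sketch under a few assumptions: the function name `fill_missing_dates` is hypothetical, the date range is built with `pd.date_range` rather than `resample`, and each (user, dt) pair is assumed to appear at most once, since `reindex` needs a unique index. Putting `users` first in the product yields rows already grouped by user, so no final sort is needed.

```python
import pandas as pd

def fill_missing_dates(df, fill_value=0):
    """Return df with one row per (user, day) between the min and max of 'dt'.

    Hypothetical helper; assumes columns 'user', 'dt', 'val' and that each
    (user, dt) pair appears at most once.
    """
    out = df.copy()
    out['dt'] = pd.to_datetime(out['dt'])

    # Every calendar day between the earliest and latest date in the frame.
    dates = pd.date_range(out['dt'].min(), out['dt'].max(), freq='D', name='dt')
    users = out['user'].unique()

    # Cartesian product of users and dates; users first keeps rows grouped by user.
    idx = pd.MultiIndex.from_product((users, dates), names=['user', 'dt'])

    return (out.set_index(['user', 'dt'])
               .reindex(idx, fill_value=fill_value)
               .reset_index())

x = pd.DataFrame({
    'user': ['a', 'a', 'b', 'b'],
    'dt': ['2016-01-01', '2016-01-02', '2016-01-05', '2016-01-06'],
    'val': [1, 33, 2, 1],
})

result = fill_missing_dates(x)
print(result)
```

The column order of `result` is user, dt, val, because that is the order of the index levels; reorder with `result[['dt', 'user', 'val']]` if the layout in the question is required.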

Answer 2

Answered by piRSquared

As @ayhan suggests, first convert the dates to datetime:

x.dt = pd.to_datetime(x.dt)

Here is a one-liner that uses mostly @ayhan's ideas while incorporating stack/unstack and fill_value:

x.set_index(
    ['dt', 'user']
).unstack(
    fill_value=0
).asfreq(
    'D', fill_value=0
).stack().sort_index(level=1).reset_index()

           dt user  val
0  2016-01-01    a    1
1  2016-01-02    a   33
2  2016-01-03    a    0
3  2016-01-04    a    0
4  2016-01-05    a    0
5  2016-01-06    a    0
6  2016-01-01    b    0
7  2016-01-02    b    0
8  2016-01-03    b    0
9  2016-01-04    b    0
10 2016-01-05    b    2
11 2016-01-06    b    1
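To see what each stage of the chain contributes, it can be split into named intermediates. This sketch is a variation on the one-liner, not the exact code above: it selects the `val` Series before unstacking so the wide frame has plain user columns, and the names `wide`, `wide_daily`, and `long` are illustrative.

```python
import pandas as pd

x = pd.DataFrame({
    'user': ['a', 'a', 'b', 'b'],
    'dt': ['2016-01-01', '2016-01-02', '2016-01-05', '2016-01-06'],
    'val': [1, 33, 2, 1],
})
x['dt'] = pd.to_datetime(x['dt'])

# Step 1: pivot users into columns; missing (dt, user) pairs become 0.
wide = x.set_index(['dt', 'user'])['val'].unstack(fill_value=0)

# Step 2: insert the missing calendar days as rows filled with 0.
wide_daily = wide.asfreq('D', fill_value=0)

# Step 3: go back to long form, ordered by user and then by date.
long = (wide_daily.stack()
                  .sort_index(level='user')
                  .rename('val')
                  .reset_index())

print(wide_daily)
print(long)
```

The wide frame has one row per day and one column per user, so `asfreq` only has to fill gaps along a single DatetimeIndex before the data is stacked back.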
