pandas: How to fill with 0 after calling resample()?

Warning: this page is a Chinese–English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/39452095/


How to fillna() with value 0 after calling resample?

python, pandas

Asked by displayname

Either I don't understand the documentation, or it is outdated.

If I run


user[["DOC_ACC_DT", "USER_SIGNON_ID"]].groupby("DOC_ACC_DT").agg(["count"]).resample("1D").fillna(value=0, method="ffill")

I get

TypeError: fillna() got an unexpected keyword argument 'value'

If I just run


.fillna(0)

I get


ValueError: Invalid fill method. Expecting pad (ffill), backfill (bfill) or nearest. Got 0

If I then set


.fillna(0, method="ffill") 

I get


TypeError: fillna() got multiple values for keyword argument 'method'

so the only thing that works is


.fillna("ffill")

but of course that just does a forward fill. However, I want to replace NaN with zeros. What am I doing wrong here?

Accepted answer by displayname

Well, I don't understand why the code above is not working, and I'm going to wait for somebody to give a better answer than this, but I just found that

.replace(np.nan, 0)

does what I would have expected from .fillna(0).

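A minimal, runnable sketch of this workaround, using toy data shaped like the question's columns (the values are invented for illustration; note that in current pandas an aggregation such as asfreq() is needed before replace can be called):

```python
import numpy as np
import pandas as pd

# Toy event log shaped like the question's data (values are invented)
user = pd.DataFrame({
    "DOC_ACC_DT": pd.to_datetime(["2020-01-01", "2020-01-01", "2020-01-04"]),
    "USER_SIGNON_ID": ["u1", "u2", "u1"],
})

daily = (user.groupby("DOC_ACC_DT")["USER_SIGNON_ID"]
             .count()              # events per day actually present
             .resample("1D")
             .asfreq()             # inserts NaN for the missing days
             .replace(np.nan, 0))  # the workaround: NaN -> 0
print(daily)
```

Here 2020-01-02 and 2020-01-03 become 0 instead of NaN; calling .fillna(0) on the already-aggregated result would do the same.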

Answered by jezrael

I did some tests, and the results are quite interesting.

Sample:


import pandas as pd
import numpy as np

np.random.seed(1)
rng = pd.date_range('1/1/2012', periods=20, freq='S')
df = pd.DataFrame({'a':['a'] * 10 + ['b'] * 10,
                   'b':np.random.randint(0, 500, len(rng))}, index=rng)
df.b.iloc[3:8] = np.nan
print (df)
                     a      b
2012-01-01 00:00:00  a   37.0
2012-01-01 00:00:01  a  235.0
2012-01-01 00:00:02  a  396.0
2012-01-01 00:00:03  a    NaN
2012-01-01 00:00:04  a    NaN
2012-01-01 00:00:05  a    NaN
2012-01-01 00:00:06  a    NaN
2012-01-01 00:00:07  a    NaN
2012-01-01 00:00:08  a  335.0
2012-01-01 00:00:09  a  448.0
2012-01-01 00:00:10  b  144.0
2012-01-01 00:00:11  b  129.0
2012-01-01 00:00:12  b  460.0
2012-01-01 00:00:13  b   71.0
2012-01-01 00:00:14  b  237.0
2012-01-01 00:00:15  b  390.0
2012-01-01 00:00:16  b  281.0
2012-01-01 00:00:17  b  178.0
2012-01-01 00:00:18  b  276.0
2012-01-01 00:00:19  b  254.0

Downsampling:


Possible solution with Resampler.asfreq:


If you use asfreq, the behaviour is the same as aggregating with first:

print (df.groupby('a').resample('2S').first())
                       a      b
a                              
a 2012-01-01 00:00:00  a   37.0
  2012-01-01 00:00:02  a  396.0
  2012-01-01 00:00:04  a    NaN
  2012-01-01 00:00:06  a    NaN
  2012-01-01 00:00:08  a  335.0
b 2012-01-01 00:00:10  b  144.0
  2012-01-01 00:00:12  b  460.0
  2012-01-01 00:00:14  b  237.0
  2012-01-01 00:00:16  b  281.0
  2012-01-01 00:00:18  b  276.0
print (df.groupby('a').resample('2S').first().fillna(0))
                       a      b
a                              
a 2012-01-01 00:00:00  a   37.0
  2012-01-01 00:00:02  a  396.0
  2012-01-01 00:00:04  a    0.0
  2012-01-01 00:00:06  a    0.0
  2012-01-01 00:00:08  a  335.0
b 2012-01-01 00:00:10  b  144.0
  2012-01-01 00:00:12  b  460.0
  2012-01-01 00:00:14  b  237.0
  2012-01-01 00:00:16  b  281.0
  2012-01-01 00:00:18  b  276.0

print (df.groupby('a').resample('2S').asfreq().fillna(0))
                       a      b
a                              
a 2012-01-01 00:00:00  a   37.0
  2012-01-01 00:00:02  a  396.0
  2012-01-01 00:00:04  a    0.0
  2012-01-01 00:00:06  a    0.0
  2012-01-01 00:00:08  a  335.0
b 2012-01-01 00:00:10  b  144.0
  2012-01-01 00:00:12  b  460.0
  2012-01-01 00:00:14  b  237.0
  2012-01-01 00:00:16  b  281.0
  2012-01-01 00:00:18  b  276.0

If you use replace, the other values are aggregated as with mean:

print (df.groupby('a').resample('2S').mean())
                           b
a                           
a 2012-01-01 00:00:00  136.0
  2012-01-01 00:00:02  396.0
  2012-01-01 00:00:04    NaN
  2012-01-01 00:00:06    NaN
  2012-01-01 00:00:08  391.5
b 2012-01-01 00:00:10  136.5
  2012-01-01 00:00:12  265.5
  2012-01-01 00:00:14  313.5
  2012-01-01 00:00:16  229.5
  2012-01-01 00:00:18  265.0
print (df.groupby('a').resample('2S').mean().fillna(0))
                           b
a                           
a 2012-01-01 00:00:00  136.0
  2012-01-01 00:00:02  396.0
  2012-01-01 00:00:04    0.0
  2012-01-01 00:00:06    0.0
  2012-01-01 00:00:08  391.5
b 2012-01-01 00:00:10  136.5
  2012-01-01 00:00:12  265.5
  2012-01-01 00:00:14  313.5
  2012-01-01 00:00:16  229.5
  2012-01-01 00:00:18  265.0

print (df.groupby('a').resample('2S').replace(np.nan,0))
                           b
a                           
a 2012-01-01 00:00:00  136.0
  2012-01-01 00:00:02  396.0
  2012-01-01 00:00:04    0.0
  2012-01-01 00:00:06    0.0
  2012-01-01 00:00:08  391.5
b 2012-01-01 00:00:10  136.5
  2012-01-01 00:00:12  265.5
  2012-01-01 00:00:14  313.5
  2012-01-01 00:00:16  229.5
  2012-01-01 00:00:18  265.0

Upsampling:


Here, using asfreq is the same as using replace:

print (df.groupby('a').resample('200L').asfreq().fillna(0))
                           a      b
a                                  
a 2012-01-01 00:00:00.000  a   37.0
  2012-01-01 00:00:00.200  0    0.0
  2012-01-01 00:00:00.400  0    0.0
  2012-01-01 00:00:00.600  0    0.0
  2012-01-01 00:00:00.800  0    0.0
  2012-01-01 00:00:01.000  a  235.0
  2012-01-01 00:00:01.200  0    0.0
  2012-01-01 00:00:01.400  0    0.0
  2012-01-01 00:00:01.600  0    0.0
  2012-01-01 00:00:01.800  0    0.0
  2012-01-01 00:00:02.000  a  396.0
  2012-01-01 00:00:02.200  0    0.0
  2012-01-01 00:00:02.400  0    0.0
  ...

print (df.groupby('a').resample('200L').replace(np.nan,0))
                               b
a                               
a 2012-01-01 00:00:00.000   37.0
  2012-01-01 00:00:00.200    0.0
  2012-01-01 00:00:00.400    0.0
  2012-01-01 00:00:00.600    0.0
  2012-01-01 00:00:00.800    0.0
  2012-01-01 00:00:01.000  235.0
  2012-01-01 00:00:01.200    0.0
  2012-01-01 00:00:01.400    0.0
  2012-01-01 00:00:01.600    0.0
  2012-01-01 00:00:01.800    0.0
  2012-01-01 00:00:02.000  396.0
  2012-01-01 00:00:02.200    0.0
  2012-01-01 00:00:02.400    0.0
  ...
print ((df.groupby('a').resample('200L').replace(np.nan,0).b == 
       df.groupby('a').resample('200L').asfreq().fillna(0).b).all())
True

Conclusion:


For downsampling, use an aggregating function such as sum, first, or mean; for upsampling, use asfreq.
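This conclusion can be condensed into one small sketch (toy series with an invented regular DatetimeIndex):

```python
import numpy as np
import pandas as pd

rng = pd.date_range("2020-01-01", periods=6, freq="1min")
s = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0, 6.0], index=rng)

# Downsampling: pick an aggregation first, then fill the empty bins
down = s.resample("2min").mean().fillna(0)  # bin 00:02 is all-NaN -> 0.0

# Upsampling: asfreq materializes the new timestamps as NaN, then fill
# (note fillna(0) also overwrites NaN already present in the data)
up = s.resample("30s").asfreq().fillna(0)
```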

Answered by Ryszard Cetnarski

The issue here is that you are trying to call the fillna method on the DatetimeIndexResampler object returned by the resample method. If you call an aggregation function before fillna, it will work, for example: df.resample('1H').sum().fillna(0)
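A small runnable sketch of that point, on invented toy data (2-hourly values resampled to hourly; mean is used here so the empty bins actually produce NaN for fillna to fill):

```python
import pandas as pd

rng = pd.date_range("2020-01-01", periods=4, freq="2h")
df = pd.DataFrame({"val": [1.0, 2.0, 3.0, 4.0]}, index=rng)

# df.resample("1h") alone returns a Resampler object with no fillna();
# aggregating first turns it back into a DataFrame, where fillna works
hourly = df.resample("1h").mean().fillna(0)
```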

Answered by Nickil Maveli

The only workaround close to using fillna directly would be to call it after performing .head(len(df.index)).

I presume DF.head is useful here mainly because when the resample function is applied to a groupby object, it acts as a filter on the input, returning a reduced shape of the original due to the elimination of groups.

Calling DF.head() is not affected by this transformation and returns the entire DF.

Demo:


np.random.seed(42)

df = pd.DataFrame(np.random.randn(10, 2),
              index=pd.date_range('1/1/2016', freq='10D', periods=10),
              columns=['A', 'B']).reset_index()

df
       index         A         B
0 2016-01-01  0.496714 -0.138264
1 2016-01-11  0.647689  1.523030
2 2016-01-21 -0.234153 -0.234137
3 2016-01-31  1.579213  0.767435
4 2016-02-10 -0.469474  0.542560
5 2016-02-20 -0.463418 -0.465730
6 2016-03-01  0.241962 -1.913280
7 2016-03-11 -1.724918 -0.562288
8 2016-03-21 -1.012831  0.314247
9 2016-03-31 -0.908024 -1.412304

Operations:


resampled_group = df[['index', 'A']].groupby(['index'])['A'].agg('count').resample('2D')
resampled_group.head(len(resampled_group.index)).fillna(0).head(20)

index
2016-01-01    1.0
2016-01-03    0.0
2016-01-05    0.0
2016-01-07    0.0
2016-01-09    0.0
2016-01-11    1.0
2016-01-13    0.0
2016-01-15    0.0
2016-01-17    0.0
2016-01-19    0.0
2016-01-21    1.0
2016-01-23    0.0
2016-01-25    0.0
2016-01-27    0.0
2016-01-29    0.0
2016-01-31    1.0
2016-02-02    0.0
2016-02-04    0.0
2016-02-06    0.0
2016-02-08    0.0
Freq: 2D, Name: A, dtype: float64
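A closing note not found in the original answers: newer pandas versions also accept a fill_value argument on Resampler.asfreq, which removes the need for a separate fillna step when upsampling:

```python
import pandas as pd

s = pd.Series([1.0, 2.0],
              index=pd.to_datetime(["2020-01-01 00:00", "2020-01-01 00:03"]))

# fill_value applies only to the bins introduced by upsampling,
# not to NaN values that already existed in the data
out = s.resample("1min").asfreq(fill_value=0)
```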