Notice: this page reproduces a popular StackOverflow question with a Chinese-English parallel translation, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/16319106/

Time: 2020-09-13 20:47:50  Source: igfitidea

What's the equivalent of cut/qcut for pandas date fields?

python, pandas

Asked by patricksurry

Update: starting with version 0.20.0, pandas cut/qcut DOES handle date fields. See What's New for more.

pd.cut and pd.qcut now support datetime64 and timedelta64 dtypes (GH14714, GH14798)
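
As a rough illustration of the update above, and assuming pandas 0.20.0 or later, datetime values can be passed straight to pd.cut and pd.qcut. The sample dates and the month-start bin edges below are made up for the example and are not from the original post.

```python
import pandas as pd

# Hypothetical sample of "received" dates
dates = pd.Series(pd.to_datetime(
    ["2012-07-25", "2012-11-30", "2013-01-03", "2013-04-18"]
))

# Explicit bin edges: the first day of each month covering the data
edges = pd.date_range("2012-07-01", "2013-05-01", freq="MS")

# right=False makes each bin [month start, next month start)
month_bins = pd.cut(dates, bins=edges, right=False)

# qcut picks the edges itself so that each bin holds about the same number of dates
halves = pd.qcut(dates, q=2)

print(month_bins)
print(halves)
```

The labels here are Interval objects with Timestamp endpoints rather than Period labels, so they may still need relabeling for display.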

Original question: Pandas cut and qcut functions are great for 'bucketing' continuous data for use in pivot tables and so forth, but I can't see an easy way to get datetime axes in the mix. Frustrating since pandas is so great at all the time-related stuff!

Here's a simple example:

import numpy as np
import pandas as pd

def randomDates(size, start=134e7, end=137e7):
    # draw random integer seconds-since-epoch, then reinterpret them as datetimes
    secs = np.random.randint(int(start), int(end), size)
    return secs.astype('datetime64[s]')

df = pd.DataFrame({'ship' : randomDates(10), 'recd' : randomDates(10),
                   'qty' : np.random.randint(0,10,10), 'price' : 100*np.random.random(10)})
df

     price      qty recd                ship
0    14.723510   3  2012-11-30 19:32:27 2013-03-08 23:10:12
1    53.535143   2  2012-07-25 14:26:45 2012-10-01 11:06:39
2    85.278743   7  2012-12-07 22:24:20 2013-02-26 10:23:20
3    35.940935   8  2013-04-18 13:49:43 2013-03-29 21:19:26
4    54.218896   8  2013-01-03 09:00:15 2012-08-08 12:50:41
5    61.404931   9  2013-02-10 19:36:54 2013-02-23 13:14:42
6    28.917693   1  2012-12-13 02:56:40 2012-09-08 21:14:45
7    88.440408   8  2013-04-04 22:54:55 2012-07-31 18:11:35
8    77.329931   7  2012-11-23 00:49:26 2012-12-09 19:27:40
9    46.540859   5  2013-03-13 11:37:59 2013-03-17 20:09:09

To bin by groups of price or quantity, I can use cut/qcut to bucket them:

df.groupby([pd.cut(df['qty'], bins=[0,1,5,10]), pd.qcut(df['price'],q=3)]).count()

                       price  qty recd ship
qty     price               
(0, 1]  [14.724, 46.541]   1   1   1   1
(1, 5]  [14.724, 46.541]   2   2   2   2
        (46.541, 61.405]   1   1   1   1
(5, 10] [14.724, 46.541]   1   1   1   1
        (46.541, 61.405]   2   2   2   2
         (61.405, 88.44]   3   3   3   3
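
To make the difference between the two calls concrete, here is a small sketch using the qty and price values from the sample table above (copied by hand and rounded, so treat them as illustrative): cut uses the bin edges you give it, while qcut chooses edges from the data's quantiles.

```python
import pandas as pd

# Values copied (rounded) from the sample frame shown earlier
qty = pd.Series([3, 2, 7, 8, 8, 9, 1, 8, 7, 5])
price = pd.Series([14.72, 53.54, 85.28, 35.94, 54.22,
                   61.40, 28.92, 88.44, 77.33, 46.54])

# cut: fixed, hand-picked edges; bins are (0, 1], (1, 5], (5, 10]
qty_bins = pd.cut(qty, bins=[0, 1, 5, 10])

# qcut: three bins with roughly equal counts, edges taken from the data
price_bins = pd.qcut(price, q=3)

print(qty_bins.value_counts().sort_index())
print(price_bins.value_counts().sort_index())
```

Note that with these edges a qty of 0 would fall outside every bin and come back as NaN, since the first interval is open on the left.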

But I can't see any easy way of doing the same thing with my 'recd' or 'ship' date fields. For example, generate a similar table of counts broken down by (say) monthly buckets of recd and ship. It seems like resample() has all of the machinery to bucket into periods, but I can't figure out how to apply it here. The buckets (or levels) in the 'date cut' would be equivalent to a pandas.PeriodIndex, and then I want to label each value of df['recd'] with the period it falls into?

So the kind of output I'm looking for would be something like:

ship    recd     count
2011-01 2011-01  1
        2011-02  3
        ...      ...
2011-02 2011-01  2
        2011-02  6
...     ...      ...

More generally, I'd like to be able to mix and match continuous or categorical variables in the output. Imagine df also contains a 'status' column with red/yellow/green values, then maybe I want to summarize counts by status, price bucket, ship and recd buckets, so:

ship    recd     price   status count
2011-01 2011-01  [0-10)   green     1
                            red     4
                 [10-20) yellow     2
                  ...      ...    ...
        2011-02  [0-10)  yellow     3
        ...      ...       ...    ...

As a bonus question, what's the simplest way to modify the groupby() result above to just contain a single output column called 'count'?

Accepted answer by Garrett

Here's a solution using pandas.PeriodIndex (caveat: PeriodIndex doesn't seem to support time rules with a multiple > 1, such as '4M'). I think the answer to your bonus question is .size().

In [49]: df.groupby([pd.PeriodIndex(df.recd, freq='Q'),
   ....:             pd.PeriodIndex(df.ship, freq='Q'),
   ....:             pd.cut(df['qty'], bins=[0,5,10]),
   ....:             pd.qcut(df['price'],q=2),
   ....:            ]).size()
Out[49]: 
                qty      price 
2012Q2  2013Q1  (0, 5]   [2, 5]    1
2012Q3  2013Q1  (5, 10]  [2, 5]    1
2012Q4  2012Q3  (5, 10]  [2, 5]    1
        2013Q1  (0, 5]   [2, 5]    1
                (5, 10]  [2, 5]    1
2013Q1  2012Q3  (0, 5]   (5, 8]    1
        2013Q1  (5, 10]  (5, 8]    2
2013Q2  2012Q4  (0, 5]   (5, 8]    1
        2013Q2  (0, 5]   [2, 5]    1
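
For the monthly ship/recd table the question asks for, a minimal sketch along the same lines might look like this. It swaps in the Series .dt.to_period accessor as another way to get the same period labels as pd.PeriodIndex, and uses .size() plus reset_index(name='count') to end up with a single 'count' column. The four-row frame is invented for the example.

```python
import pandas as pd

# Hypothetical data with the same column names as the question
df = pd.DataFrame({
    "recd": pd.to_datetime(["2012-11-30", "2012-12-07", "2012-12-13", "2013-01-03"]),
    "ship": pd.to_datetime(["2013-03-08", "2013-02-26", "2012-09-08", "2012-08-08"]),
    "qty": [3, 7, 1, 8],
})

counts = (
    df.groupby([df["recd"].dt.to_period("M"),   # label each row with its month
                df["ship"].dt.to_period("M")])
      .size()                                   # one number per group, not one per column
      .reset_index(name="count")                # flatten into recd / ship / count columns
)
print(counts)
```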

Answer by Jeff

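You just need to set the field you'd like to resample by as the index. Here are some examples (they use the how= keyword that resample() accepted at the time; later pandas releases removed it in favor of method calls such as .resample('1M').sum()):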

In [36]: df.set_index('recd').resample('1M',how='sum')
Out[36]: 
                 price  qty
recd                       
2012-07-31   64.151194    9
2012-08-31   93.476665    7
2012-09-30   94.193027    7
2012-10-31         NaN  NaN
2012-11-30         NaN  NaN
2012-12-31   12.353405    6
2013-01-31         NaN  NaN
2013-02-28  129.586697    7
2013-03-31         NaN  NaN
2013-04-30         NaN  NaN
2013-05-31  211.979583   13

In [37]: df.set_index('recd').resample('1M',how='count')
Out[37]: 
2012-07-31  price    1
            qty      1
            ship     1
2012-08-31  price    1
            qty      1
            ship     1
2012-09-30  price    2
            qty      2
            ship     2
2012-10-31  price    0
            qty      0
            ship     0
2012-11-30  price    0
            qty      0
            ship     0
2012-12-31  price    1
            qty      1
            ship     1
2013-01-31  price    0
            qty      0
            ship     0
2013-02-28  price    2
            qty      2
            ship     2
2013-03-31  price    0
            qty      0
            ship     0
2013-04-30  price    0
            qty      0
            ship     0
2013-05-31  price    3
            qty      3
            ship     3
dtype: int64
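
On recent pandas versions the same idea can be sketched with method chaining instead of how=. The small frame below is invented, and the 'MS' (month start) rule is used here only because it is spelled the same across versions; it labels each bucket by the first day of the month rather than the last, unlike the output above.

```python
import pandas as pd

# Hypothetical data; only numeric columns are kept so that sum() is well defined
df = pd.DataFrame({
    "recd": pd.to_datetime(["2012-07-25", "2012-09-08", "2012-09-20", "2012-12-07"]),
    "price": [53.5, 28.9, 88.4, 85.3],
    "qty": [2, 1, 8, 7],
})

by_recd = df.set_index("recd")

monthly_sum = by_recd.resample("MS").sum()     # per-month totals of price and qty
monthly_count = by_recd.resample("MS").size()  # number of rows in each month

print(monthly_sum)
print(monthly_count)
```

In recent releases empty months sum to 0 rather than NaN, so the result will not match the older output shown above exactly.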

Answer by patricksurry

I came up with an idea that relies on the underlying storage format of datetime64[ns]. If you define dcut() like this:

def dcut(dts, freq='D', right=True):
    hi = pd.Period(dts.max(), freq=freq) + 1   # get first period past end of data
    # PeriodIndex(start=..., end=...) was removed in later pandas; period_range replaces it
    periods = pd.period_range(start=dts.min(), end=hi, freq=freq)
    # get an array of integer bin boundaries representing ns-since-epoch
    # note the extra period gives us the extra right-hand bin boundary we need
    bounds = periods.to_timestamp(how='start').to_numpy().astype('datetime64[ns]').astype('int64')
    # bin our time field as integers, forcing ns so the units match the boundaries
    values = dts.to_numpy().astype('datetime64[ns]').astype('int64')
    cut = pd.cut(values, bins=bounds, right=right)
    # relabel the bins using the periods, omitting the extra one at the end
    # (Categorical.levels became .categories, so use rename_categories)
    return cut.rename_categories(list(periods[:-1].astype(str)))

Then we can do what I wanted:

df.groupby([dcut(df.recd, freq='M', right=False), dcut(df.ship, freq='M', right=False)]).count()

To get:

                price qty recd ship
2012-07 2012-10   1    1    1    1
2012-11 2012-12   1    1    1    1
        2013-03   1    1    1    1  
2012-12 2012-09   1    1    1    1
        2013-02   1    1    1    1  
2013-01 2012-08   1    1    1    1
2013-02 2013-02   1    1    1    1
2013-03 2013-03   1    1    1    1
2013-04 2012-07   1    1    1    1
        2013-03   1    1    1    1  

I guess you could similarly define dqcut() which first "rounds" each datetime value to the integer representing the start of its containing period (at your specified frequency), and then uses qcut() to choose amongst those boundaries. Or do qcut() first on the raw integer values and round the resulting bins based on your chosen frequency?
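
The first of those two ideas could be sketched roughly as follows. The dqcut name comes from the paragraph above, but the signature and body are hypothetical, and the sketch assumes a pandas version in which qcut accepts datetimes and the duplicates argument (0.20.0 or later).

```python
import pandas as pd

def dqcut(dts, q, freq="M"):
    """Hypothetical helper: quantile-bin datetimes after snapping them to period starts."""
    # "Round" each datetime down to the start of the period that contains it
    starts = dts.dt.to_period(freq).dt.start_time
    # Many values now coincide, so quantile edges may repeat; drop repeated edges
    return pd.qcut(starts, q=q, duplicates="drop")

# Made-up dates for illustration
recd = pd.Series(pd.to_datetime([
    "2012-07-25", "2012-08-08", "2012-09-08", "2012-11-23",
    "2012-12-07", "2013-01-03", "2013-02-10", "2013-04-18",
]))

buckets = dqcut(recd, q=2)
print(buckets.value_counts().sort_index())
```

Because repeated edges are dropped, the result can contain fewer than q bins when many dates fall in the same period.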

No joy on the bonus question yet? :)

Answer by danodonovan

How about using Series and putting the parts of the DataFrame that you're interested in into that, then calling cut on the Series object?

price_series = pd.Series(df.price.tolist(), index=df.recd)

and then

 pd.qcut(price_series, q=3)

and so on. (Though I think @Jeff's answer is best)
