What's the equivalent of cut/qcut for pandas date fields?
Original source: http://stackoverflow.com/questions/16319106/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Asked by patricksurry
Update: starting with version 0.20.0, pandas cut/qcut DOES handle date fields. See What's New for more.
pd.cut and pd.qcut now support datetime64 and timedelta64 dtypes (GH14714, GH14798)
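As a rough illustration of the update above, here is a minimal sketch of cut/qcut applied directly to a datetime column. The sample dates and bin edges are made up for the example and are not taken from the question's data.

```python
import pandas as pd

# Hypothetical sample dates (not the question's random data)
dates = pd.Series(pd.to_datetime(
    ['2012-07-25', '2012-11-30', '2013-01-03', '2013-04-18']))

# Explicit quarterly bin edges; right=False makes each bin [start, next_start)
edges = pd.to_datetime(
    ['2012-07-01', '2012-10-01', '2013-01-01', '2013-04-01', '2013-07-01'])
quarters = pd.cut(dates, bins=edges, right=False)

# Quantile-based bins computed on the datetime values themselves
halves = pd.qcut(dates, q=2)

print(quarters)
print(halves)
```

Either result can be passed to groupby() in the same way as the numeric cut/qcut results in the question.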
Original question: The pandas cut and qcut functions are great for 'bucketing' continuous data for use in pivot tables and so forth, but I can't see an easy way to get datetime axes in the mix. Frustrating, since pandas is so great at all the time-related stuff!
Here's a simple example:
import numpy as np
import pandas as pd
def randomDates(size, start=134e7, end=137e7):
    # random seconds-since-epoch values, interpreted as datetime64[s]
    return np.array(np.random.randint(int(start), int(end), size), dtype='datetime64[s]')
df = pd.DataFrame({'ship' : randomDates(10), 'recd' : randomDates(10),
                   'qty' : np.random.randint(0,10,10), 'price' : 100*np.random.random(10)})
df
       price  qty                recd                ship
0  14.723510    3 2012-11-30 19:32:27 2013-03-08 23:10:12
1  53.535143    2 2012-07-25 14:26:45 2012-10-01 11:06:39
2  85.278743    7 2012-12-07 22:24:20 2013-02-26 10:23:20
3  35.940935    8 2013-04-18 13:49:43 2013-03-29 21:19:26
4  54.218896    8 2013-01-03 09:00:15 2012-08-08 12:50:41
5  61.404931    9 2013-02-10 19:36:54 2013-02-23 13:14:42
6  28.917693    1 2012-12-13 02:56:40 2012-09-08 21:14:45
7  88.440408    8 2013-04-04 22:54:55 2012-07-31 18:11:35
8  77.329931    7 2012-11-23 00:49:26 2012-12-09 19:27:40
9  46.540859    5 2013-03-13 11:37:59 2013-03-17 20:09:09
To bin by groups of price or quantity, I can use cut/qcut to bucket them:
df.groupby([pd.cut(df['qty'], bins=[0,1,5,10]), pd.qcut(df['price'],q=3)]).count()
                          price  qty  recd  ship
qty     price
(0, 1]  [14.724, 46.541]      1    1     1     1
(1, 5]  [14.724, 46.541]      2    2     2     2
        (46.541, 61.405]      1    1     1     1
(5, 10] [14.724, 46.541]      1    1     1     1
        (46.541, 61.405]      2    2     2     2
        (61.405, 88.44]       3    3     3     3
But I can't see any easy way of doing the same thing with my 'recd' or 'ship' date fields, for example to generate a similar table of counts broken down by (say) monthly buckets of recd and ship. It seems like resample() has all of the machinery to bucket into periods, but I can't figure out how to apply it here. The buckets (or levels) in the 'date cut' would be equivalent to a pandas.PeriodIndex, and I then want to label each value of df['recd'] with the period it falls into.
So the kind of output I'm looking for would be something like:
ship     recv     count
2011-01  2011-01  1
         2011-02  3
...      ...
2011-02  2011-01  2
         2011-02  6
...      ...      ...
More generally, I'd like to be able to mix and match continuous or categorical variables in the output. Imagine df also contains a 'status' column with red/yellow/green values; then maybe I want to summarize counts by status, price bucket, ship and recd buckets, so:
ship     recv     price    status  count
2011-01  2011-01  [0-10)   green   1
                           red     4
                  [10-20)  yellow  2
...      ...      ...
         2011-02  [0-10)   yellow  3
...      ...      ...      ...
As a bonus question, what's the simplest way to modify the groupby() result above to just contain a single output column called 'count'?
Accepted answer by Garrett
Here's a solution using pandas.PeriodIndex (caveat: PeriodIndex doesn't seem to support time rules with a multiple > 1, such as '4M'). I think the answer to your bonus question is .size().
In [49]: df.groupby([pd.PeriodIndex(df.recd, freq='Q'),
....: pd.PeriodIndex(df.ship, freq='Q'),
....: pd.cut(df['qty'], bins=[0,5,10]),
....: pd.qcut(df['price'],q=2),
....: ]).size()
Out[49]:
                qty      price
2012Q2  2013Q1  (0, 5]   [2, 5]    1
2012Q3  2013Q1  (5, 10]  [2, 5]    1
2012Q4  2012Q3  (5, 10]  [2, 5]    1
        2013Q1  (0, 5]   [2, 5]    1
                (5, 10]  [2, 5]    1
2013Q1  2012Q3  (0, 5]   (5, 8]    1
        2013Q1  (5, 10]  (5, 8]    2
2013Q2  2012Q4  (0, 5]   (5, 8]    1
        2013Q2  (0, 5]   [2, 5]    1
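The same period-based grouping can be sketched with the Series.dt.to_period accessor, followed by .size() and reset_index(name='count') to end up with a single 'count' column as asked in the bonus question. The small frame below is invented for illustration; only the column names 'recd' and 'ship' come from the question.

```python
import pandas as pd

# Hypothetical data reusing the question's column names
df = pd.DataFrame({
    'recd': pd.to_datetime(['2012-07-25', '2012-07-30', '2012-12-07', '2013-01-03']),
    'ship': pd.to_datetime(['2012-10-01', '2012-10-15', '2013-02-26', '2012-08-08']),
})

# Label each date with the monthly period it falls into, then count rows per pair
counts = (
    df.groupby([df['recd'].dt.to_period('M'), df['ship'].dt.to_period('M')])
      .size()                       # one number per (recd, ship) bucket
      .reset_index(name='count')    # flatten into a single 'count' column
)
print(counts)
```

Swapping 'M' for 'Q' should give quarterly buckets like those in the output above.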
Answer by Jeff
You just need to set the index to the field you'd like to resample by. Here are some examples:
In [36]: df.set_index('recd').resample('1M',how='sum')
Out[36]:
                 price  qty
recd
2012-07-31   64.151194    9
2012-08-31   93.476665    7
2012-09-30   94.193027    7
2012-10-31         NaN  NaN
2012-11-30         NaN  NaN
2012-12-31   12.353405    6
2013-01-31         NaN  NaN
2013-02-28  129.586697    7
2013-03-31         NaN  NaN
2013-04-30         NaN  NaN
2013-05-31  211.979583   13
In [37]: df.set_index('recd').resample('1M',how='count')
Out[37]:
2012-07-31  price    1
            qty      1
            ship     1
2012-08-31  price    1
            qty      1
            ship     1
2012-09-30  price    2
            qty      2
            ship     2
2012-10-31  price    0
            qty      0
            ship     0
2012-11-30  price    0
            qty      0
            ship     0
2012-12-31  price    1
            qty      1
            ship     1
2013-01-31  price    0
            qty      0
            ship     0
2013-02-28  price    2
            qty      2
            ship     2
2013-03-31  price    0
            qty      0
            ship     0
2013-04-30  price    0
            qty      0
            ship     0
2013-05-31  price    3
            qty      3
            ship     3
dtype: int64
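Note that the how= argument used above was removed from resample() in later pandas releases, where the aggregation is called as a method on the resampler instead. A minimal sketch under that assumption, with made-up data, could look like this. It uses the 'MS' (month start) rule because, as far as I know, the spelling of the month-end alias differs between pandas versions, so the bucket labels here are month starts rather than the month ends shown above.

```python
import pandas as pd

# Hypothetical data reusing the question's column names
df = pd.DataFrame({
    'recd': pd.to_datetime(['2012-07-25', '2012-09-03', '2012-09-20']),
    'qty': [2, 3, 4],
    'price': [10.0, 20.0, 30.0],
})

# Index by the date field, then bucket by calendar month (labelled by month start)
monthly = df.set_index('recd').resample('MS')

sums = monthly.sum()             # per-month totals of qty and price
counts = monthly['qty'].count()  # per-month row counts

print(sums)
print(counts)
```

In this form, months with no rows still appear in the result, with a count of 0.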
Answer by patricksurry
I came up with an idea that relies on the underlying storage format of datetime64[ns]. If you define dcut() like this:
def dcut(dts, freq='D', right=True):
    lo = pd.Period(dts.min(), freq=freq)
    hi = pd.Period(dts.max(), freq=freq) + 1  # get first period past end of data
    periods = pd.period_range(start=lo, end=hi, freq=freq)
    # get integer bin boundaries representing ns-since-epoch
    # note the extra period gives us the extra right-hand bin boundary we need
    bounds = np.array(periods.to_timestamp(how='start'), dtype='datetime64[ns]').astype('int64')
    # bin our time field as integers (converted to ns so the units match the bounds)
    cut = pd.cut(dts.astype('datetime64[ns]').astype('int64'), bins=bounds, right=right)
    # relabel the bins using the periods, omitting the extra one at the end
    return cut.cat.rename_categories(periods[:-1].astype(str))
Then we can do what I wanted:
df.groupby([dcut(df.recd, freq='M', right=False),dcut(df.ship, freq='M', right=False)]).count()
To get:
                 price  qty  recd  ship
2012-07 2012-10      1    1     1     1
2012-11 2012-12      1    1     1     1
        2013-03      1    1     1     1
2012-12 2012-09      1    1     1     1
        2013-02      1    1     1     1
2013-01 2012-08      1    1     1     1
2013-02 2013-02      1    1     1     1
2013-03 2013-03      1    1     1     1
2013-04 2012-07      1    1     1     1
        2013-03      1    1     1     1
I guess you could similarly define dqcut(), which first "rounds" each datetime value to the integer representing the start of its containing period (at your specified frequency), and then uses qcut() to choose amongst those boundaries. Or do qcut() first on the raw integer values and round the resulting bins based on your chosen frequency?
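A rough sketch of the first dqcut() variant described above, assuming a pandas version in which qcut accepts datetime values directly (0.20.0 or later, per the update at the top). The function body, its parameters and the sample dates are illustrative guesses, not code from the original answer.

```python
import pandas as pd

def dqcut(dts, q, freq='M'):
    """Hypothetical helper: quantile-bin datetimes on period boundaries."""
    # "Round" each datetime down to the start of its containing period
    starts = dts.dt.to_period(freq).dt.start_time
    # Choose quantile edges amongst those period starts;
    # duplicates='drop' merges edges that coincide after rounding
    return pd.qcut(starts, q=q, duplicates='drop')

# Made-up sample dates
dates = pd.Series(pd.to_datetime(
    ['2012-07-25', '2012-08-08', '2012-11-30', '2013-01-03']))
binned = dqcut(dates, q=2)
print(binned)
```

Because the rounding happens before qcut(), every value in the same period is guaranteed to land in the same bin, at the cost of bins that may be less evenly populated.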
No joy on the bonus question yet? :)
Answer by danodonovan
How about using a Series, putting the parts of the DataFrame that you're interested in into it, and then calling cut on the Series object?
price_series = pd.Series(df.price.tolist(), index=df.recd)
and then
pd.qcut(price_series, q=3)
and so on. (Though I think @Jeff's answer is best.)

