Pandas pd.cut() - binning datetime column / series
Note: this page is a translation of a popular StackOverFlow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverFlow
Original URL: http://stackoverflow.com/questions/43500894/
Asked by Arthur D. Howland
I am attempting to bin dates using pd.cut(), but the setup is fairly elaborate.
A colleague sends me multiple files with report dates such as:
'03-16-2017 to 03-22-2017'
'03-23-2017 to 03-29-2017'
'03-30-2017 to 04-05-2017'
They are all combined into a single dataframe and given a column name, df['Filedate'] so that every record in the file has the correct filedate.
The last day of each range is a cutoff point, so I created a new column, df['Filedate_bin'], which holds that last day as a string: 3/22/2017, 3/29/2017, 4/05/2017.
Then I created a list: Filedate_bin_list= df.Filedate_bin.unique(). As a result I have a unique list of string cutoff dates that I would like to use as bins.
After importing different data into a dataframe, I have a column of transaction dates: 3/28/2017, 3/29/2017, 3/30/2017, 4/1/2017, 4/2/2017, etc. Assigning them to a bin is proving difficult. I tried:
df['bin'] = pd.cut(df.Processed_date, Filedate_bin_list)
Received TypeError: unsupported operand type for -: 'str' and 'str'
I went back and tried converting Filedate_bin to datetime with format='%m/%d/%Y', and got:
TypeError: Cannot cast ufunc less input from dtype('<m8[ns]') to dtype ('<m8') with casting rule 'same_kind'.
Is there a better way to bin my processed_date(s) to text bins?
I am trying to tie a processed date such as 3/27/2017 to the label '03-23-2017 to 03-29-2017'.
Answered by MaxU
UPDATE: starting from Pandas v0.20.1 (May 5, 2017), pd.cut and pd.qcut support datetime64 and timedelta64 dtypes (GH14714, GH14798).
Thanks @lighthouse65 for checking this!
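As a rough illustration of the update, here is a minimal sketch that passes datetime values and datetime bin edges straight to pd.cut. It assumes pandas 0.20.1 or newer and reuses the sample dates from the old answer; nothing here is specific to the asker's data.

```python
import pandas as pd

# Sample data: 15 consecutive days (same values as in the old answer)
df = pd.DataFrame({'Date': pd.date_range('2000-01-02', freq='1D', periods=15)})

# Bin edges every 3 days, kept as datetimes (no integer conversion)
bins_dt = pd.date_range('2000-01-01', freq='3D', periods=6)

# pd.cut accepts datetime64 data and datetime bins directly;
# bins are right-closed by default, i.e. (left, right]
df['cat'] = pd.cut(df['Date'], bins=bins_dt)

print(df.head())
```

Without a labels argument, each category is an interval whose endpoints are timestamps. String labels can still be supplied through labels= if text bins are preferred.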
Old answer:
Consider this approach:
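import numpy as np
import pandas as pd

# 15 consecutive days of sample data
df = pd.DataFrame(pd.date_range('2000-01-02', freq='1D', periods=15), columns=['Date'])
# bin edges every 3 days, plus a readable string label for each interval
bins_dt = pd.date_range('2000-01-01', freq='3D', periods=6)
bins_str = bins_dt.astype(str).values
labels = ['({}, {}]'.format(bins_str[i-1], bins_str[i]) for i in range(1, len(bins_str))]
# cut on integer epoch seconds, since older pandas could not bin datetimes directly
df['cat'] = pd.cut(df.Date.astype(np.int64)//10**9,
                   bins=bins_dt.astype(np.int64)//10**9,
                   labels=labels)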
Result:
In [59]: df
Out[59]:
         Date                       cat
0  2000-01-02  (2000-01-01, 2000-01-04]
1  2000-01-03  (2000-01-01, 2000-01-04]
2  2000-01-04  (2000-01-01, 2000-01-04]
3  2000-01-05  (2000-01-04, 2000-01-07]
4  2000-01-06  (2000-01-04, 2000-01-07]
5  2000-01-07  (2000-01-04, 2000-01-07]
6  2000-01-08  (2000-01-07, 2000-01-10]
7  2000-01-09  (2000-01-07, 2000-01-10]
8  2000-01-10  (2000-01-07, 2000-01-10]
9  2000-01-11  (2000-01-10, 2000-01-13]
10 2000-01-12  (2000-01-10, 2000-01-13]
11 2000-01-13  (2000-01-10, 2000-01-13]
12 2000-01-14  (2000-01-13, 2000-01-16]
13 2000-01-15  (2000-01-13, 2000-01-16]
14 2000-01-16  (2000-01-13, 2000-01-16]
In [60]: df.dtypes
Out[60]:
Date    datetime64[ns]
cat           category
dtype: object
Explanation:
df.Date.astype(np.int64)//10**9 - converts datetime values into UNIX epoch timestamps (the number of seconds since 1970-01-01 00:00:00):
In [65]: df.Date.astype(np.int64)//10**9
Out[65]:
0 946771200
1 946857600
2 946944000
3 947030400
4 947116800
5 947203200
6 947289600
7 947376000
8 947462400
9 947548800
10 947635200
11 947721600
12 947808000
13 947894400
14 947980800
Name: Date, dtype: int64
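To make the conversion concrete, the following small sketch round-trips two of the dates above through epoch seconds. It assumes the column has nanosecond resolution (datetime64[ns]), which is why the integer values are divided by 10**9; the variable names ts, secs and back are illustrative only.

```python
import numpy as np
import pandas as pd

# Force nanosecond resolution so that the int64 view counts nanoseconds
ts = pd.Series(pd.to_datetime(['2000-01-02', '2000-01-03'])).astype('datetime64[ns]')

# Nanoseconds since the epoch -> whole seconds since the epoch
secs = ts.astype(np.int64) // 10**9
print(secs.tolist())  # [946771200, 946857600]

# Converting back: interpret the integers as seconds since 1970-01-01
back = pd.to_datetime(secs, unit='s')
print(back.tolist())
```

Because the conversion is monotonic, comparing the integers gives the same ordering as comparing the original dates, which is all pd.cut needs.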
The same conversion is applied to bins:
In [66]: bins_dt.astype(np.int64)//10**9
Out[66]: Int64Index([946684800, 946944000, 947203200, 947462400, 947721600, 947980800], dtype='int64')
labels:
In [67]: labels
Out[67]:
['(2000-01-01, 2000-01-04]',
'(2000-01-04, 2000-01-07]',
'(2000-01-07, 2000-01-10]',
'(2000-01-10, 2000-01-13]',
'(2000-01-13, 2000-01-16]']
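The same idea can be applied to the report ranges from the question. The sketch below is one possible way to do it, not the answerer's code: it assumes pandas 0.20.1 or newer (so datetime bins can be passed directly), the transaction dates are made-up sample values, and the names ranges, starts, ends, edges and tx are hypothetical.

```python
import pandas as pd

# Report ranges exactly as they appear in the question
ranges = ['03-16-2017 to 03-22-2017',
          '03-23-2017 to 03-29-2017',
          '03-30-2017 to 04-05-2017']

# Split each label into its start and end date
starts = pd.to_datetime([r.split(' to ')[0] for r in ranges], format='%m-%d-%Y')
ends = pd.to_datetime([r.split(' to ')[1] for r in ranges], format='%m-%d-%Y')

# Bin edges: the day before the first start, then every cutoff (end) date.
# With right-closed bins (a, b], each cutoff date lands in its own report.
edges = pd.DatetimeIndex([starts[0] - pd.Timedelta(days=1)] + list(ends))

# Hypothetical transaction dates, for illustration only
tx = pd.DataFrame({'Processed_date': pd.to_datetime(
    ['3/27/2017', '3/29/2017', '3/30/2017', '4/2/2017'], format='%m/%d/%Y')})

# Use the original range strings as the bin labels
tx['bin'] = pd.cut(tx['Processed_date'], bins=edges, labels=ranges)
print(tx)
```

This relies on the ranges being contiguous and sorted. If there were gaps between reports, dates falling in a gap would be assigned to the following report rather than left unassigned.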