Pandas pd.cut() - binning a datetime column / series

Question

Asked by Arthur D. Howland

I am trying to bin dates using pd.cut(), but the setup is fairly elaborate.

A colleague sends me multiple files with report dates such as:

 '03-16-2017 to 03-22-2017'
 '03-23-2017 to 03-29-2017'
 '03-30-2017 to 04-05-2017'

They are all combined into a single DataFrame with a column, df['Filedate'], so that every record in the file has the correct file date.

The last day is a cutoff point, so I created a new column, df['Filedate_bin'], which holds the last day (3/22/2017, 3/29/2017, 4/05/2017) as a string.

Then I created a list: Filedate_bin_list = df.Filedate_bin.unique(). As a result I have a unique list of string cutoff dates that I would like to use as bins.
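For illustration, here is a minimal sketch of how such cutoffs could be derived. It departs from the question in one respect, which is an assumption on my part: the cutoffs are kept as real datetimes rather than strings, since string bin edges are what trigger the first TypeError. The sample rows and the name cutoffs are hypothetical.

```python
import pandas as pd

# Hypothetical sample mirroring the report-date strings in the question.
df = pd.DataFrame({
    "Filedate": [
        "03-16-2017 to 03-22-2017",
        "03-23-2017 to 03-29-2017",
        "03-23-2017 to 03-29-2017",
        "03-30-2017 to 04-05-2017",
    ]
})

# Take the text after " to " (the last day) and parse it as a datetime.
last_day = df["Filedate"].str.split(" to ").str[-1]
df["Filedate_bin"] = pd.to_datetime(last_day, format="%m-%d-%Y")

# De-duplicated, sorted cutoffs; pd.cut needs increasing bin edges.
cutoffs = pd.DatetimeIndex(df["Filedate_bin"].unique()).sort_values()
```

Sorting matters because unique() returns values in order of first appearance, which is only increasing if the files happened to be loaded chronologically.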

After importing different data into a DataFrame, there is a column of transaction dates: 3/28/2017, 3/29/2017, 3/30/2017, 4/1/2017, 4/2/2017, etc. Assigning them to a bin is difficult. I tried:

df['bin'] = pd.cut(df.Processed_date, Filedate_bin_list)

Received TypeError: unsupported operand type for -: 'str' and 'str'

I went back and tried converting Filedate_bin to datetime with format='%m/%d/%Y', and got:

TypeError: Cannot cast ufunc less input from dtype('<m8[ns]') to dtype ('<m8') with casting rule 'same_kind'.

Is there a better way to bin my Processed_date values into text bins?

I am trying to tie a processed date such as 3/27/2017 to '03-23-2017 to 03-29-2017'.

Answer 1

Answered by MaxU

UPDATE: starting from Pandas v0.20.1 (May 5, 2017), pd.cut and pd.qcut support datetime64 and timedelta64 dtypes (GH14714, GH14798).

Thanks @lighthouse65 for checking this!
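With datetime support in pd.cut, the question's goal can probably be met directly. The following is a sketch, not a definitive solution, and it rests on some assumptions: the processed dates carry no time-of-day component (a timestamp later than midnight on a cutoff day would fall into the next bin), the first bin edge is taken as the day before the first period starts, and the sample values and the names periods and edges are hypothetical.

```python
import pandas as pd

# Report periods from the question; each original string doubles as a label.
periods = [
    "03-16-2017 to 03-22-2017",
    "03-23-2017 to 03-29-2017",
    "03-30-2017 to 04-05-2017",
]

starts = pd.to_datetime([p.split(" to ")[0] for p in periods], format="%m-%d-%Y")
ends = pd.to_datetime([p.split(" to ")[1] for p in periods], format="%m-%d-%Y")

# Edges: the day before the first period, then each period's last day.
# pd.cut bins are right-closed, (a, b], so each cutoff day stays in its own period.
edges = pd.DatetimeIndex([starts[0] - pd.Timedelta(days=1), *ends])

df = pd.DataFrame({
    "Processed_date": pd.to_datetime(
        ["3/27/2017", "3/29/2017", "3/30/2017", "4/2/2017"], format="%m/%d/%Y"
    )
})

# One label per bin: len(periods) == len(edges) - 1.
df["bin"] = pd.cut(df["Processed_date"], bins=edges, labels=periods)
```

Here 3/27/2017 and 3/29/2017 land in '03-23-2017 to 03-29-2017', while 3/30/2017 and 4/2/2017 land in '03-30-2017 to 04-05-2017'. Dates outside all edges get NaN.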

Old answer:

Consider this approach:

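import numpy as np
import pandas as pd

df = pd.DataFrame(pd.date_range('2000-01-02', freq='1D', periods=15), columns=['Date'])

# Bin edges as datetimes, plus their string form for building the labels
bins_dt = pd.date_range('2000-01-01', freq='3D', periods=6)
bins_str = bins_dt.astype(str).values

labels = ['({}, {}]'.format(bins_str[i-1], bins_str[i]) for i in range(1, len(bins_str))]

# Cut on integer epoch seconds; the //10**9 step assumes datetime64[ns] storage
df['cat'] = pd.cut(df.Date.astype(np.int64)//10**9,
                   bins=bins_dt.astype(np.int64)//10**9,
                   labels=labels)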

Result:

In [59]: df
Out[59]:
         Date                       cat
0  2000-01-02  (2000-01-01, 2000-01-04]
1  2000-01-03  (2000-01-01, 2000-01-04]
2  2000-01-04  (2000-01-01, 2000-01-04]
3  2000-01-05  (2000-01-04, 2000-01-07]
4  2000-01-06  (2000-01-04, 2000-01-07]
5  2000-01-07  (2000-01-04, 2000-01-07]
6  2000-01-08  (2000-01-07, 2000-01-10]
7  2000-01-09  (2000-01-07, 2000-01-10]
8  2000-01-10  (2000-01-07, 2000-01-10]
9  2000-01-11  (2000-01-10, 2000-01-13]
10 2000-01-12  (2000-01-10, 2000-01-13]
11 2000-01-13  (2000-01-10, 2000-01-13]
12 2000-01-14  (2000-01-13, 2000-01-16]
13 2000-01-15  (2000-01-13, 2000-01-16]
14 2000-01-16  (2000-01-13, 2000-01-16]

In [60]: df.dtypes
Out[60]:
Date    datetime64[ns]
cat           category
dtype: object

Explanation:

df.Date.astype(np.int64)//10**9 converts datetime values into UNIX epoch time (a timestamp: the number of seconds since 1970-01-01 00:00:00):

In [65]: df.Date.astype(np.int64)//10**9
Out[65]:
0     946771200
1     946857600
2     946944000
3     947030400
4     947116800
5     947203200
6     947289600
7     947376000
8     947462400
9     947548800
10    947635200
11    947721600
12    947808000
13    947894400
14    947980800
Name: Date, dtype: int64
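As a sanity check on that conversion, the round trip can be sketched as follows. The explicit cast to datetime64[ns] is my own addition: dividing by 10**9 only yields seconds when the underlying integers are nanoseconds, so the sketch pins that down rather than relying on the default resolution.

```python
import numpy as np
import pandas as pd

dates = pd.Series(pd.date_range("2000-01-02", freq="1D", periods=3))

# Force nanosecond storage, then integer-divide nanoseconds down to seconds.
seconds = dates.astype("datetime64[ns]").astype(np.int64) // 10**9

# Converting back with unit="s" should recover the original dates.
restored = pd.to_datetime(seconds, unit="s")
```

The first value of seconds is 946771200, matching the output above, and restored equals dates element by element.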

The same conversion is applied to the bins:

In [66]: bins_dt.astype(np.int64)//10**9
Out[66]: Int64Index([946684800, 946944000, 947203200, 947462400, 947721600, 947980800], dtype='int64')

labels:

In [67]: labels
Out[67]:
['(2000-01-01, 2000-01-04]',
 '(2000-01-04, 2000-01-07]',
 '(2000-01-07, 2000-01-10]',
 '(2000-01-10, 2000-01-13]',
 '(2000-01-13, 2000-01-16]']
