Original question: http://stackoverflow.com/questions/13478597/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Arithmetic on date series (not an index) in Pandas
Asked by ChrisArmstrong
(Python 2.7, Pandas 0.9)
This seems like a simple thing to do, but I can't figure out how to calculate the difference between two date columns in a dataframe using Pandas. This dataframe already has an index, so making either column into a DatetimeIndex is not desirable.
To convert each date column from strings I used:
data.Date_Column = pd.to_datetime(data.Date_Column)
From there, to get elapsed time between 2 columns, I do:
data.Closed_Date - data.Created_Date
which returns an error:
TypeError: %d format: a number is required, not a numpy.timedelta64
Checking dtypes on both columns yields datetime64[ns], and the individual dates in the array are of type Timestamp.
What am I missing?
EDIT:
Here's an example where I can create separate DatetimeIndex objects and accomplish what I want, but when I try to do it in the context of a dataframe, it fails.
Created_Date = pd.DatetimeIndex(data['Created_Date'], copy=True)
Closed_Date = pd.DatetimeIndex(data['Closed_Date'], copy=True)
Closed_Date.day - Created_Date.day
[Out] array([ -3, -16, 5, ..., 0, 0, 0])
Now the same but in a dataframe:
data.Created_Date = pd.DatetimeIndex(data['Created_Date'], copy=True)
data.Closed_Date = pd.DatetimeIndex(data.Closed_Date, copy=True)
data.Created_Date.day - data.Created_Date.day
AttributeError: 'Series' object has no attribute 'day'
Here's some of the data if you want to play around with it:
data['Created Date'][0:10].to_dict()
{0: '1/1/2009 0:00',
1: '1/1/2009 0:00',
2: '1/1/2009 0:00',
3: '1/1/2009 0:00',
4: '1/1/2009 0:00',
5: '1/1/2009 0:00',
6: '1/1/2009 0:00',
7: '1/1/2009 0:00',
8: '1/1/2009 0:00',
9: '1/1/2009 0:00'}
data['Closed Date'][0:10].to_dict()
{0: '1/7/2009 0:00',
1: nan,
2: '1/1/2009 0:00',
3: '1/1/2009 0:00',
4: '1/1/2009 0:00',
5: '1/12/2009 0:00',
6: '1/12/2009 0:00',
7: '1/7/2009 0:00',
8: '1/10/2009 0:00',
9: '1/7/2009 0:00'}
Answered by Andy Hayden
Update: A useful workaround is to pass the column to the DatetimeIndex constructor (which is usually much faster than an apply), for example:
pd.DatetimeIndex(df['Created_Date']).day
In 0.15 this will be available in the dt attribute (along with other datetime methods):
df['Created_Date'].dt.day
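For readers on pandas 0.15 or later, here is a minimal sketch of the dt accessor applied to both columns. The sample frame is hypothetical and only modelled on the data in the question; the explicit format string is an assumption about how those strings are laid out.

```python
import pandas as pd

# Hypothetical sample data, modelled on the strings shown in the question.
df = pd.DataFrame({
    "Created_Date": ["1/1/2009 0:00", "1/1/2009 0:00", "1/1/2009 0:00"],
    "Closed_Date": ["1/7/2009 0:00", None, "1/12/2009 0:00"],
})

# Parse the strings; missing values become NaT.
for col in ["Created_Date", "Closed_Date"]:
    df[col] = pd.to_datetime(df[col], format="%m/%d/%Y %H:%M")

# .dt exposes datetime fields on a Series, so no DatetimeIndex is needed.
# Note this is a day-of-month difference, not elapsed time.
day_diff = df["Closed_Date"].dt.day - df["Created_Date"].dt.day
print(day_diff)
```

Rows with a missing date come out as NaN rather than raising, so they can be filtered afterwards with notna().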
The problem was the syntax: one might hope the following would work, but it doesn't, because a Series has no day attribute:
data.Created_Date.day - data.Created_Date.day
AttributeError: 'Series' object has no attribute 'day'
With more complicated selections like this one you can use apply:
In [111]: df['sub'] = df.apply(lambda x: x['Created_Date'].day - x['Closed_Date'].day, axis=1)
In [112]: df[['Created_Date','Closed_Date','sub']]
Out[112]:
          Created_Date          Closed_Date  sub
0  2009-01-07 00:00:00  2009-01-01 00:00:00    6
1                  NaT  2009-01-01 00:00:00    9
2  2009-01-01 00:00:00  2009-01-01 00:00:00    0
3  2009-01-01 00:00:00  2009-01-01 00:00:00    0
4  2009-01-01 00:00:00  2009-01-01 00:00:00    0
5  2009-01-12 00:00:00  2009-01-01 00:00:00   11
6  2009-01-12 00:00:00  2009-01-01 00:00:00   11
7  2009-01-07 00:00:00  2009-01-01 00:00:00    6
8  2009-01-10 00:00:00  2009-01-01 00:00:00    9
9  2009-01-07 00:00:00  2009-01-01 00:00:00    6
Be wary: you probably ought to handle these NaTs separately:
In [114]: df.ix[1][1].day # NaT.day
Out[114]: -1
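One way to keep NaT rows out of the calculation is to mask them before calling apply. This is only a sketch on a hypothetical three-row frame, not the only way to do it; the column names follow the example above.

```python
import pandas as pd

# Hypothetical frame with one missing Created_Date.
df = pd.DataFrame({
    "Created_Date": pd.to_datetime(["2009-01-07", None, "2009-01-12"]),
    "Closed_Date": pd.to_datetime(["2009-01-01", "2009-01-01", "2009-01-01"]),
})

# True only for rows where both dates are present.
complete = df["Created_Date"].notna() & df["Closed_Date"].notna()

# Apply the row-wise calculation to complete rows only. Assignment aligns
# on the index, so the skipped rows end up as NaN in the new column.
df["sub"] = df.loc[complete].apply(
    lambda row: row["Created_Date"].day - row["Closed_Date"].day, axis=1
)
print(df)
```

The incomplete rows stay visibly missing instead of carrying a number that looks plausible.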
Note: there is similarly strange behaviour when using .days on a timedelta involving NaT:
In [115]: df['sub2'] = df.apply(lambda x: (x['a'] - x['b']).days, axis=1)
In [116]: df['sub2'][1]
Out[116]: 92505
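On pandas 0.15 or later, subtracting the two columns directly and reading .dt.days avoids the row-wise apply. The sketch below uses hypothetical Series named created and closed; it is an illustration under those assumptions rather than part of the original answer.

```python
import pandas as pd

# Hypothetical columns; one Closed date is missing.
created = pd.Series(pd.to_datetime(["2009-01-01", "2009-01-01", "2009-01-01"]))
closed = pd.Series(pd.to_datetime(["2009-01-07", None, "2009-01-12"]))

# Element-wise subtraction gives a timedelta64[ns] Series,
# with NaT wherever either date is missing.
elapsed = closed - created

# .dt.days extracts whole days; NaT becomes NaN.
elapsed_days = elapsed.dt.days
print(elapsed_days)
```

Here the missing row shows up as NaN, which is easy to drop or fill explicitly.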

