compare a datetime column only to time in pandas
Original question: http://stackoverflow.com/questions/45561213/
This page reproduces a popular Stack Overflow question under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep the same license and attribute it to the original authors on Stack Overflow.
Asked by pythonRcpp
I have a DataFrame like the one below:
col1, mydate
1, 25-DEC-2016 09:15:00
2, 25-DEC-2016 10:14:00
3, 25-DEC-2016 10:16:00
4, 25-DEC-2016 10:18:56
2, 25-DEC-2016 11:14:00
2, 25-DEC-2016 10:16:00
df.info(): mydate 323809 non-null object
I need to slice this dataframe according to time of day, e.g. the rows with a time earlier than 10:15:00, the rows with a time earlier than 11:15:00, and so on.
So I created my slice boundaries using:
times=[pd.to_datetime(i) for i in ['10:15:00','11:15:00','12:15:00','13:15:00','14:15:00','15:15:00', '15:30:00']]
Then I convert the mydate column to time values, which takes a long time:
df['mydate']=df.mydate.apply(lambda x: pd.to_datetime(x,infer_datetime_format=True).time())
I think the above command can be optimised, or there should be a better/faster way.
Then I simply do:
for time in times:
    slice = df[df.mydate<time.time()]
My intent is only to compare the time part of df.mydate (not the date) with ['10:15:00','11:15:00','12:15:00','13:15:00','14:15:00','15:15:00', '15:30:00'] and simply subset df.
The above approach works fine for me, but I am looking for a better way.
Additional: Interestingly, sorting by mydate was very fast (even though I had not converted the mydate column to datetime) using
df.sort_values(by='mydate')
which makes me think that my way of subsetting should be faster.
The mydate column will always be in the 25-DEC-2016 09:15:00 format (note DEC, not Dec). Can I use format='%d-%b-%Y %H:%M:%S'?
Accepted answer by jezrael
I believe timedelta is better for working in pandas, so first split the string column and select the time part for converting:
df['mydate'] = pd.to_timedelta(df['mydate'].str.split().str[1])
print (df)
col1 mydate
0 1 09:15:00
1 2 10:14:00
2 3 10:16:00
3 4 10:18:56
4 2 11:14:00
5 2 10:16:00
Convert the list too:
times=pd.to_timedelta(['10:15:00','11:15:00','12:15:00',
'13:15:00','14:15:00','15:15:00', '15:30:00'])
print (times)
TimedeltaIndex(['10:15:00', '11:15:00', '12:15:00', '13:15:00', '14:15:00',
'15:15:00', '15:30:00'],
dtype='timedelta64[ns]', freq=None)
Finally, create the slices:
for time in times:
    sl = df[df.mydate<time]
    print (sl)
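A self-contained sketch of the timedelta approach above, using the six sample rows from the question. The names cutoffs and counts are illustrative and not part of the original answer, and only three of the cut-off times are used to keep the example short.

```python
import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame({
    'col1': [1, 2, 3, 4, 2, 2],
    'mydate': ['25-DEC-2016 09:15:00', '25-DEC-2016 10:14:00',
               '25-DEC-2016 10:16:00', '25-DEC-2016 10:18:56',
               '25-DEC-2016 11:14:00', '25-DEC-2016 10:16:00'],
})

# Keep only the "HH:MM:SS" part and parse it as a timedelta (time since midnight).
df['mydate'] = pd.to_timedelta(df['mydate'].str.split().str[1])

# Cut-off times as timedeltas, so each comparison below is vectorised.
cutoffs = pd.to_timedelta(['10:15:00', '11:15:00', '12:15:00'])

# Number of rows strictly earlier than each cut-off.
counts = [len(df[df['mydate'] < cutoff]) for cutoff in cutoffs]
print(counts)
```

Because the date part is discarded before parsing, rows from different days compare purely on time of day.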
Answer by baldassarreFe
First of all, I suggest using pd.to_datetime on the whole array/Series, so it would be:
pd.to_datetime(['10:15:00','11:15:00','12:15:00','13:15:00']).time
Rather than:
[pd.to_datetime(i).time() for i in ['10:15:00','11:15:00','12:15:00','13:15:00']]
Secondly, you are right about the format. According to the documentation of pd.to_datetime, parsing against a known format can be much faster (roughly 5-10x in some cases), so prefer:
pd.to_datetime(['25-DEC-2016 09:15:00', '25-DEC-2016 09:15:00'],
format='%d-%b-%Y %H:%M:%S')
Rather than:
pd.to_datetime(['25-DEC-2016 09:15:00', '26-DEC-2016 09:15:00'],
infer_datetime_format=True)
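A minimal check that the explicit format also accepts the upper-case month abbreviation from the question (DEC rather than Dec). It assumes English month names in the active locale; raw and parsed are illustrative names.

```python
import pandas as pd

raw = pd.Series(['25-DEC-2016 09:15:00', '26-DEC-2016 10:16:00'])

# Explicit format: pandas does not have to guess the layout of each string.
# %b matches the abbreviated month name case-insensitively.
parsed = pd.to_datetime(raw, format='%d-%b-%Y %H:%M:%S')

# Time-of-day part only, as datetime.time objects.
print(parsed.dt.time.tolist())
```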
Now, considering your dataframe:
df = pd.DataFrame({'col1': [1, 2, 3, 2],
'mydate': ['25-DEC-2016 09:15:00',
'25-DEC-2016 11:15:00',
'26-DEC-2016 11:15:00',
'26-DEC-2016 12:15:00']})
>>>
col1 mydate
0 1 25-DEC-2016 09:15:00
1 2 25-DEC-2016 11:15:00
2 3 26-DEC-2016 11:15:00
3 2 26-DEC-2016 12:15:00
You can first transform the mydate column into an actual datetime Series:
df['mydate'] = pd.to_datetime(df.mydate, format='%d-%b-%Y %H:%M:%S')
Then you'll be able to access the date and time fields (and a lot more) through the dt accessor:
df.mydate.dt.date
>>>
0 2016-12-25
1 2016-12-25
2 2016-12-26
3 2016-12-26
df.mydate.dt.time
>>>
0 09:15:00
1 11:15:00
2 11:15:00
3 12:15:00
So when computing the slices you can use:
for time in times:
    slice = df[df.mydate.dt.time < time]
    print(time, slice, sep='\n')
>>>
10:15:00
col1 mydate
0 1 2016-12-25 09:15:00
11:15:00
col1 mydate
0 1 2016-12-25 09:15:00
12:15:00
col1 mydate
0 1 2016-12-25 09:15:00
1 2 2016-12-25 11:15:00
2 3 2016-12-26 11:15:00
13:15:00
col1 mydate
0 1 2016-12-25 09:15:00
1 2 2016-12-25 11:15:00
2 3 2016-12-26 11:15:00
3 2 2016-12-26 12:15:00
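One possible variation, not part of the original answer: dt.time produces an object column of datetime.time values, so if that comparison turns out to be slow on a large frame, the time of day can instead be computed as a timedelta since midnight. The sketch assumes the four-row example frame; time_of_day, cutoffs and sizes are hypothetical names.

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3, 2],
                   'mydate': ['25-DEC-2016 09:15:00', '25-DEC-2016 11:15:00',
                              '26-DEC-2016 11:15:00', '26-DEC-2016 12:15:00']})
df['mydate'] = pd.to_datetime(df['mydate'], format='%d-%b-%Y %H:%M:%S')

# Subtracting midnight of the same day leaves the time of day as a timedelta.
time_of_day = df['mydate'] - df['mydate'].dt.normalize()

cutoffs = pd.to_timedelta(['10:15:00', '11:15:00', '12:15:00', '13:15:00'])

# Rows strictly earlier than each cut-off, regardless of the date.
sizes = [len(df[time_of_day < cutoff]) for cutoff in cutoffs]
print(sizes)
```

The row counts match the printed slices above: one row before 10:15 and before 11:15, three before 12:15, and all four before 13:15.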
Note that what you get are not actually disjoint slices, because consecutive results share records, so you might want to use something similar to:
for start, end in zip(times, times[1:]):
    slice = df[(start <= df.mydate.dt.time) & (df.mydate.dt.time <= end)]
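A sketch of the interval idea with half-open bounds (start included, end excluded), so that a row sitting exactly on a boundary is assigned to only one slice. This is a small variation on the snippet above, not the answer's exact code; the boundaries are written as datetime.time literals, and slices is an illustrative name.

```python
import datetime
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3, 2],
                   'mydate': ['25-DEC-2016 09:15:00', '25-DEC-2016 11:15:00',
                              '26-DEC-2016 11:15:00', '26-DEC-2016 12:15:00']})
df['mydate'] = pd.to_datetime(df['mydate'], format='%d-%b-%Y %H:%M:%S')

times = [datetime.time(10, 15), datetime.time(11, 15),
         datetime.time(12, 15), datetime.time(13, 15)]

# Time-of-day column, computed once instead of on every iteration.
t = df['mydate'].dt.time

slices = []
for start, end in zip(times, times[1:]):
    # start is included and end is excluded, so no row lands in two slices.
    slices.append(df[(t >= start) & (t < end)])

print([len(s) for s in slices])
```

Note that the 09:15:00 row is earlier than the first boundary, so it does not appear in any of these slices.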
As a final note, what you are trying to accomplish with the for loop can be obtained using the group by operations from pandas. You just need to prepare a mytime column with the times only:
df['mytime'] = df.mydate.dt.time
groups = df.groupby('mytime')
for group_key, group_df in groups:
    print(group_key, group_df, sep='\n')
>>>
09:15:00
col1 mydate mytime
0 1 2016-12-25 09:15:00 09:15:00
11:15:00
col1 mydate mytime
1 2 2016-12-25 11:15:00 11:15:00
2 3 2016-12-26 11:15:00 11:15:00
12:15:00
col1 mydate mytime
3 2 2016-12-26 12:15:00 12:15:00
The nice thing is that you don't need to operate on each dataframe separately; you can apply the same operations and aggregations to every group at once:
groups.size()
>>>
mytime
09:15:00 1
11:15:00 2
12:15:00 1
groups.sum()
>>>
col1
mytime
09:15:00 1
11:15:00 5
12:15:00 2
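Grouping on the exact time gives one group per distinct time value. To group by the time intervals from the question instead, one option (a swapped-in technique, not something the answer proposes) is to bucket the seconds since midnight with pd.cut and group on the bucket. The edge values, label strings and the bucket column name below are assumptions made for illustration.

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3, 2],
                   'mydate': ['25-DEC-2016 09:15:00', '25-DEC-2016 11:15:00',
                              '26-DEC-2016 11:15:00', '26-DEC-2016 12:15:00']})
df['mydate'] = pd.to_datetime(df['mydate'], format='%d-%b-%Y %H:%M:%S')

# Seconds since midnight, as plain integers.
secs = (df['mydate'].dt.hour * 3600
        + df['mydate'].dt.minute * 60
        + df['mydate'].dt.second)

# Bucket edges in seconds: midnight, 10:15, 11:15, 12:15, 13:15.
edges = [0, 36900, 40500, 44100, 47700]
labels = ['<10:15', '10:15-11:15', '11:15-12:15', '12:15-13:15']

# right=False makes each bucket include its left edge and exclude its right edge.
df['bucket'] = pd.cut(secs, bins=edges, labels=labels, right=False)

# observed=False keeps empty buckets in the result with a count of 0.
counts = df.groupby('bucket', observed=False).size()
print(counts)
```

Unlike the cumulative "earlier than" slices, these buckets are disjoint, so each row is counted exactly once.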