pandas: compare only the time part of a datetime column

Question

Asked by pythonRcpp

I have a DataFrame like the one below:

col1, mydate
1, 25-DEC-2016 09:15:00
2, 25-DEC-2016 10:14:00
3, 25-DEC-2016 10:16:00
4, 25-DEC-2016 10:18:56
2, 25-DEC-2016 11:14:00
2, 25-DEC-2016 10:16:00

df.info(): mydate    323809 non-null object

I need to subset this DataFrame by time of day, e.g. the rows with a time earlier than 10:15:00, the rows with a time earlier than 11:15:00, and so on.

So I created my slice boundaries using:

times=[pd.to_datetime(i) for i in ['10:15:00','11:15:00','12:15:00','13:15:00','14:15:00','15:15:00', '15:30:00']]

Then I convert the mydate column to time objects, which takes a long time:

df['mydate']=df.mydate.apply(lambda x: pd.to_datetime(x,infer_datetime_format=True).time())

I think the command above can be optimised, or there should be a better/faster way.

Then I simply do:

for time in times:
  slice = df[df.mydate<time.time()]

My intent is only to compare the time part of df.mydate with ['10:15:00','11:15:00','12:15:00','13:15:00','14:15:00','15:15:00', '15:30:00'] (ignoring the dates) and subset df accordingly.

The approach above works fine for me, but I am looking for a better way.

Additional: interestingly, sorting by mydate was very fast (even though I had not converted the mydate column to datetime) using:

df.sort_values(by='mydate')

which makes me think that my way of subsetting could be made faster too.

The mydate column will always be in the 25-DEC-2016 09:15:00 format (note DEC, not Dec). Can I use format='%d-%b-%Y %H:%M:%S'?

Answer 1

Accepted answer by jezrael

I believe timedelta is better suited to this kind of work in pandas, so first split the string column and select the time part for conversion:

df['mydate'] = pd.to_timedelta(df['mydate'].str.split().str[1])
print (df)
   col1   mydate
0     1 09:15:00
1     2 10:14:00
2     3 10:16:00
3     4 10:18:56
4     2 11:14:00
5     2 10:16:00

Convert the list too:

times=pd.to_timedelta(['10:15:00','11:15:00','12:15:00',
                       '13:15:00','14:15:00','15:15:00', '15:30:00'])
print (times)
TimedeltaIndex(['10:15:00', '11:15:00', '12:15:00', '13:15:00', '14:15:00',
                '15:15:00', '15:30:00'],
               dtype='timedelta64[ns]', freq=None)

Finally, create the slices:

for time in times:
  sl = df[df.mydate<time]
  print (sl)
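
If the column has already been parsed to datetime64 instead of being kept as strings, a comparable time-of-day timedelta can probably be obtained by subtracting midnight of the same day. This is only a minimal sketch on a small made-up frame; the column name tod and the variable names are hypothetical and not part of the original answer:

```python
import pandas as pd

# Small made-up frame in the same format as the question's data.
df = pd.DataFrame({
    'col1': [1, 2, 3],
    'mydate': pd.to_datetime(
        ['25-DEC-2016 09:15:00', '25-DEC-2016 10:14:00', '25-DEC-2016 10:16:00'],
        format='%d-%b-%Y %H:%M:%S'),
})

# normalize() sets each timestamp to midnight, so the difference
# is the time of day expressed as a timedelta ('tod' is a hypothetical name).
df['tod'] = df['mydate'] - df['mydate'].dt.normalize()

cutoff = pd.Timedelta('10:15:00')
early = df[df['tod'] < cutoff]  # vectorized comparison; the date part plays no role
print(early)
```

Both sides of the comparison are timedeltas here, so no Python-level loop over rows is involved.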

Answer 2

Answered by baldassarreFe

First of all, I suggest calling pd.to_datetime on the whole array/Series, so it would be:

pd.to_datetime(['10:15:00','11:15:00','12:15:00','13:15:00']).time

Rather than:

[pd.to_datetime(i).time() for i in ['10:15:00','11:15:00','12:15:00','13:15:00']]

Secondly, you are right about the format. As stated in the documentation of pd.to_datetime, it is much faster (roughly 5-10x) to use

pd.to_datetime(['25-DEC-2016 09:15:00', '25-DEC-2016 09:15:00'],
               format='%d-%b-%Y %H:%M:%S')

Rather than:

pd.to_datetime(['25-DEC-2016 09:15:00', '26-DEC-2016 09:15:00'], 
               infer_datetime_format=True)
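
As a quick check of the question's concern about DEC versus Dec, the sketch below parses upper-case month abbreviations with an explicit format. It assumes an English locale (the meaning of %b depends on the locale), and the variable names raw and parsed are hypothetical:

```python
import pandas as pd

# Upper-case month abbreviations, as in the question's data.
raw = pd.Series(['25-DEC-2016 09:15:00', '26-DEC-2016 10:18:56'])

# Passing an explicit format avoids guessing the layout of the strings.
parsed = pd.to_datetime(raw, format='%d-%b-%Y %H:%M:%S')

print(parsed.dt.month.tolist())
print(parsed.dt.time.tolist())
```

If the strings parse without an error and the months come out as December, the upper-case abbreviation is being accepted by %b in this setup.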

Now consider your DataFrame:

df = pd.DataFrame({'col1': [1, 2, 3, 2], 
                   'mydate': ['25-DEC-2016 09:15:00',
                              '25-DEC-2016 11:15:00', 
                              '26-DEC-2016 11:15:00', 
                              '26-DEC-2016 12:15:00']})
>>>
   col1                mydate
0     1  25-DEC-2016 09:15:00
1     2  25-DEC-2016 11:15:00
2     3  26-DEC-2016 11:15:00
3     2  26-DEC-2016 12:15:00

You can first convert the mydate column into an actual datetime Series:

df['mydate'] = pd.to_datetime(df.mydate, format='%d-%b-%Y %H:%M:%S')

Then you'll be able to access the date and time fields (and a lot more) through the dt accessor:

df.mydate.dt.date
>>>
0    2016-12-25
1    2016-12-25
2    2016-12-26
3    2016-12-26

df.mydate.dt.time
>>>
0    09:15:00
1    11:15:00
2    11:15:00
3    12:15:00

So when computing the slices you can use:

for time in times:
    slice = df[df.mydate.dt.time < time]
    print(time, slice, sep='\n')
>>>
10:15:00
   col1              mydate
0     1 2016-12-25 09:15:00
11:15:00
   col1              mydate
0     1 2016-12-25 09:15:00
12:15:00
   col1              mydate
0     1 2016-12-25 09:15:00
1     2 2016-12-25 11:15:00
2     3 2016-12-26 11:15:00
13:15:00
   col1              mydate
0     1 2016-12-25 09:15:00
1     2 2016-12-25 11:15:00
2     3 2016-12-26 11:15:00
3     2 2016-12-26 12:15:00

Note that what you get are not really slices, because the records overlap, so you might want to use something like:

for start, end in zip(times, times[1:]):
    # half-open interval [start, end) so that consecutive slices do not share boundary rows
    slice = df[(start <= df.mydate.dt.time) & (df.mydate.dt.time < end)]
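
Another option for selecting a single time-of-day window may be DataFrame.between_time, which needs a DatetimeIndex. The sketch below is not part of the original answer; it rebuilds the small example frame, and the names indexed and window are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({
    'col1': [1, 2, 3, 2],
    'mydate': pd.to_datetime(
        ['25-DEC-2016 09:15:00', '25-DEC-2016 11:15:00',
         '26-DEC-2016 11:15:00', '26-DEC-2016 12:15:00'],
        format='%d-%b-%Y %H:%M:%S'),
})

# between_time operates on a DatetimeIndex, so index by the datetime column first.
indexed = df.set_index('mydate')

# Both endpoints are included by default; the date part is ignored.
window = indexed.between_time('10:15:00', '11:15:00')
print(window)
```

Because both endpoints are included by default, consecutive windows built this way would share their boundary rows unless the endpoints are adjusted.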

As a final note, what you are trying to accomplish with the for loop can be obtained using the groupby operations from pandas. You just need to prepare a mytime column with the times only:

df['mytime'] = df.mydate.dt.time
groups = df.groupby('mytime')

for group_key, group_df in groups:
    print(group_key, group_df, sep='\n')
>>>
09:15:00
   col1              mydate    mytime
0     1 2016-12-25 09:15:00  09:15:00
11:15:00
   col1              mydate    mytime
1     2 2016-12-25 11:15:00  11:15:00
2     3 2016-12-26 11:15:00  11:15:00
12:15:00
   col1              mydate    mytime
3     2 2016-12-26 12:15:00  12:15:00

The nice thing is that you don't need to operate on each DataFrame separately; you can apply the same operations and aggregations to every group at once:

groups.size()
>>>
mytime
09:15:00    1
11:15:00    2
12:15:00    1

# numeric_only=True leaves out the non-numeric mydate column
groups.sum(numeric_only=True)
>>>
          col1
mytime        
09:15:00     1
11:15:00     5
12:15:00     2
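
The groupby above groups on exact time values. To group on intervals like the ones in the question, one possible sketch is to turn the time of day into seconds since midnight and bin it with pd.cut. The edges, labels, and the column names secs and bin below are illustrative assumptions, not part of the original answer:

```python
import pandas as pd

df = pd.DataFrame({
    'col1': [1, 2, 3, 2],
    'mydate': pd.to_datetime(
        ['25-DEC-2016 09:15:00', '25-DEC-2016 11:15:00',
         '26-DEC-2016 11:15:00', '26-DEC-2016 12:15:00'],
        format='%d-%b-%Y %H:%M:%S'),
})

# Time of day as seconds since midnight (a plain numeric column).
df['secs'] = (df['mydate'] - df['mydate'].dt.normalize()).dt.total_seconds()

# Illustrative bin edges, also expressed in seconds since midnight.
edges = pd.to_timedelta(
    ['00:00:00', '10:15:00', '11:15:00', '12:15:00', '13:15:00']
).total_seconds().tolist()
labels = ['before 10:15', '10:15-11:15', '11:15-12:15', '12:15-13:15']

# right=False makes each bin half-open, [start, end), so bins do not overlap.
df['bin'] = pd.cut(df['secs'], bins=edges, labels=labels, right=False)

sizes = df.groupby('bin', observed=True).size()
print(sizes)
```

Each row lands in exactly one bin, so the groups are disjoint, unlike the cumulative "earlier than" subsets produced by the loop.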
