Fastest way to calculate the average of datetime rows using Pandas

Question

Asked by Amit Tripathi

I have 122,864 rows of data stored in an HDF5 file, and I am using pandas for data processing. Each record has a unique id and an associated timestamp indicating when the user opened the app. I want to get the average duration between two consecutive hits of the app for each id.

1283    2015-04-01 08:07:44.131768
1284    2015-04-01 08:08:02.752611
1285    2015-04-01 08:08:02.793380
1286    2015-04-01 08:07:53.910469
1287    2015-04-01 08:08:03.305893
1288    2015-04-01 08:07:44.843050
1289    2015-04-01 08:07:54.767203
1290    2015-04-01 08:08:03.965367
1291    2015-04-01 08:07:45.924854
1292    2015-04-01 08:07:55.408593
1293    2015-04-01 08:07:46.365128

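import os
import pprint as pp
import sys

import pandas as pd

# Maps each device id to its average gap (in seconds) between hits.
avgtime = {}

class User(object):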

    ''' 
    Properties and function related to each object.

    attributes:

        datetime: a list of hit timestamp for each user object
        deviceid: unique deviceid
    '''

    def __init__(self, User, device_id):
        self.datetime = pd.to_datetime(list(User['datetime']))
        self.deviceid = device_id
        self.avrgtime = 0.0
        avgtime.setdefault(self.deviceid, 1)

    def avg_duration(self):

        '''
        average duration b/w hits for each user.
        '''

        for i, time in enumerate(self.datetime[:-1]):
            self.avrgtime += abs(self.datetime[i+1] - time).total_seconds()
        avgtime[self.deviceid] = self.avrgtime/len(self.datetime)
        pp.pprint(avgtime)


def eachdevice(gstore):
    count = 0
    for did in list(gstore['data'].drop_duplicates('device_id')['device_id']):
        # One HDF5 query per device id.
        auser = gstore.select('data', where="device_id == did")
        gamer = User(auser, did)
        gamer.avg_duration()
        count += 1
        print(count)



# main workhorse
if __name__ == '__main__':

    try:
        path = os.path.abspath(sys.argv[1])
        with pd.HDFStore(path) as gstore:
            eachdevice(gstore)

    except IndexError:
        print('\nPass path of the HDF5 file to be analyzed...EXITING\n')

So far, I loop through each unique id and use an HDFStore select query to fetch the datetimes for that id. This returns a DataFrame of datetime objects. I convert it to a list and then loop over it to calculate the average difference between consecutive timestamps. This approach takes a lot of time. Is there a way to do this using pandas?

Please help.

EDIT: After commenting out all the calculation code, I ran the script again. I think the line auser = gstore.select('data', where="device_id == did") is taking all the time. How can I improve this? Is there an alternative or better way? %timeit result: 1 loops, best of 3: 13.3 s per loop for 1000 iterations.

Edit: Sample data:

                           device_id                    datetime
0   c4be7e55d98914647c51329edc2ab734  2015-03-30 22:00:05.922317
1   05fed9f8e07c3cac457723286d36f621  2015-03-30 22:00:07.895672
2   783faeed9fe35a3f45b521b3a6667a2d  2015-03-30 22:00:05.529631
3   c2022ad838cec35bdb12fc3a6e2cf452  2015-03-30 21:59:59.043905
4   a8a04268ee0c22b26af59e053390cf6f  2015-03-30 22:00:14.248542
5   4e5ed16b44b9cd38c408859d1d241e2d  2015-03-30 22:00:02.391719
6   c0bfd3f9046855ffaaec4d99c367fd8c  2015-03-30 22:00:18.649193
7   95f1182c6e4d601ba0b20f5204168ecb  2015-03-30 22:00:13.629728
8   a85caa7e0a4a7d57e6330c083daff326  2015-03-30 22:00:08.340469
9   46cdbee963814cdb4e6a0ac0049b8fc6  2015-03-30 22:00:23.152820
10  3c8bf70679cd9c6f18aa52d06e0e181d  2015-03-30 22:00:17.619251
11  52bc4e3d9dc373d89ec31effe10e6f30  2015-03-30 22:00:11.591954
12  3477eb25e26b6bff0bfc6c3ee59a5f40  2015-03-30 22:00:25.745083
13  e7bf8ae864f2148831628a6f2e8e406e  2015-03-30 22:00:20.911568
14  a15af8faffd655a3e80f85840bbf3c2a  2015-03-30 22:00:19.017887
15  9d9f71f080c0cf478ec4117e78ff89ee  2015-03-30 22:00:28.435585
16  1633d88738316e3602890499b1f778b1  2015-03-30 22:00:24.108234
17  3362daf99f11541acbf45e70fdaf5f49  2015-03-30 22:00:24.512366
18  96c3c005eaaaa8d6af3f2443ca8f73df  2015-03-30 22:00:29.713550
19  002642b9ed495f84318fcb42557f53e1  2015-03-30 22:00:37.936647

Answer 1

Answered by S Anand

Let's create a dummy dataset with 150,000 rows similar to yours.

>>> import numpy as np
>>> import pandas as pd
>>> data = pd.DataFrame({
...     'datetime': pd.to_datetime(np.random.randint(1429449000, 1429649000, 150000), unit='s'),
...     'device_id': np.random.randint(0, 100, 150000),
... }).sort_values('datetime')
>>> data.head()
                  datetime  device_id
113719 2015-04-19 13:10:00         34
120323 2015-04-19 13:10:01         22
91342  2015-04-19 13:10:04          9
61170  2015-04-19 13:10:08         27
103748 2015-04-19 13:10:11         65

You can use .groupby to pre-compute groups. This lets you easily identify all datetimes for a given device_id.

>>> groups = data.groupby('device_id')
>>> data.loc[groups.groups[34]].head()   # Get the data for device_id = 34
                  datetime  device_id
113719 2015-04-19 13:10:00         34
105761 2015-04-19 13:11:30         34
85903  2015-04-19 13:18:40         34
36395  2015-04-19 13:19:55         34
108850 2015-04-19 13:20:06         34

From here, it is quick to compute the average difference between consecutive timestamps.

>>> def mean_diff(device_id):
...     return data['datetime'][groups.groups.get(device_id)].diff().mean()
...
>>> mean_diff(34)
Timedelta('0 days 00:02:14.470746')

Since .groupby pre-computes the groups, every successive lookup is quite fast. This step takes about 2 milliseconds on the 150,000 rows.

In [68]: %timeit mean_diff(34)
100 loops, best of 3: 2.03 ms per loop

You can also compute this for every device_id at once, like this:

>>> time_diff = groups.apply(lambda df: df.datetime.diff().mean())
>>> time_diff.head()
device_id
0   00:02:12.871504
1   00:02:10.464099
2   00:02:09.550000
3   00:02:15.845003
4   00:02:14.642375
dtype: timedelta64[ns]

This is pretty fast too. For these 150,000 rows, it takes under 50 ms. Of course, your mileage may vary.

In [79]: %timeit groups.apply(lambda df: df.datetime.diff().mean())
10 loops, best of 3: 46.6 ms per loop
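
As a small self-contained check of the groupby approach above, here is a minimal sketch on a hand-made frame. The toy values and the names df, gaps and avg_gap_seconds are made up for illustration and are not from the question; it assumes a reasonably recent pandas. It sorts by device and time first, so that diff() really measures the gap between consecutive hits, and it reports the result in seconds.

```python
import pandas as pd

# Hypothetical toy data: two devices, timestamps deliberately out of order.
df = pd.DataFrame({
    'device_id': ['a', 'b', 'a', 'a', 'b'],
    'datetime': pd.to_datetime([
        '2015-04-01 08:00:30',
        '2015-04-01 08:00:00',
        '2015-04-01 08:00:00',
        '2015-04-01 08:01:30',
        '2015-04-01 08:00:20',
    ]),
})

# Sort so that consecutive rows within a device are consecutive hits.
df = df.sort_values(['device_id', 'datetime'])

# Gap to the previous hit of the same device (NaT for each device's first hit).
gaps = df.groupby('device_id')['datetime'].diff()

# Average gap per device in seconds; the missing first-hit gaps are skipped by mean().
avg_gap_seconds = gaps.dt.total_seconds().groupby(df['device_id']).mean()
print(avg_gap_seconds)
```

Device a has hits at 0, 30 and 90 seconds, so its gaps are 30 and 60 seconds; device b has a single gap of 20 seconds.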

Answer 2

Answered by Alexander

To get a dictionary of the average difference between timestamps for each unique device ID:

device_ids = df.device_id.unique()
device_tdelta = {device: df.loc[df.device_id == device, 'datetime'].diff().mean()
                 for device in device_ids}

You then need to convert these timedeltas to seconds:

# A device with a single hit has no gap, so its mean is NaT; keep it as NaT.
device_seconds = {device: ts.total_seconds()
                          if not pd.isnull(ts)
                          else pd.NaT
                  for device, ts in device_tdelta.items()}

If the datetime column holds strings, it first needs to be converted to pandas Timestamps.

df['datetime'] = pd.to_datetime(df['datetime'])
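
One possible shortcut, offered as a sketch rather than as part of the answer above: once a device's hits are sorted, the gaps between them telescope, so their mean equals (last hit - first hit) / (number of hits - 1). That needs only per-group min, max and count, with no sorting and no per-device filtering. The toy frame and the names span, n_gaps and device_seconds below are hypothetical, and the code assumes the datetime column has already been converted from strings.

```python
import pandas as pd

# Hypothetical toy data; device 'c' has a single hit and therefore no gap.
df = pd.DataFrame({
    'device_id': ['a', 'b', 'a', 'a', 'b', 'c'],
    'datetime': pd.to_datetime([
        '2015-04-01 08:00:30',
        '2015-04-01 08:00:00',
        '2015-04-01 08:00:00',
        '2015-04-01 08:01:30',
        '2015-04-01 08:00:20',
        '2015-04-01 08:05:00',
    ]),
})

grouped = df.groupby('device_id')['datetime']

# Total time between the first and last hit of each device.
span = grouped.max() - grouped.min()

# Number of gaps is one less than the number of hits.
n_gaps = grouped.count() - 1

# Mean gap in seconds; a device with one hit gives 0 / 0, which is NaN.
device_seconds = (span.dt.total_seconds() / n_gaps).to_dict()
print(device_seconds)
```

Devices with a single hit come out as NaN here, which plays the same role as the NaT placeholder in the dictionary above.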
