Original source: http://stackoverflow.com/questions/29722704/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Fastest way to calculate average of datetime rows using pandas
Asked by Amit Tripathi
I have 122,864 rows of data stored in an HDF5 file, and I am using pandas to process them. Each record has a unique id and a timestamp marking when the user opened the app. For each id, I want the average duration between two consecutive hits of the app.
1283 2015-04-01 08:07:44.131768
1284 2015-04-01 08:08:02.752611
1285 2015-04-01 08:08:02.793380
1286 2015-04-01 08:07:53.910469
1287 2015-04-01 08:08:03.305893
1288 2015-04-01 08:07:44.843050
1289 2015-04-01 08:07:54.767203
1290 2015-04-01 08:08:03.965367
1291 2015-04-01 08:07:45.924854
1292 2015-04-01 08:07:55.408593
1293 2015-04-01 08:07:46.365128
import os
import sys
import pprint

import pandas as pd

pp = pprint.PrettyPrinter()
avgtime = {}  # device_id -> average seconds between hits


class User(object):
    '''
    Properties and functions related to each user.
    attributes:
        datetime: a list of hit timestamps for each user object
        deviceid: unique deviceid
    '''
    def __init__(self, User, device_id):
        self.datetime = pd.to_datetime(list(User['datetime']))
        self.deviceid = device_id
        self.avrgtime = 0.0
        avgtime.setdefault(self.deviceid, 1)

    def avg_duration(self):
        '''
        Average duration between hits for each user.
        '''
        for i, time in enumerate(self.datetime[:-1]):
            self.avrgtime += abs(self.datetime[i+1] - time).total_seconds()
        # Note: this divides by the number of hits, not the number of gaps.
        avgtime[self.deviceid] = self.avrgtime/len(self.datetime)
        pp.pprint(avgtime)
        #avgtime[] = datetime.strptime(time, '%Y-%m-%d %H:%M:%S.%f')


def eachdevice(gstore):
    count = 0
    for did in list(gstore['data'].drop_duplicates('device_id')['device_id']):
        # One HDF5 query per device id.
        auser = gstore.select('data', where="device_id == did")
        gamer = User(auser, did)
        gamer.avg_duration()
        count += 1
    print(count)


# main
if __name__ == '__main__':
    try:
        path = os.path.abspath(sys.argv[1])
        with pd.HDFStore('Gamer.h5') as gstore:
            eachdevice(gstore)
    except IndexError:
        print('\nPass path of the HDF5 file to be analyzed...EXITING\n')
So far I loop over each unique id and use a pandas HDFStore select query to fetch the datetimes for that id. This returns a DataFrame of datetime values. I convert it to a list and then loop over it to calculate the average difference between consecutive timestamps. This approach takes a lot of time. Is there a way to do this within pandas?
Please help.
EDIT: I ran the code after commenting out all of the calculation. I think the line auser = gstore.select('data', where="device_id == did") is taking all the time. How can I improve this? Is there an alternative or better way? %timeit result: 1 loops, best of 3: 13.3 s per loop for 1000 iterations.
Edit: Sample data:
device_id datetime
0 c4be7e55d98914647c51329edc2ab734 2015-03-30 22:00:05.922317
1 05fed9f8e07c3cac457723286d36f621 2015-03-30 22:00:07.895672
2 783faeed9fe35a3f45b521b3a6667a2d 2015-03-30 22:00:05.529631
3 c2022ad838cec35bdb12fc3a6e2cf452 2015-03-30 21:59:59.043905
4 a8a04268ee0c22b26af59e053390cf6f 2015-03-30 22:00:14.248542
5 4e5ed16b44b9cd38c408859d1d241e2d 2015-03-30 22:00:02.391719
6 c0bfd3f9046855ffaaec4d99c367fd8c 2015-03-30 22:00:18.649193
7 95f1182c6e4d601ba0b20f5204168ecb 2015-03-30 22:00:13.629728
8 a85caa7e0a4a7d57e6330c083daff326 2015-03-30 22:00:08.340469
9 46cdbee963814cdb4e6a0ac0049b8fc6 2015-03-30 22:00:23.152820
10 3c8bf70679cd9c6f18aa52d06e0e181d 2015-03-30 22:00:17.619251
11 52bc4e3d9dc373d89ec31effe10e6f30 2015-03-30 22:00:11.591954
12 3477eb25e26b6bff0bfc6c3ee59a5f40 2015-03-30 22:00:25.745083
13 e7bf8ae864f2148831628a6f2e8e406e 2015-03-30 22:00:20.911568
14 a15af8faffd655a3e80f85840bbf3c2a 2015-03-30 22:00:19.017887
15 9d9f71f080c0cf478ec4117e78ff89ee 2015-03-30 22:00:28.435585
16 1633d88738316e3602890499b1f778b1 2015-03-30 22:00:24.108234
17 3362daf99f11541acbf45e70fdaf5f49 2015-03-30 22:00:24.512366
18 96c3c005eaaaa8d6af3f2443ca8f73df 2015-03-30 22:00:29.713550
19 002642b9ed495f84318fcb42557f53e1 2015-03-30 22:00:37.936647
Answered by S Anand
Let's create a dummy dataset with 150000 rows similar to yours.
>>> import numpy as np
>>> import pandas as pd
>>> data = pd.DataFrame({
...     'datetime': pd.to_datetime(np.random.randint(1429449000, 1429649000, 150000), unit='s'),
...     'device_id': np.random.randint(0, 100, 150000)
... }).sort_values('datetime')
>>> data.head()
                  datetime  device_id
113719 2015-04-19 13:10:00         34
120323 2015-04-19 13:10:01         22
91342  2015-04-19 13:10:04          9
61170  2015-04-19 13:10:08         27
103748 2015-04-19 13:10:11         65
You can use .groupby to pre-compute groups. This lets you easily identify all datetimes for a given device_id.
>>> groups = data.groupby('device_id')
>>> data.loc[groups.groups.get(34)].head()  # Get the data for device_id = 34
                  datetime  device_id
113719 2015-04-19 13:10:00         34
105761 2015-04-19 13:11:30         34
85903  2015-04-19 13:18:40         34
36395  2015-04-19 13:19:55         34
108850 2015-04-19 13:20:06         34
From here, it's quick enough to identify the average differences.
>>> def mean_diff(device_id):
...     return data['datetime'][groups.groups.get(device_id)].diff().mean()
...
>>> mean_diff(34)
Timedelta('0 days 00:02:14.470746')
Since .groupby pre-computes the groups, every successive lookup is quite fast. This step takes about 2 milliseconds on the 150000 rows.
In [68]: %timeit mean_diff(34)
100 loops, best of 3: 2.03 ms per loop
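Note that .diff() measures gaps between rows in their stored order, so it gives the time between consecutive hits only when the rows are sorted by datetime, as the dummy data is. The timestamps in the question's sample are not in order, so here is a minimal sketch for a single device that sorts first. The frame `hits`, its values, and the helper `mean_gap` are made up for illustration.

```python
import pandas as pd

# Hypothetical, made-up hits for two devices, deliberately out of order.
hits = pd.DataFrame({
    'device_id': ['a', 'b', 'a', 'a', 'b'],
    'datetime': pd.to_datetime([
        '2015-04-01 08:00:30',
        '2015-04-01 08:00:00',
        '2015-04-01 08:00:00',
        '2015-04-01 08:00:10',
        '2015-04-01 08:01:00',
    ]),
})


def mean_gap(frame, device_id):
    """Mean time between consecutive hits of one device."""
    times = frame.groupby('device_id').get_group(device_id)['datetime']
    # Sort first so that diff() compares each hit with the previous one in time.
    return times.sort_values().diff().mean()


gap_a = mean_gap(hits, 'a')
gap_b = mean_gap(hits, 'b')
print(gap_a, gap_b)
```

Without the sort, some differences would be negative and would partly cancel in the mean.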
You can also compute this for every device_id at once, like this:
>>> time_diff = groups.apply(lambda df: df.datetime.diff().mean())
>>> time_diff.head()
device_id
0 00:02:12.871504
1 00:02:10.464099
2 00:02:09.550000
3 00:02:15.845003
4 00:02:14.642375
dtype: timedelta64[ns]
This is pretty fast too. For these 150,000 rows, it takes under 50ms. Of course, your mileage may vary.
In [79]: %timeit groups.apply(lambda df: df.datetime.diff().mean())
10 loops, best of 3: 46.6 ms per loop
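Because consecutive differences telescope, the mean gap within a group equals (last - first) / (n - 1). That suggests a variant that uses only built-in aggregations instead of a Python-level lambda. The sketch below assumes a made-up frame `hits` (not the dummy data above); whether it is actually faster on your data would need measuring.

```python
import pandas as pd

# Hypothetical, made-up hits; device 3 has a single hit and so no gap at all.
hits = pd.DataFrame({
    'device_id': [1, 2, 1, 1, 2, 3],
    'datetime': pd.to_datetime([
        '2015-04-01 08:00:30',
        '2015-04-01 08:00:00',
        '2015-04-01 08:00:00',
        '2015-04-01 08:00:10',
        '2015-04-01 08:01:00',
        '2015-04-01 08:05:00',
    ]),
})

grouped = hits.groupby('device_id')['datetime']

# Sum of consecutive gaps = last hit - first hit; min/max ignore row order.
span_seconds = (grouped.max() - grouped.min()).dt.total_seconds()
# A device with n hits has n - 1 gaps.
n_gaps = grouped.count() - 1

# 0.0 / 0 yields NaN for single-hit devices rather than raising.
mean_gap_seconds = span_seconds / n_gaps
print(mean_gap_seconds)
```

Since min and max do not depend on row order, this version does not need the frame to be sorted by datetime.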
Answered by Alexander
To get a dictionary of the average difference between Timestamps for each unique device ID:
device_ids = df.device_id.unique()
device_tdelta = {device: df.loc[df.device_id == device, 'datetime'].diff().mean()
                 for device in device_ids}
You then need to convert these timedeltas to seconds:
# A device with a single row has no differences, so its mean is NaT;
# NaT has no usable total_seconds(), so pass it through unchanged.
device_seconds = {device: ts.total_seconds() if not pd.isnull(ts) else pd.NaT
                  for device, ts in device_tdelta.items()}
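If the per-device timedeltas are held in a pandas Series rather than a dict, the .dt.total_seconds() accessor does the same conversion in one call and turns NaT into NaN. A small sketch with made-up values; `device_tdelta_series` and the device labels are illustrative names only.

```python
import pandas as pd

# Hypothetical per-device mean gaps; None becomes NaT (a device with one hit).
device_tdelta_series = pd.Series(
    pd.to_timedelta(['0 days 00:02:14', '0 days 00:01:30', None]),
    index=['dev_a', 'dev_b', 'dev_c'],
)

# Vectorized conversion to float seconds; NaT maps to NaN.
device_seconds = device_tdelta_series.dt.total_seconds()
print(device_seconds)
```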
If the datetime column holds strings, it first needs to be converted to pandas Timestamps:
df.datetime = [pd.Timestamp(ts) for ts in df.datetime]
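pd.to_datetime performs the same conversion on the whole column at once. A sketch assuming the strings follow the format shown in the question's sample data; the frame `raw` and its shortened device ids are made up for illustration.

```python
import pandas as pd

# Hypothetical frame whose datetime column still holds strings.
raw = pd.DataFrame({
    'device_id': ['x', 'x', 'y'],
    'datetime': ['2015-03-30 22:00:05.922317',
                 '2015-03-30 22:00:07.895672',
                 '2015-03-30 22:00:05.529631'],
})

# Parse the whole column in one call; an explicit format avoids per-row guessing.
raw['datetime'] = pd.to_datetime(raw['datetime'], format='%Y-%m-%d %H:%M:%S.%f')

first_gap = raw['datetime'].iloc[1] - raw['datetime'].iloc[0]
print(raw.dtypes)
print(first_gap)
```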

