Python Pandas: What is the fastest way to create a datetime index?

Notice: this page is a translated copy of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/11136006/

Time: 2020-09-13 15:45:05  Source: igfitidea

Tags: python, performance, parsing, datetime, pandas

Asked by user1412286

My data looks like so:


TEST
2012-05-01 00:00:00.203 OFF 0
2012-05-01 00:00:11.203 OFF 0
2012-05-01 00:00:22.203 ON 1
2012-05-01 00:00:33.203 ON 1
2012-05-01 00:00:44.203 OFF 0
TEST
2012-05-02 00:00:00.203 OFF 0
2012-05-02 00:00:11.203 OFF 0
2012-05-02 00:00:22.203 OFF 0
2012-05-02 00:00:33.203 ON 1
2012-05-02 00:00:44.203 ON 1
2012-05-02 00:00:55.203 OFF 0

I'm using pandas read_table to read a pre-parsed string (with the "TEST" lines already stripped out) like so:

df = pandas.read_table(buf, sep=' ', header=None, parse_dates=[[0, 1]], date_parser=dateParser, index_col=[0])
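The question does not show the pre-parsing step. A minimal sketch of one way to build such a buffer follows, assuming the marker lines consist of exactly "TEST"; the helper name strip_marker_lines and the RAW sample are hypothetical and not part of the original post.

```python
import io

# A short sample in the same layout as the data shown in the question.
RAW = """TEST
2012-05-01 00:00:00.203 OFF 0
2012-05-01 00:00:11.203 OFF 0
TEST
2012-05-02 00:00:00.203 OFF 0
"""

def strip_marker_lines(text, marker="TEST"):
    """Return a file-like buffer holding every line except the marker lines."""
    kept = [line for line in text.splitlines() if line.strip() != marker]
    return io.StringIO("\n".join(kept) + "\n")

buf = strip_marker_lines(RAW)
print(buf.getvalue())
```

For files of several GB, building the whole string in memory may be too costly, so filtering line by line while streaming from disk may be the better fit.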

So far, I've tried several date parsers; the uncommented one is the fastest.

from datetime import datetime

def dateParser(s):
    #return datetime.strptime(s, "%Y-%m-%d %H:%M:%S.%f")
    # Slice the fixed-width string; the last field converts milliseconds to microseconds.
    return datetime(int(s[0:4]), int(s[5:7]), int(s[8:10]), int(s[11:13]), int(s[14:16]), int(s[17:19]), int(s[20:23])*1000)
    #return np.datetime64(s)
    #return pandas.Timestamp(s, "%Y-%m-%d %H:%M:%S.%f", tz='utc' )
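The slicing parser only works because every timestamp has the same fixed-width layout, and it does no validation. As a quick sanity check, the sketch below compares it with strptime on one of the sample rows; the name parse_by_slicing is a hypothetical stand-in for dateParser.

```python
from datetime import datetime

def parse_by_slicing(s):
    # Fixed-width layout assumed: YYYY-MM-DD HH:MM:SS.mmm
    return datetime(int(s[0:4]), int(s[5:7]), int(s[8:10]),
                    int(s[11:13]), int(s[14:16]), int(s[17:19]),
                    int(s[20:23]) * 1000)  # milliseconds -> microseconds

sample = "2012-05-01 00:00:22.203"
by_slicing = parse_by_slicing(sample)
by_strptime = datetime.strptime(sample, "%Y-%m-%d %H:%M:%S.%f")

# Both parsers should agree on the resulting datetime.
print(by_slicing == by_strptime)
```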

Is there anything else I can do to speed things up? I need to read large amounts of data: several GB per file.

Answered by diliop

The short answer is that the approach you identified as the fastest way to parse your date/time strings into a datetime-type index is indeed the fastest. I timed some of your approaches along with a few others, and these are the results.

First, build an example DataFrame to work with:

from datetime import datetime
from pandas import *

# Note: DateRange and datetools come from early pandas releases;
# later versions replaced them with date_range(start, end, freq=...).
start = datetime(2000, 1, 1)
end = datetime(2012, 12, 1)
d = DateRange(start, end, offset=datetools.Hour())
t_df = DataFrame({'field_1': np.array(['OFF', 'ON'])[np.random.random_integers(0, 1, d.size)], 'field_2': np.random.random_integers(0, 1, d.size)}, index=d)

Where:


In [1]: t_df.head()
Out[1]: 
                    field_1  field_2
2000-01-01 00:00:00      ON        1
2000-01-01 01:00:00     OFF        0
2000-01-01 02:00:00     OFF        1
2000-01-01 03:00:00     OFF        1
2000-01-01 04:00:00      ON        1
In [2]: t_df.shape
Out[2]: (113233, 2)

This is an approximately 3.2 MB file if you dump it to disk. We now need to drop the DateRange type of the Index and turn it into a list of str, to simulate how the data would look when you parse it in:

t_df.index = t_df.index.map(str)

If you pass parse_dates=True when reading the data into a DataFrame with read_table, you are looking at a mean parse time of about 9.5 seconds:

In [3]: import numpy as np
In [4]: import timeit
In [5]: t_df.to_csv('data.tsv', sep='\t', index_label='date_time')
In [6]: t = timeit.Timer("from __main__ import read_table; read_table('data.tsv', sep='\t', index_col=0, parse_dates=True)")
In [7]: np.mean(t.repeat(10, number=1))
Out[7]: 9.5226533889770515

The other strategies rely on first reading the data into a DataFrame (negligible parsing time) and then converting the index to an Index of datetime objects:

In [8]: t = timeit.Timer("from __main__ import t_df, dateutil; map(dateutil.parser.parse, t_df.index.values)")
In [9]: np.mean(t.repeat(10, number=1))
Out[9]: 7.6590064525604244
In [10]: t = timeit.Timer("from __main__ import t_df, dateutil; t_df.index.map(dateutil.parser.parse)")
In [11]: np.mean(t.repeat(10, number=1))
Out[11]: 7.8106775999069216
In [12]: t = timeit.Timer("from __main__ import t_df, datetime; t_df.index.map(lambda x: datetime.strptime(x, \"%Y-%m-%d %H:%M:%S\"))"); np.mean(t.repeat(10, number=1))
Out[12]: 2.0389052629470825
In [13]: t = timeit.Timer("from __main__ import t_df, np; map(np.datetime_, t_df.index.values)")
In [14]: np.mean(t.repeat(10, number=1))
Out[14]: 3.8656840562820434
In [15]: t = timeit.Timer("from __main__ import t_df, np; map(np.datetime64, t_df.index.values)")
In [16]: np.mean(t.repeat(10, number=1))
Out[16]: 3.9244711160659791

And now for the winner:


In [17]: def f(s):
   ....:         return datetime(int(s[0:4]), 
   ....:                     int(s[5:7]), 
   ....:                     int(s[8:10]), 
   ....:                     int(s[11:13]), 
   ....:                     int(s[14:16]), 
   ....:                     int(s[17:19]))
   ....: t = timeit.Timer("from __main__ import t_df, f; t_df.index.map(f)")
   ....: 
In [18]: np.mean(t.repeat(10, number=1))
Out[18]: 0.33927145004272463
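The session above dates from an old Python 2 and pandas setup. To repeat the comparison on a current interpreter, a rough sketch using only the standard library might look like the following. The synthetic stamps list and the function names are assumptions made for illustration, and absolute timings will differ from the figures quoted here.

```python
import timeit
from datetime import datetime, timedelta

# Synthetic hourly timestamps as strings (a small stand-in for the real index).
start = datetime(2000, 1, 1)
stamps = [str(start + timedelta(hours=i)) for i in range(10000)]

def by_strptime(s):
    return datetime.strptime(s, "%Y-%m-%d %H:%M:%S")

def by_slicing(s):
    return datetime(int(s[0:4]), int(s[5:7]), int(s[8:10]),
                    int(s[11:13]), int(s[14:16]), int(s[17:19]))

# list(...) forces evaluation: in Python 3 a bare map() is lazy and would time nothing.
t_strptime = min(timeit.repeat(lambda: list(map(by_strptime, stamps)), number=1, repeat=3))
t_slicing = min(timeit.repeat(lambda: list(map(by_slicing, stamps)), number=1, repeat=3))

print("strptime: %.4fs  slicing: %.4fs" % (t_strptime, t_slicing))
```

Taking the minimum of several repeats, rather than the mean, tends to reduce noise from other processes. Either statistic is fine as long as it is applied consistently.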

When working with numpy, pandas or datetime-type approaches, there may well be further optimizations to consider, but it seems to me that staying with CPython's standard library and converting each date/time str into a tuple of ints, and that into a datetime instance, is the fastest way to get what you want.
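Still within the standard library, Python 3.7 and later (which postdate this answer) also provide datetime.fromisoformat, which accepts the "YYYY-MM-DD HH:MM:SS.mmm" layout used in the question. The sketch below only shows that it parses the sample correctly. Whether it beats the slicing function on your data is something to measure rather than assume.

```python
from datetime import datetime

sample = "2012-05-01 00:00:22.203"

# fromisoformat is available since Python 3.7 and handles 3-digit fractional seconds.
parsed = datetime.fromisoformat(sample)
print(parsed)
```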