为什么 pandas.to_datetime 对于非标准时间格式(例如“2014/12/31”)很慢
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/32034689/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
Why is pandas.to_datetime slow for non standard time format such as '2014/12/31'
提问by liubenyuan
I have a .csv file in such format
我有一个这种格式的 .csv 文件
timestmp, p
2014/12/31 00:31:01:9200, 0.7
2014/12/31 00:31:12:1700, 1.9
...
and when read via pd.read_csvand convert the time str to datetime using pd.to_datetime, the performance drops dramatically. Here is a minimal example.
当通过读取pd.read_csv并将时间 str 转换为 datetime 时pd.to_datetime,性能会急剧下降。这是一个最小的例子。
import re
import pandas as pd
d = '2014-12-12 01:02:03.0030'
c = re.sub('-', '/', d)
%timeit pd.to_datetime(d)
%timeit pd.to_datetime(c)
%timeit pd.to_datetime(c, format="%Y/%m/%d %H:%M:%S.%f")
and the performances are:
表演是:
10000 loops, best of 3: 62.4 μs per loop
10000 loops, best of 3: 181 μs per loop
10000 loops, best of 3: 82.9 μs per loop
so, how could I improve the performance of pd.to_datetimewhen reading date from a csv file?
那么,如何提高pd.to_datetime从 csv 文件读取日期的性能?
回答by joris
This is because pandas falls back to dateutil.parser.parsefor parsing the strings when it has a non-default format or when no formatstring is supplied (this is much more flexible, but also slower).
这是因为dateutil.parser.parse当 Pandas 具有非默认格式或没有format提供字符串时,它会回退到解析字符串(这更灵活,但也更慢)。
As you have shown above, you can improve the performance by supplying a formatstring to to_datetime. Or another option is to use infer_datetime_format=True
如上所示,您可以通过向 提供format字符串来提高性能to_datetime。或者另一种选择是使用infer_datetime_format=True
Apparently, the infer_datetime_formatcannot infer when there are microseconds. With an example without those, you can see a large speed-up:
显然,infer_datetime_format当有微秒时无法推断。通过一个没有这些的例子,你可以看到一个很大的加速:
In [28]: d = '2014-12-24 01:02:03'
In [29]: c = re.sub('-', '/', d)
In [30]: s_c = pd.Series([c]*10000)
In [31]: %timeit pd.to_datetime(s_c)
1 loops, best of 3: 1.14 s per loop
In [32]: %timeit pd.to_datetime(s_c, infer_datetime_format=True)
10 loops, best of 3: 105 ms per loop
In [33]: %timeit pd.to_datetime(s_c, format="%Y/%m/%d %H:%M:%S")
10 loops, best of 3: 99.5 ms per loop
回答by Zach
This question has already been sufficiently answered, but I wanted to add in the results of some tests I was running to optimize my own code.
这个问题已经得到了充分的回答,但我想添加一些我正在运行的测试结果来优化我自己的代码。
I was getting this format from an API: "Wed Feb 08 17:58:56 +0000 2017".
我是从 API 获取这种格式的:“Wed Feb 08 17:58:56 +0000 2017”。
Using the default pd.to_datetime(SERIES)with an implicit conversion, it was taking over an hour to process roughly 20 million rows (depending on how much free memory I had).
使用pd.to_datetime(SERIES)带有隐式转换的默认值,处理大约 2000 万行需要一个多小时(取决于我有多少可用内存)。
That said, I tested three different conversions:
也就是说,我测试了三种不同的转换:
# explicit conversion of essential information only -- parse dt str: concat
def format_datetime_1(dt_series):
def get_split_date(strdt):
split_date = strdt.split()
str_date = split_date[1] + ' ' + split_date[2] + ' ' + split_date[5] + ' ' + split_date[3]
return str_date
dt_series = pd.to_datetime(dt_series.apply(lambda x: get_split_date(x)), format = '%b %d %Y %H:%M:%S')
return dt_series
# explicit conversion of what datetime considers "essential date representation" -- parse dt str: del then join
def format_datetime_2(dt_series):
def get_split_date(strdt):
split_date = strdt.split()
del split_date[4]
str_date = ' '.join(str(s) for s in split_date)
return str_date
dt_series = pd.to_datetime(dt_series.apply(lambda x: get_split_date(x)), format = '%c')
return dt_series
# explicit conversion of what datetime considers "essential date representation" -- parse dt str: concat
def format_datetime_3(dt_series):
def get_split_date(strdt):
split_date = strdt.split()
str_date = split_date[0] + ' ' + split_date[1] + ' ' + split_date[2] + ' ' + split_date[3] + ' ' + split_date[5]
return str_date
dt_series = pd.to_datetime(dt_series.apply(lambda x: get_split_date(x)), format = '%c')
return dt_series
# implicit conversion
def format_datetime_baseline(dt_series):
return pd.to_datetime(dt_series)
This was the results:
这是结果:
# sample of 250k rows
dt_series_sample = df['created_at'][:250000]
%timeit format_datetime_1(dt_series_sample) # best of 3: 1.56 s per loop
%timeit format_datetime_2(dt_series_sample) # best of 3: 2.09 s per loop
%timeit format_datetime_3(dt_series_sample) # best of 3: 1.72 s per loop
%timeit format_datetime_baseline(dt_series_sample) # best of 3: 1min 9s per loop
The first test results in an impressive 97.7% runtime reduction!
第一次测试的结果是令人印象深刻的 97.7% 的运行时间减少!
Somewhat surprisingly, it looks like even the "appropriate representation" takes longer, probably because it is semi-implicit.
有点令人惊讶的是,看起来即使是“适当的表示”也需要更长的时间,可能是因为它是半隐式的。
Conclusion: the more explicit you are, the faster it will run.
结论:你越明确,它运行得越快。
回答by C8H10N4O2
Often I am unable to specify a standard date format ahead of time because I simply do not know how each client will choose to submit it. The dates are unpredictably formatted and often missing.
通常我无法提前指定标准日期格式,因为我根本不知道每个客户将如何选择提交它。日期的格式不可预测并且经常丢失。
In these cases, instead of using pd.to_datetime, I have found it more efficient to write my own wrapper to dateutil.parser.parse:
在这些情况下,pd.to_datetime我发现将自己的包装器编写为dateutil.parser.parse:
import pandas as pd
from dateutil.parser import parse
import numpy as np
def parseDateStr(s):
if s != '':
try:
return np.datetime64(parse(s))
except ValueError:
return np.datetime64('NaT')
else: return np.datetime64('NaT')
# Example data:
someSeries=pd.Series( ['NotADate','','1-APR-16']*10000 )
# Compare times:
%timeit pd.to_datetime(someSeries, errors='coerce') #1 loop, best of 3: 1.78 s per loop
%timeit someSeries.apply(parseDateStr) #1 loop, best of 3: 904 ms per loop
# The approaches return identical results:
someSeries.apply(parseDateStr).equals(pd.to_datetime(someSeries, errors='coerce')) # True
In this case the runtime is cut in half, but YMMV.
在这种情况下,运行时间减少了一半,但 YMMV。

