pandas: Speed improvement on large read_csv with datetime index
Disclaimer: this page is a translation of a popular StackOverflow question. It is provided under the CC BY-SA 4.0 license; if you use or share it, you must do so under the same license, cite the original URL, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/14446744/
Speed-improvement on large pandas read_csv with datetime index
Asked by Michael WS
I have enormous files that look like this:
05/31/2012,15:30:00.029,1306.25,1,E,0,,1306.25
05/31/2012,15:30:00.029,1306.25,1,E,0,,1306.25
05/31/2012,15:30:00.029,1306.25,8,E,0,,1306.25
05/31/2012,15:30:00.029,1306.25,8,E,0,,1306.25
I can easily read them using the following:
pd.read_csv(gzip.open("myfile.gz"), header=None,
            names=["date", "time", "price", "size", "type", "zero", "empty", "last"],
            parse_dates=[[0, 1]])
Is there any way to efficiently parse dates like this into pandas timestamps? If not, is there any guide for writing a Cython function that can be passed to date_parser= ?
I tried writing my own parser function and it still takes too long for the project I am working on.
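For context, a minimal sketch (not part of the original question) of the usual vectorized alternative: read the two columns as plain strings, then hand pd.to_datetime an explicit format so it can skip per-row format inference. The format string assumes the month/day/year layout shown in the sample rows.

import gzip

import pandas as pd

df = pd.read_csv(gzip.open("myfile.gz"), header=None,
                 names=["date", "time", "price", "size", "type", "zero", "empty", "last"])
# One vectorized parse with a known format instead of per-row inference.
df.index = pd.to_datetime(df["date"] + " " + df["time"],
                          format="%m/%d/%Y %H:%M:%S.%f")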
Accepted answer by Vladimir
An improvement on Michael WS's earlier solution:
- conversion to pandas.Timestamp is better performed outside the Cython code
- atoi and processing native C strings is a little faster than Python functions
- the number of datetime-library calls is reduced from 2 to 1 (plus an occasional extra call for the date)
- microseconds are also processed
NB! The date order in this code is day/month/year.
All in all, the code seems to be approximately 10 times faster than the original convert_date_cython. However, if it is called after read_csv, then on an SSD the difference in total time is only a few percent because of the reading overhead. I would guess that on a regular HDD the difference would be even smaller.
cimport numpy as np
import datetime
import numpy as np
import pandas as pd
from libc.stdlib cimport atoi, malloc, free
from libc.string cimport strcpy
### Modified code from Michael WS:
### https://stackoverflow.com/a/15812787/2447082
def convert_date_fast(np.ndarray date_vec, np.ndarray time_vec):
    cdef int i, d_year, d_month, d_day, t_hour, t_min, t_sec, t_ms
    cdef int N = len(date_vec)
    cdef np.ndarray out_ar = np.empty(N, dtype=np.object)
    cdef bytes prev_date = <bytes> 'xx/xx/xxxx'
    cdef char *date_str = <char *> malloc(20)
    cdef char *time_str = <char *> malloc(20)
    for i in range(N):
        if date_vec[i] != prev_date:
            prev_date = date_vec[i]
            strcpy(date_str, prev_date)  ### xx/xx/xxxx
            date_str[2] = 0
            date_str[5] = 0
            d_year = atoi(date_str+6)
            d_month = atoi(date_str+3)
            d_day = atoi(date_str)
        strcpy(time_str, time_vec[i])  ### xx:xx:xx:xxxxxx
        time_str[2] = 0
        time_str[5] = 0
        time_str[8] = 0
        t_hour = atoi(time_str)
        t_min = atoi(time_str+3)
        t_sec = atoi(time_str+6)
        t_ms = atoi(time_str+9)
        out_ar[i] = datetime.datetime(d_year, d_month, d_day, t_hour, t_min, t_sec, t_ms)
    free(date_str)
    free(time_str)
    return pd.to_datetime(out_ar)
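For completeness, a minimal sketch of how the module above might be built and called; the file name fast_dates.pyx and the pyximport build step are assumptions, not part of the original answer. Because the function copies raw C strings, the columns are encoded to byte strings first (under Python 2 they already are), and note the answer's caveat that it assumes day/month/year field order.

import gzip

import numpy as np
import pandas as pd
import pyximport

# Compile fast_dates.pyx (containing convert_date_fast) on import;
# numpy's headers are needed because the module cimports numpy.
pyximport.install(setup_args={"include_dirs": np.get_include()})
from fast_dates import convert_date_fast

df = pd.read_csv(gzip.open("myfile.gz"), header=None,
                 names=["date", "time", "price", "size", "type", "zero", "empty", "last"])
df.index = convert_date_fast(df["date"].str.encode("ascii").values,
                             df["time"].str.encode("ascii").values)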
Answered by Michael WS
I got an incredible speedup (50X) with the following Cython code:
Call from Python: timestamps = convert_date_cython(df["date"].values, df["time"].values)
cimport numpy as np
import pandas as pd
import datetime
import numpy as np
def convert_date_cython(np.ndarray date_vec, np.ndarray time_vec):
    cdef int i
    cdef int N = len(date_vec)
    cdef out_ar = np.empty(N, dtype=np.object)
    date = None
    for i in range(N):
        if date is None or date_vec[i] != date_vec[i - 1]:
            # list(...) keeps the result indexable on Python 3, where map() is lazy
            dt_ar = list(map(int, date_vec[i].split("/")))
            date = datetime.date(dt_ar[2], dt_ar[0], dt_ar[1])
        time_ar = list(map(int, time_vec[i].split(".")[0].split(":")))
        time = datetime.time(time_ar[0], time_ar[1], time_ar[2])
        out_ar[i] = pd.Timestamp(datetime.datetime.combine(date, time))
    return out_ar
Answered by blurrcat
The cardinality of datetime strings is not huge. For example, the number of time strings in the format %H-%M-%S is 24 * 60 * 60 = 86400. If the number of rows in your dataset is much larger than this, or your data contains lots of duplicate timestamps, adding a cache to the parsing process can substantially speed things up.
For those who do not have Cython available, here's an alternative solution in pure Python:
import numpy as np
import pandas as pd
from datetime import datetime
def parse_datetime(dt_array, cache=None):
    if cache is None:
        cache = {}
    date_time = np.empty(dt_array.shape[0], dtype=object)
    for i, (d_str, t_str) in enumerate(dt_array):
        try:
            year, month, day = cache[d_str]
        except KeyError:
            year, month, day = [int(item) for item in d_str[:10].split('-')]
            cache[d_str] = year, month, day
        try:
            hour, minute, sec = cache[t_str]
        except KeyError:
            hour, minute, sec = [int(item) for item in t_str.split(':')]
            cache[t_str] = hour, minute, sec
        date_time[i] = datetime(year, month, day, hour, minute, sec)
    return pd.to_datetime(date_time)

def read_csv(filename, cache=None):
    df = pd.read_csv(filename)
    df['date_time'] = parse_datetime(df.loc[:, ['date', 'time']].values, cache=cache)
    return df.set_index('date_time')
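Example use (the file names below are made up): a single cache dictionary can be shared across several files, so each distinct date or time string is only ever parsed once.

shared_cache = {}
frames = [read_csv(name, cache=shared_cache) for name in ("day1.csv", "day2.csv")]
combined = pd.concat(frames).sort_index()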
With the following particular data set, the speedup is 150x+:
$ ls -lh test.csv
-rw-r--r-- 1 blurrcat blurrcat 1.2M Apr 8 12:06 test.csv
$ head -n 4 data/test.csv
user_id,provider,date,time,steps
5480312b6684e015fc2b12bc,fitbit,2014-11-02 00:00:00,17:47:00,25
5480312b6684e015fc2b12bc,fitbit,2014-11-02 00:00:00,17:09:00,4
5480312b6684e015fc2b12bc,fitbit,2014-11-02 00:00:00,19:10:00,67
In ipython:
In [1]: %timeit pd.read_csv('test.csv', parse_dates=[['date', 'time']])
1 loops, best of 3: 10.3 s per loop
In [2]: %timeit read_csv('test.csv', cache={})
1 loops, best of 3: 62.6 ms per loop
To limit memory usage, simply replace the dict cache with something like an LRU cache.
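A minimal sketch of that idea (the helper names and cache sizes are illustrative, not from the original answer): memoize the two string-to-tuple conversions with functools.lru_cache, which bounds memory while keeping the cache hits.

from datetime import datetime
from functools import lru_cache

import numpy as np
import pandas as pd

@lru_cache(maxsize=86400)   # at most one entry per distinct time string
def _split_time(t_str):
    hour, minute, sec = (int(x) for x in t_str.split(':'))
    return hour, minute, sec

@lru_cache(maxsize=4096)    # dates repeat even more heavily
def _split_date(d_str):
    year, month, day = (int(x) for x in d_str[:10].split('-'))
    return year, month, day

def parse_datetime_lru(dt_array):
    out = np.empty(dt_array.shape[0], dtype=object)
    for i, (d_str, t_str) in enumerate(dt_array):
        out[i] = datetime(*_split_date(d_str), *_split_time(t_str))
    return pd.to_datetime(out)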

