Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/42462906/
pandas read_csv converters performance issue
Asked by pansen
Description
While reading large csv files (several million rows of mixed data), I used the converters parameter of pandas' read_csv method to conveniently pass functions that convert strings to datetime objects etc.
However, using the converters parameter is very slow in comparison to manually converting the corresponding columns.
Example Code
For illustration, let's convert strings to datetime objects using 3 different methods:
- converters parameter
- parse_dates/date_parser parameters
- manually after loading the csv
Note that the conversion from string to datetime is arbitrary here. It could be replaced with other functions (which just would not have dedicated parameters like parse_dates/date_parser).
import pandas as pd # 0.19.2 with python 3.5
# create dummy data
rows = 100000
data = {"dates": pd.date_range("2017-02-27 20:44:23", periods=rows, freq="S")}
# save as temporary file for timeit
pd.DataFrame(data).to_csv("dummy")
# define converters
def convert_datetime(series):
    return pd.to_datetime(series, format="%Y-%m-%d %H:%M:%S")
Now, let's see the timeit (IPython) comparisons:
%%timeit
df = pd.read_csv("dummy", converters={"dates": convert_datetime})
# 1 loop, best of 3: 7.76 s per loop
%%timeit
df = pd.read_csv("dummy", parse_dates=["dates"], date_parser=convert_datetime)
# 10 loops, best of 3: 125 ms per loop
%%timeit
df = pd.read_csv("dummy")
df["dates"] = convert_datetime(df["dates"])
# 10 loops, best of 3: 129 ms per loop
Result
The converters' version is roughly 60 times slower than the other ones. To understand this a bit better, I wrapped the convert_datetime function into a little decorator class to count the number of calls:
class Counter:
    def __init__(self, func):
        self.func = func
        self.count = 0

    def __call__(self, *args, **kwargs):
        self.count += 1
        return self.func(*args, **kwargs)

@Counter
def convert_datetime(series):
    return pd.to_datetime(series, format="%Y-%m-%d %H:%M:%S")
It revealed that using the converters parameter calls the convert_datetime function once for every single value, whereas the other two versions call the converter function only once. This explains the performance deficit.
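For reference, the call counts can be inspected directly after each variant (a small sketch in the same IPython session, assuming the Counter-wrapped convert_datetime above and the "dummy" file from the example code):
convert_datetime.count = 0
pd.read_csv("dummy", converters={"dates": convert_datetime})
convert_datetime.count  # one call per value in the column

convert_datetime.count = 0
pd.read_csv("dummy", parse_dates=["dates"], date_parser=convert_datetime)
convert_datetime.count  # a single call on the whole column

convert_datetime.count = 0
df = pd.read_csv("dummy")
df["dates"] = convert_datetime(df["dates"])
convert_datetime.count  # likewise a single call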
Question
Why is that? I expect vectorized functions that are passed to the converters parameter to be executed on all values at once instead of on each value individually.
Accepted answer by miradulo
From the docs for read_csv (emphasis mine):
converters : dict, default None
- Dict of functions for converting *values* in certain columns. Keys can either be integers or column labels
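As a small illustration of that signature (a sketch, not from the original post; the pass-through lambda is made up), keys may be column labels or integer positions, and each function receives one cell value at a time:
# by column label
pd.read_csv("dummy", converters={"dates": lambda v: v.strip()})
# or by integer position ("dates" is column 1 of the "dummy" file, after the written index)
pd.read_csv("dummy", converters={1: lambda v: v.strip()})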
The idea of the converters keyword parameter is to provide functions which act upon individual values, not entire columns. This can be seen by redefining the converter function:
import datetime

def convert_datetime(val):
    return datetime.datetime.strptime(val, "%Y-%m-%d %H:%M:%S")
%timeit pd.read_csv("dummy", converters={"dates": convert_datetime})
1 loop, best of 3: 2.81 s per loop
And as you can imagine, this amounts to a lot of Python function calls.
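To get a feel for that overhead in isolation (an illustrative sketch, not part of the original answer; absolute timings vary by machine), one can compare per-value strptime calls with a single vectorized pd.to_datetime call outside of read_csv:
values = ["2017-02-27 20:44:23"] * 100000
# one Python-level call per value, roughly what the converters path ends up doing
%timeit [datetime.datetime.strptime(v, "%Y-%m-%d %H:%M:%S") for v in values]
# a single vectorized call over the whole column
%timeit pd.to_datetime(pd.Series(values), format="%Y-%m-%d %H:%M:%S")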
As for why the converters parameter doesn't accept vectorized functions, my best guess would be that they offer somewhat less flexibility than the current implementation. The idea is that you can already parse the necessary columns of dates etc., which may require some parsing logic, with the vectorized parse_dates and date_parser, and most further column operations can be done post-read with vectorized methods.
In other words, for elements which have to end up in a different type (like datetimes), it is convenient to have vectorized parsing methods inside read_csv. Beyond that, converters is just a convenience parameter which can flexibly act on individual values - any further vectorized methods can just be applied post-read anyway.
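Put differently, a common pattern (a sketch of the idea rather than code from the answer; the str.strip cleanup is a made-up example) is to reserve converters for genuinely per-value cleanup and to do any vectorized transformation after the read:
# per-value cleanup for which read_csv has no dedicated vectorized hook
df = pd.read_csv("dummy", converters={"dates": str.strip})
# vectorized work happens post-read, once for the whole column
df["dates"] = pd.to_datetime(df["dates"], format="%Y-%m-%d %H:%M:%S")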