pandas read_csv 转换器的性能问题

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same CC BY-SA license and attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/42462906/

Date: 2020-09-14 03:04:42  Source: igfitidea

pandas read_csv converters performance issue

python, performance, csv, pandas, dataframe

Asked by pansen

Description

While reading large csv files (several million rows of mixed data), I used the converters parameter of pandas' read_csv method to conveniently pass functions that convert strings to datetime objects etc.

However, using the converters parameter is very slow in comparison to manually converting the corresponding columns.

Example Code

For illustration, let's convert strings to datetime objects using 3 different methods:

  • converters parameter
  • parse_dates/date_parser parameters
  • manually after loading the csv

Note that the conversion from string to datetime is arbitrary here. It could be replaced with any other function (except that the dedicated parse_dates/date_parser parameters would then not apply).

import pandas as pd # 0.19.2 with python 3.5

# create dummy data
rows = 100000
data = {"dates": pd.date_range("2017-02-27 20:44:23", periods=rows, freq="S")}

# save as temporary file for timeit
pd.DataFrame(data).to_csv("dummy")

# define converters
def convert_datetime(series):
    return pd.to_datetime(series, format="%Y-%m-%d %H:%M:%S")

Now, let's see the timeit (IPython) comparisons:

%%timeit
df = pd.read_csv("dummy", converters={"dates": convert_datetime})
# 1 loop, best of 3: 7.76 s per loop


%%timeit
df = pd.read_csv("dummy", parse_dates=["dates"], date_parser=convert_datetime)
# 10 loops, best of 3: 125 ms per loop


%%timeit
df = pd.read_csv("dummy")
df["dates"] = convert_datetime(df["dates"])
# 10 loops, best of 3: 129 ms per loop
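
As a sanity check, here is a small self-contained sketch (my own addition, using an in-memory CSV rather than the dummy file, so it plays no part in the timings above) confirming that the converters route and the manual post-read route yield the same values — only the code path differs:

```python
import io
import pandas as pd

# three rows of the same timestamp format as the dummy file
csv_text = "dates\n2017-02-27 20:44:23\n2017-02-27 20:44:24\n2017-02-27 20:44:25\n"

def convert_datetime(series):
    return pd.to_datetime(series, format="%Y-%m-%d %H:%M:%S")

# per-value conversion during the read (the slow path)
df_conv = pd.read_csv(io.StringIO(csv_text), converters={"dates": convert_datetime})

# vectorized conversion after the read (the fast path)
df_post = pd.read_csv(io.StringIO(csv_text))
df_post["dates"] = convert_datetime(df_post["dates"])

# same resulting values either way
print(list(df_conv["dates"]) == list(df_post["dates"]))  # True
```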

Result

The converters version is roughly 60 times slower than the other two. To understand this a bit better, I wrapped the convert_datetime function in a little decorator class that counts the number of calls:

转换器的版本大约比其他版本慢 60 倍。为了更好地理解这一点,我将 convert_datetime 函数包装到一个小装饰器类中来计算调用次数:

class Counter:
    def __init__(self, func):
        self.func = func
        self.count = 0

    def __call__(self, *args, **kwargs):
        self.count += 1
        return self.func(*args, **kwargs)

@Counter
def convert_datetime(series):
    return pd.to_datetime(series, format="%Y-%m-%d %H:%M:%S")

It revealed that using the converters parameter calls the convert_datetime function once for every single value, whereas the other versions call the converter function only once. This explains the performance deficit.
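
On a tiny in-memory CSV this is easy to verify directly; a minimal sketch (assuming a hypothetical 5-row file, not part of the original benchmark) showing one call per value versus a single vectorized call:

```python
import io
import pandas as pd

class Counter:
    """Count how many times the wrapped function is invoked."""
    def __init__(self, func):
        self.func = func
        self.count = 0

    def __call__(self, *args, **kwargs):
        self.count += 1
        return self.func(*args, **kwargs)

@Counter
def convert_datetime(val):
    return pd.to_datetime(val, format="%Y-%m-%d %H:%M:%S")

# a 5-row CSV for illustration
csv_text = "dates\n" + "\n".join(f"2017-02-27 20:44:{s:02d}" for s in range(5)) + "\n"

pd.read_csv(io.StringIO(csv_text), converters={"dates": convert_datetime})
per_value_calls = convert_datetime.count  # one call per value

convert_datetime.count = 0
df = pd.read_csv(io.StringIO(csv_text))
df["dates"] = convert_datetime(df["dates"])
vectorized_calls = convert_datetime.count  # a single call on the whole column

print(per_value_calls, vectorized_calls)  # 5 1
```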

Question

Why is that? I expected vectorized functions passed to the converters parameter to be executed on all values at once, not on each value individually.

Accepted answer by miradulo

From the docs for read_csv (emphasis mine):

converters: dict, default None

  • Dict of functions for converting values in certain columns. Keys can either be integers or column labels

The idea of the converters keyword parameter is to provide functions which act upon individual values, not entire columns. This can be seen by redefining the converter function:

import datetime

def convert_datetime(val):
    return datetime.datetime.strptime(val, "%Y-%m-%d %H:%M:%S")

%timeit pd.read_csv("dummy", converters={"dates": convert_datetime})
# 1 loop, best of 3: 2.81 s per loop

And as you can imagine, this amounts to a lot of Python function calls.



As for why the converters parameter doesn't accept vectorized functions, my best guess would be that they offer somewhat less flexibility than the current implementation. The idea being that you can already parse necessary columns which require some parsing logic during the read (such as dates) with the vectorized parse_dates/date_parser parameters, and most further column operations can be done post-read with vectorized methods.

In other words, it is convenient to have vectorized methods inside read_csv for elements which have to end up as a different type (like datetimes). Beyond that, converters is just a convenience parameter which can flexibly act on individual values - any further vectorized processing can just as well be done post-read anyway.

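To illustrate that takeaway with a hypothetical non-date column (the price-cleanup example below is my own, not from the answer): a per-value converter can usually be replaced by an equivalent vectorized operation after the read, which is the pattern the answer recommends:

```python
import io
import pandas as pd

csv_text = "price\n$1.00\n$2.50\n$3.75\n"

# converters: the lambda runs once per cell
df_a = pd.read_csv(io.StringIO(csv_text),
                   converters={"price": lambda v: float(v.lstrip("$"))})

# post-read: one vectorized pass over the whole column
df_b = pd.read_csv(io.StringIO(csv_text))
df_b["price"] = df_b["price"].str.lstrip("$").astype(float)

print(df_a["price"].tolist() == df_b["price"].tolist())  # True
```

Both yield the same column; for large files the vectorized post-read version avoids the per-value Python call overhead described above.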