Pandas: slow date conversion
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must keep the same license and attribute it to the original authors (not me): StackOverflow.
Original question: http://stackoverflow.com/questions/29882573/
Pandas: slow date conversion
Asked by ppaulojr
I'm reading a huge CSV with a date field in the format YYYYMMDD, and I'm using the following lambda to convert it while reading:
import pandas as pd

df = pd.read_csv(filen,  # filen is the CSV path from the question
                 index_col=None,
                 header=None,
                 parse_dates=[0],
                 date_parser=lambda t: pd.to_datetime(str(t),
                                                      format='%Y%m%d',
                                                      errors='coerce'))  # errors='coerce' replaces the old coerce=True
This approach is very slow, though.
Any suggestions for improving it?
Answered by fixxxer
Try using this function for parsing dates:
import pandas as pd

def lookup(s):
    """
    This is an extremely fast approach to datetime parsing.
    For large data, the same dates are often repeated. Rather than
    re-parse these, we store all unique dates, parse them, and
    use a lookup to convert all dates.
    """
    dates = {date: pd.to_datetime(date) for date in s.unique()}
    return s.map(dates)
Use it like:
df['date-column'] = lookup(df['date-column'])
Benchmarks:
$ python date-parse.py
to_datetime: 5799 ms
dateutil: 5162 ms
strptime: 1651 ms
manual: 242 ms
lookup: 32 ms
Edit: It would be wise to pass the expected date format to the to_datetime function.
Source: https://github.com/sanand0/benchmarks/tree/master/date-parse
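For reference, here is a minimal, self-contained sketch of the lookup trick. The synthetic data and any timings it prints are illustrative assumptions, not a reproduction of the benchmark above:

import time
import pandas as pd

# Many rows, few unique dates: the case where the lookup pays off
s = pd.Series(['20120608', '20130608', '20140308'] * 100000)

start = time.perf_counter()
naive = pd.to_datetime(s, format='%Y%m%d')
print('to_datetime: %.0f ms' % ((time.perf_counter() - start) * 1000))

start = time.perf_counter()
mapping = {d: pd.to_datetime(d, format='%Y%m%d') for d in s.unique()}
fast = s.map(mapping)  # parse each unique date once, then map
print('lookup:      %.0f ms' % ((time.perf_counter() - start) * 1000))

assert (naive == fast).all()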
Answered by Sam Davey
Great suggestion @EdChum! As @EdChum suggests, using infer_datetime_format=True can be significantly faster. Below is my example.
I have a file of temperature data from a sensor log, which looks like this:
RecNum,Date,LocationID,Unused
1,11/7/2013 20:53:01,13.60,"117","1",
2,11/7/2013 21:08:01,13.60,"117","1",
3,11/7/2013 21:23:01,13.60,"117","1",
4,11/7/2013 21:38:01,13.60,"117","1",
...
My code reads the CSV and parses the date column (parse_dates=['Date']). With infer_datetime_format=False, it takes 8 min 8 sec:
Tue Jan 24 12:18:27 2017 - Loading the Temperature data file.
Tue Jan 24 12:18:27 2017 - Temperature file is 88.172 MB.
Tue Jan 24 12:18:27 2017 - Loading into memory. Please be patient.
Tue Jan 24 12:26:35 2017 - Success: loaded 2,169,903 records.
With infer_datetime_format=True, it takes 13 sec:
Tue Jan 24 13:19:58 2017 - Loading the Temperature data file.
Tue Jan 24 13:19:58 2017 - Temperature file is 88.172 MB.
Tue Jan 24 13:19:58 2017 - Loading into memory. Please be patient.
Tue Jan 24 13:20:11 2017 - Success: loaded 2,169,903 records.
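As a sketch, the read_csv call looks like the following; the file name is an assumption, and note that infer_datetime_format was deprecated in pandas 2.0, where format inference became the default:

import pandas as pd

# File name is an assumption; 'Date' matches the sample header above
df = pd.read_csv('temperature_log.csv',
                 parse_dates=['Date'],
                 infer_datetime_format=True)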
Answered by firelynx
Streamlined date parsing with caching
Reading all the data first and converting it afterwards will always be slower than converting while reading the CSV: if you do it during the read, you don't iterate over the data twice, and you don't have to keep the raw strings in memory.
We can define our own date parser that utilizes a cache for the dates it has already seen.
import pandas as pd

cache = {}

def cached_date_parser(s):
    if s in cache:
        return cache[s]
    dt = pd.to_datetime(s, format='%Y%m%d', errors='coerce')  # errors='coerce' replaces the old coerce=True
    cache[s] = dt
    return dt

df = pd.read_csv(filen,
                 index_col=None,
                 header=None,
                 parse_dates=[0],
                 date_parser=cached_date_parser)
This has the same advantage as @fixxxer's answer of parsing each string only once, with the added bonus of not having to read all the data and THEN parse it, saving you memory and processing time.
Answered by EdChum
No need to specify a date_parser; pandas is able to parse this without any trouble, and it will be much faster:
In [21]:
import io
import pandas as pd
t="""date,val
20120608,12321
20130608,12321
20140308,12321"""
df = pd.read_csv(io.StringIO(t), parse_dates=[0])
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3 entries, 0 to 2
Data columns (total 2 columns):
date 3 non-null datetime64[ns]
val 3 non-null int64
dtypes: datetime64[ns](1), int64(1)
memory usage: 72.0 bytes
In [22]:
df
Out[22]:
date val
0 2012-06-08 12321
1 2013-06-08 12321
2 2014-03-08 12321
Answered by Marcus Müller
Try the standard library:
import datetime
parser = lambda t: datetime.datetime.strptime(str(t), "%Y%m%d")
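It can be plugged straight into read_csv as the date_parser (a sketch reusing the variables from the question):

df = pd.read_csv(filen, index_col=None, header=None,
                 parse_dates=[0], date_parser=parser)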
However, I don't really know if this is much faster than pandas.
Since your format is so simple, what about
def parse(t):
    string_ = str(t)
    return datetime.date(int(string_[:4]), int(string_[4:6]), int(string_[6:]))
EDIT: You say you need to take care of invalid data.
def parse(t):
    string_ = str(t)
    try:
        return datetime.date(int(string_[:4]), int(string_[4:6]), int(string_[6:]))
    except ValueError:
        return default_datetime  # you should define that somewhere else
All in all, I'm a bit conflicted about the validity of your problem:
- you need to be fast, but still you get your data from a CSV
- you need to be fast, but still need to deal with invalid data
That's kind of contradictory; my personal approach here would be to assume that your "huge" CSV only needs to be brought into a better-performing format once. You either shouldn't care about the speed of that conversion process (because it only happens once), or you should change whatever produces the CSV to give you better data: there are plenty of formats that don't rely on string parsing.
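One way to do that one-time conversion is to parse the dates once and persist the frame in a binary format; the file names and the choice of pickle here are assumptions for illustration:

import pandas as pd

# Parse once, store the parsed frame; later loads skip string parsing entirely
df = pd.read_csv('huge.csv', header=None, parse_dates=[0])
df.to_pickle('huge.pkl')          # or df.to_parquet('huge.parquet')

df = pd.read_pickle('huge.pkl')   # fast reload, dates already datetime64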
Answered by srikar saggurthi
If your datetime column carries a UTC timestamp and you only need part of it, convert it to a string, slice what you need, and then apply the code below for much faster access.
created_at
2018-01-31 15:15:08 UTC
2018-01-31 15:16:02 UTC
2018-01-31 15:27:10 UTC
2018-02-01 07:05:55 UTC
2018-02-01 08:50:14 UTC
df["date"] = df["created_at"].apply(lambda x: str(x)[:10])
df["date"] = pd.to_datetime(df["date"])
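As a side note, a vectorized slice avoids the per-row lambda entirely; this sketch assumes the column holds strings shaped like the sample above:

df["date"] = pd.to_datetime(df["created_at"].str[:10])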
Answered by arod
I have a CSV with ~150k rows. After trying almost all the suggestions in this post, I found it about 25% faster to:
- read the file row by row using Python 3.7's native csv.reader,
- convert all 4 numeric columns using float(), and
- parse the date column with datetime.datetime.fromisoformat(),
and Behold:
- finally convert the list to a DataFrame (!) (see the sketch below)
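A sketch of that approach, where the file name and column layout (one ISO-format date column followed by four numeric columns) are assumptions:

import csv
import datetime
import pandas as pd

rows = []
with open('data.csv', newline='') as f:  # file name is an assumption
    reader = csv.reader(f)
    header = next(reader)
    for rec in reader:
        # assumed layout: ISO date first, then 4 numeric columns
        rows.append([datetime.datetime.fromisoformat(rec[0]),
                     float(rec[1]), float(rec[2]),
                     float(rec[3]), float(rec[4])])

df = pd.DataFrame(rows, columns=header)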
It baffles me how this can be faster than native pandas pd.read_csv(...); can someone explain?
Answered by ritchie46
Since pandas version 0.25, the function pandas.read_csv accepts a cache_dates=bool keyword argument (which defaults to True). So there is no need to write your own caching function as done in the accepted answer.
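With a modern pandas, this reduces the original call to the following sketch (reusing the question's variables):

import pandas as pd

df = pd.read_csv(filen,
                 index_col=None,
                 header=None,
                 parse_dates=[0],
                 cache_dates=True)  # the default since 0.25, shown explicitly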