Disclaimer: this page reproduces a popular Stack Overflow question and its answer under the CC BY-SA 4.0 license. You are free to use and share it, but you must comply with the same license and attribute it to the original authors (not me): Stack Overflow.
Original source: http://stackoverflow.com/questions/21906715/
How to get pandas.read_csv() to infer datetime and timedelta types from CSV file columns?
Asked by Sean Hammond
pandas.read_csv() infers the types of columns, but I can't get it to infer any datetime or timedelta type (e.g. datetime64, timedelta64) for columns whose values look like obvious datetimes and time deltas.
Here's an example CSV file:
datetime,timedelta,integer,number,boolean,string
20111230 00:00:00,one hour,10,1.6,True,Foobar
And some code to read it with pandas:
dataframe = pandas.read_csv(path)
The types of the columns on that dataframe come out as object, object, int, float, bool, object. They're all as I would expect except the first two columns, which I want to be datetime and timedelta.
Is it possible to get pandas to automatically detect datetime and timedelta columns?
(I don't want to have to tell pandas which columns are datetimes and timedeltas, or tell it the formats. I want it to try to detect them automatically, like it does for int, float and bool columns.)
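The behaviour described in the question can be reproduced without a file on disk. The sketch below is only an illustration: it feeds the sample CSV through io.StringIO (my choice, not part of the original question) and prints the inferred dtypes.

```python
import io

import pandas as pd

# The sample CSV from the question, held in memory instead of on disk.
csv_text = (
    "datetime,timedelta,integer,number,boolean,string\n"
    "20111230 00:00:00,one hour,10,1.6,True,Foobar\n"
)

dataframe = pd.read_csv(io.StringIO(csv_text))

# Without parse_dates, the first two columns are read as plain strings,
# while the integer, float and bool columns are inferred as expected.
print(dataframe.dtypes)
```

The exact label shown for the string columns may differ between pandas versions, but neither of the first two columns comes back as a datetime or timedelta type.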
Answered by EdChum
One thing you can do is define your own date parser using strptime. This will handle your date format, although it isn't automatic:
In [59]:
import pandas as pd
import datetime as dt
def parse_dates(x):
    # Parse timestamps formatted like '20111230 00:00:00'.
    return dt.datetime.strptime(x, '%Y%m%d %H:%M:%S')
# dict for word lookup, conversion
word_to_int={'zero':0,
     'one':1,
     'two':2,
     'three':3,
     'four':4,
     'five':5,
     'six':6,
     'seven':7,
     'eight':8,
     'nine':9}
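def str_to_time_delta(x):
    # Convert strings such as 'one hour' to a timedelta.
    # Strings that do not mention hours become a zero timedelta.
    hours = 0
    if 'hour' in x.lower():
        hours = word_to_int[x[0:x.find(' ')].lower()]
    return dt.timedelta(hours=hours)
df = pd.read_csv(r'c:\temp1.txt', parse_dates=[0], date_parser=parse_dates)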
df.dtypes
Out[59]:
datetime     datetime64[ns]
timedelta            object
integer               int64
number              float64
boolean                bool
string               object
dtype: object
Then, to convert the timedelta column, use the dict and function defined above:
In [60]:
df['timedelta'] = df['timedelta'].map(str_to_time_delta)
In [61]:
df.dtypes
Out[61]:
datetime      datetime64[ns]
timedelta    timedelta64[ns]
integer                int64
number               float64
boolean                 bool
string                object
dtype: object
In [62]:
df
Out[62]:
             datetime  timedelta  integer  number boolean  string
0 2011-12-30 00:00:00   01:00:00       10     1.6    True  Foobar
[1 rows x 6 columns]
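The lookup-based conversion above only handles hours. As a rough sketch of how the same idea could be extended, the standalone function below also accepts a few other units and plain digits, and raises an error instead of guessing when the text does not fit. The names WORD_TO_INT, UNIT_TO_KWARG and words_to_timedelta are hypothetical and chosen only for this illustration.

```python
import datetime as dt

# Hypothetical lookup tables; extend them to match the data at hand.
WORD_TO_INT = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4,
    "five": 5, "six": 6, "seven": 7, "eight": 8, "nine": 9,
}
UNIT_TO_KWARG = {
    "day": "days",
    "hour": "hours",
    "minute": "minutes",
    "second": "seconds",
}


def words_to_timedelta(text):
    """Convert strings such as 'one hour' or '3 minutes' to a timedelta.

    Raises ValueError for anything that does not match '<count> <unit>'.
    """
    parts = text.strip().lower().split()
    if len(parts) != 2:
        raise ValueError(f"expected '<count> <unit>', got {text!r}")
    count_word, unit_word = parts

    unit = unit_word.rstrip("s")  # accept both 'hour' and 'hours'
    if unit not in UNIT_TO_KWARG:
        raise ValueError(f"unknown unit in {text!r}")

    if count_word in WORD_TO_INT:
        count = WORD_TO_INT[count_word]
    elif count_word.isdigit():
        count = int(count_word)
    else:
        raise ValueError(f"unknown count in {text!r}")

    # Build the keyword argument dynamically, e.g. timedelta(hours=1).
    return dt.timedelta(**{UNIT_TO_KWARG[unit]: count})
```

It could be applied to a column in the same way as the function above, for example with df['timedelta'].map(words_to_timedelta).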
To answer your principal question, I don't know of a way to do this automatically.
EDIT
Instead of my convoluted mapping function, you can simply do this:
df['timedelta'] = pd.to_timedelta(df['timedelta'])
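As a small illustration of pd.to_timedelta on its own, the sketch below converts a few strings that combine a numeric count with a unit. The sample values are mine, not from the question. Whether a spelled-out count such as "one hour" is accepted is not something I would rely on, so the second half uses errors="coerce" to show one way of spotting values that could not be parsed.

```python
import pandas as pd

# Strings with a numeric count and a unit (sample values chosen for illustration).
durations = pd.Series(["1 hour", "90 minutes", "1 days 02:00:00"])
converted = pd.to_timedelta(durations)
print(converted)

# With errors="coerce", values that cannot be parsed become NaT instead of
# raising, so they can be located with isna() afterwards.
maybe = pd.to_timedelta(pd.Series(["1 hour", "one hour"]), errors="coerce")
unparsed = maybe.isna()
print(unparsed)
```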
Further edit
As noted by @Jeff, you can do this instead of using strptime when reading the CSV (in version 0.13.1 and above, though):
df = pd.read_csv(r'c:\temp1.txt', parse_dates=[0], infer_datetime_format=True)
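Note that, as far as I know, both date_parser and infer_datetime_format were deprecated in pandas 2.0, so the calls above may emit warnings or stop working on newer releases. One approach that should not depend on those parameters is to read the column as text and convert it afterwards with an explicit format. This is only a sketch, using an in-memory copy of the sample CSV rather than the file path from the answer.

```python
import io

import pandas as pd

# In-memory copy of the sample CSV (an assumption made for this illustration).
csv_text = (
    "datetime,timedelta,integer,number,boolean,string\n"
    "20111230 00:00:00,one hour,10,1.6,True,Foobar\n"
)

df = pd.read_csv(io.StringIO(csv_text))

# Convert after reading, spelling out the same format used with strptime above.
df["datetime"] = pd.to_datetime(df["datetime"], format="%Y%m%d %H:%M:%S")
print(df.dtypes)
```

Like the strptime approach, this still requires naming the column and its format, so it does not make the detection automatic.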

