Original question: http://stackoverflow.com/questions/22405252/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Parse date, time and nanoseconds as datetime objects using pandas
Asked by abudis
I have ASCII files with a rather odd timestamp:
DATAH   DATE    TIME    SECONDS NANOSECONDS D
DATA    2012-06-04  23:49:15    1338853755  700000000   0.00855577
DATA    2012-06-04  23:49:15    1338853755  800000000   0.00805482
DATA    2012-06-04  23:49:15    1338853755  900000000   -0.00537284
DATA    2012-06-04  23:49:16    1338853756  0   -0.0239447
Basically the timestamp is divided into 4 columns: DATE, TIME, SECONDS and NANOSECONDS.
I'd like to read the file as a pandas DataFrame with DATE, TIME and NANOSECONDS parsed as datetime objects, which are used as the index:
import datetime as dt
import pandas as pd
parse = lambda x: dt.datetime.strptime(x, '%Y-%m-%d %H:%M:%S %f')
df = pd.read_csv('data.txt', sep='\t', parse_dates=[['DATE', 'TIME', 'NANOSECONDS']], index_col=0, date_parser=parse)
But this fails, because the nanosecond values have 9 digits instead of the 6 required by the %f format. The above code works if I manually remove the 3 extra zeroes from the values in the NANOSECONDS column.
Could you please show me how I can read in the sample file as a pandas DataFrame object using the DATE, TIME and NANOSECONDS columns as the index?
[UPDATE] Using %f000 as suggested by behzad.nouri works if the NANOSECONDS column doesn't contain 0 values. So, apparently this is what's causing the problem now.
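To make the two failure modes described in the question concrete, here is a minimal stdlib-only sketch. The helper name try_parse and the sample strings are illustrative assumptions; the strings mimic the space-joined DATE, TIME and NANOSECONDS fields from the sample file.

```python
import datetime as dt

def try_parse(text, fmt):
    """Return the parsed datetime, or None if strptime rejects the input."""
    try:
        return dt.datetime.strptime(text, fmt)
    except ValueError:
        return None

nine_digits = '2012-06-04 23:49:15 700000000'
zero_nanos = '2012-06-04 23:49:16 0'

# %f accepts one to six digits, so three digits are left unconverted.
plain_full = try_parse(nine_digits, '%Y-%m-%d %H:%M:%S %f')

# With '%f000', %f takes six digits and the literal '000' matches the rest.
padded_full = try_parse(nine_digits, '%Y-%m-%d %H:%M:%S %f000')

# A bare '0' leaves nothing for the literal '000' to match.
padded_zero = try_parse(zero_nanos, '%Y-%m-%d %H:%M:%S %f000')

print(plain_full, padded_full, padded_zero)
```

Only the second call succeeds, which matches the behaviour reported above: the plain %f format rejects 9-digit values, and the %f000 workaround breaks on rows where NANOSECONDS is 0.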
Answered by Jeff
This will be much faster than using the read_csv date parser to do this conversion.
In [6]: data = """DATAH   DATE    TIME    SECONDS NANOSECONDS D
   ...: DATA    2012-06-04  23:49:15    1338853755  700000000   0.00855577
   ...: DATA    2012-06-04  23:49:15    1338853755  800000000   0.00805482
   ...: DATA    2012-06-04  23:49:15    1338853755  900000000   -0.00537284
   ...: DATA    2012-06-04  23:49:16    1338853756  0   -0.0239447"""
In [7]: df = pd.read_csv(StringIO(data), sep=r'\s+')
In [8]: df
Out[8]: 
  DATAH        DATE      TIME     SECONDS  NANOSECONDS         D
0  DATA  2012-06-04  23:49:15  1338853755    700000000  0.008556
1  DATA  2012-06-04  23:49:15  1338853755    800000000  0.008055
2  DATA  2012-06-04  23:49:15  1338853755    900000000 -0.005373
3  DATA  2012-06-04  23:49:16  1338853756            0 -0.023945
[4 rows x 6 columns]
In [9]: df.dtypes
Out[9]: 
DATAH           object
DATE            object
TIME            object
SECONDS          int64
NANOSECONDS      int64
D              float64
dtype: object
In [13]: pd.to_datetime(df['SECONDS']+df['NANOSECONDS'].astype(float)/1e9, unit='s')
Out[13]: 
0   2012-06-04 23:49:15.700000
1   2012-06-04 23:49:15.800000
2   2012-06-04 23:49:15.900000
3          2012-06-04 23:49:16
dtype: datetime64[ns]
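As a self-contained variant of the session above, the sketch below builds the timestamps with integer arithmetic instead of floats and uses them as the index. This is a swapped-in detail, not what the answer itself does: the float version is fine for this sample, but a float64 cannot in general hold ten digits of epoch seconds plus nine fractional digits exactly. The index name timestamp is a hypothetical choice, and the sample data is inlined with single spaces between fields.

```python
from io import StringIO

import pandas as pd

data = """DATAH DATE TIME SECONDS NANOSECONDS D
DATA 2012-06-04 23:49:15 1338853755 700000000 0.00855577
DATA 2012-06-04 23:49:15 1338853755 800000000 0.00805482
DATA 2012-06-04 23:49:15 1338853755 900000000 -0.00537284
DATA 2012-06-04 23:49:16 1338853756 0 -0.0239447"""

# Split on any run of whitespace, as in the session above.
df = pd.read_csv(StringIO(data), sep=r'\s+')

# Whole seconds and nanoseconds combined as int64 nanoseconds since the epoch.
total_ns = df['SECONDS'] * 10**9 + df['NANOSECONDS']

# 'timestamp' is a hypothetical index name chosen for this example.
df.index = pd.DatetimeIndex(pd.to_datetime(total_ns, unit='ns'), name='timestamp')

print(df[['D']])
```

Note that this uses the SECONDS column rather than DATE and TIME, so it assumes SECONDS holds the same instant as a Unix epoch value, as it does in the sample rows.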
Answered by behzad.nouri
Try:
parse = lambda x: dt.datetime.strptime(x + '0'*(29 - len(x)), '%Y-%m-%d %H:%M:%S %f000')
I think this:
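import re

def parse(t):
    # Left-pad the trailing nanosecond field with zeros to 9 digits.
    # '+' rather than '*' avoids a second, empty match at the end of the string.
    t = re.sub(r'([0-9]+)$', lambda m: '0'*(9 - len(m.group(1))) + m.group(1), t)
    # Drop the last 3 digits so that %f sees at most 6 (microseconds).
    return dt.datetime.strptime(t[:-3], '%Y-%m-%d %H:%M:%S %f')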
is safer because it pads the number with zeros on the left; it makes sure the nanosecond value has 9 digits and then drops the last 3, leaving the 6 microsecond digits that %f accepts.
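The same idea (treat the last field as a nanosecond count and keep only microsecond precision) can also be sketched without a regular expression. This is an alternative formulation, not the answerer's code: parse_ns is a hypothetical helper, and it assumes the fields arrive joined by single spaces with the nanosecond count last.

```python
import datetime as dt

def parse_ns(text):
    """Parse 'YYYY-MM-DD HH:MM:SS <nanoseconds>' into a datetime.

    Hypothetical helper: datetime stores microseconds only, so the
    nanosecond count is truncated to microsecond precision.
    """
    # Split off the trailing nanosecond field from the date and time.
    stamp, nanos = text.rsplit(' ', 1)
    base = dt.datetime.strptime(stamp, '%Y-%m-%d %H:%M:%S')
    # Integer division by 1000 converts nanoseconds to whole microseconds.
    return base + dt.timedelta(microseconds=int(nanos) // 1000)

samples = [
    '2012-06-04 23:49:15 700000000',
    '2012-06-04 23:49:16 0',
]
parsed = [parse_ns(s) for s in samples]
print(parsed)
```

Because the nanosecond field is converted with int(), values of any length from 1 to 9 digits are handled the same way, including 0.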

