Parsing date, time and nanoseconds into datetime objects with Pandas

Question

Asked by abudis

I have ASCII files with a rather odd timestamp:

DATAH   DATE    TIME    SECONDS NANOSECONDS D
DATA    2012-06-04  23:49:15    1338853755  700000000   0.00855577
DATA    2012-06-04  23:49:15    1338853755  800000000   0.00805482
DATA    2012-06-04  23:49:15    1338853755  900000000   -0.00537284
DATA    2012-06-04  23:49:16    1338853756  0   -0.0239447

Basically the timestamp is split across 4 columns: DATE, TIME, SECONDS and NANOSECONDS. I'd like to read the file as a pandas DataFrame, with DATE, TIME and NANOSECONDS combined into datetime objects that are used as the index:

import datetime as dt
import pandas as pd

parse = lambda x: dt.datetime.strptime(x, '%Y-%m-%d %H:%M:%S %f')

df = pd.read_csv('data.txt', sep='\t', parse_dates=[['DATE', 'TIME', 'NANOSECONDS']], index_col=0, date_parser=parse)

But this fails, because the nanosecond values have 9 digits, while the %f format accepts at most 6. The above code works if I manually remove the 3 extra zeroes from the values in the NANOSECONDS column. Could you please show me how I can read the sample file into a pandas DataFrame object using the DATE, TIME and NANOSECONDS columns as the index?

[UPDATE] Using %f000, as suggested by behzad.nouri, works if the NANOSECONDS column doesn't contain 0 values. So apparently that is what's causing the problem now.
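
The two failure modes described above can be reproduced with the standard library alone. This is a minimal sketch, assuming the three columns are joined with single spaces as in the read_csv call; the helper name `try_parse` is made up for illustration.

```python
import datetime as dt

def try_parse(text, fmt):
    """Return the parsed datetime, or None if strptime rejects the input."""
    try:
        return dt.datetime.strptime(text, fmt)
    except ValueError:
        return None

full = '2012-06-04 23:49:15 700000000'  # 9-digit nanosecond field
zero = '2012-06-04 23:49:16 0'          # nanosecond field is a bare 0

# %f accepts 1 to 6 digits, so the last three digits are left unconverted.
plain_full = try_parse(full, '%Y-%m-%d %H:%M:%S %f')

# A literal '000' after %f absorbs the extra digits for 9-digit values...
padded_full = try_parse(full, '%Y-%m-%d %H:%M:%S %f000')

# ...but a bare '0' has no trailing '000' to match the literal.
padded_zero = try_parse(zero, '%Y-%m-%d %H:%M:%S %f000')

print(plain_full, padded_full, padded_zero)
```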

Answer 1

Answered by Jeff

This will be much faster than using the read_csv date parser to do the conversion.

In [6]: data = """DATAH   DATE    TIME    SECONDS NANOSECONDS D
   ...: DATA    2012-06-04  23:49:15    1338853755  700000000   0.00855577
   ...: DATA    2012-06-04  23:49:15    1338853755  800000000   0.00805482
   ...: DATA    2012-06-04  23:49:15    1338853755  900000000   -0.00537284
   ...: DATA    2012-06-04  23:49:16    1338853756  0   -0.0239447"""

In [7]: from io import StringIO; import pandas as pd
   ...: df = pd.read_csv(StringIO(data), sep=r'\s+')

In [8]: df
Out[8]: 
  DATAH        DATE      TIME     SECONDS  NANOSECONDS         D
0  DATA  2012-06-04  23:49:15  1338853755    700000000  0.008556
1  DATA  2012-06-04  23:49:15  1338853755    800000000  0.008055
2  DATA  2012-06-04  23:49:15  1338853755    900000000 -0.005373
3  DATA  2012-06-04  23:49:16  1338853756            0 -0.023945

[4 rows x 6 columns]

In [9]: df.dtypes
Out[9]: 
DATAH           object
DATE            object
TIME            object
SECONDS          int64
NANOSECONDS      int64
D              float64
dtype: object

In [13]: pd.to_datetime(df['SECONDS']+df['NANOSECONDS'].astype(float)/1e9, unit='s')
Out[13]: 
0   2012-06-04 23:49:15.700000
1   2012-06-04 23:49:15.800000
2   2012-06-04 23:49:15.900000
3          2012-06-04 23:49:16
dtype: datetime64[ns]
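
One caveat with the float division above: a float64 cannot hold a 10-digit epoch second plus 9 fractional digits exactly, so values that are not round tenths may lose their last digits. A variation that stays in integer arithmetic might look like the sketch below. It assumes SECONDS is a Unix epoch count that agrees with the DATE and TIME columns, as it does in the sample; the index name `timestamp` is chosen here for illustration.

```python
from io import StringIO

import pandas as pd

data = """DATAH DATE TIME SECONDS NANOSECONDS D
DATA 2012-06-04 23:49:15 1338853755 700000000 0.00855577
DATA 2012-06-04 23:49:15 1338853755 800000000 0.00805482
DATA 2012-06-04 23:49:15 1338853755 900000000 -0.00537284
DATA 2012-06-04 23:49:16 1338853756 0 -0.0239447"""

df = pd.read_csv(StringIO(data), sep=r'\s+')

# Total nanoseconds since the epoch, kept as int64 so no precision is lost.
total_ns = df['SECONDS'] * 10**9 + df['NANOSECONDS']

# Convert to datetime64[ns] and use the result as the index.
df = df.set_index(pd.to_datetime(total_ns, unit='ns').rename('timestamp'))

print(df[['D']])
```

The DATE and TIME string columns are not needed for this conversion and can be dropped afterwards if they are redundant.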

Answer 2

Answered by behzad.nouri

Try:

parse = lambda x: dt.datetime.strptime(x + '0'*(29 - len(x)), '%Y-%m-%d %H:%M:%S %f000')

I think this:

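import re

def parse(t):
    # Zero-pad the trailing nanosecond field to 9 digits. Use `+`, not `*`:
    # on Python 3.7+ `*` also matches the empty string at the end.
    t = re.sub(r'([0-9]+)$', lambda m: m.group(1).zfill(9), t)
    # Drop the last 3 digits so %f receives microseconds.
    return dt.datetime.strptime(t[:-3], '%Y-%m-%d %H:%M:%S %f')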

is safer, because it pads with zeros in front of the number: it makes sure the nanosecond value has 9 digits and then drops the last 3.
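
To illustrate the difference between the two padding strategies, here is a small self-contained sketch. The function names `parse_right_pad` and `parse_left_pad` and the 5000 ns input are made up for the comparison; the input is not taken from the sample file.

```python
import datetime as dt
import re

FMT = '%Y-%m-%d %H:%M:%S %f'

def parse_right_pad(t):
    # Append zeros until the string is 29 characters long, then let the
    # literal '000' in the format absorb the last three digits.
    return dt.datetime.strptime(t + '0' * (29 - len(t)), FMT + '000')

def parse_left_pad(t):
    # Zero-pad the trailing digit run on the left to 9 digits, then cut
    # the last three digits so %f sees microseconds.
    t = re.sub(r'([0-9]+)$', lambda m: m.group(1).zfill(9), t)
    return dt.datetime.strptime(t[:-3], FMT)

short = '2012-06-04 23:49:16 5000'  # 5000 ns, i.e. 5 microseconds

right = parse_right_pad(short)
left = parse_left_pad(short)

print(right.microsecond)  # 500000: '5000' became '500000000'
print(left.microsecond)   # 5: '5000' became '000005000'
```

Both variants agree on 9-digit values and on a bare 0; they differ only when the field has between 1 and 8 digits and is nonzero.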
