Disclaimer: this page is derived from a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license, cite the original URL and authors, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/45090567/

Date: 2020-09-14 03:59:44  Source: igfitidea

Pandas: Parsing dates in different columns with read_csv

python pandas parsing datetime dataframe

Asked by Arda Arslan

I have an ascii file where the dates are formatted as follows:

Jan 20 2015 00:00:00.000
Jan 20 2015 00:10:00.000
Jan 20 2015 00:20:00.000
Jan 20 2015 00:30:00.000
Jan 20 2015 00:40:00.000

When loading the file into pandas, each whitespace-separated field above gets its own column in a pandas dataframe. I've tried variations of the following:

from pandas import read_csv
from datetime import datetime

df = read_csv('file.txt', header=None, delim_whitespace=True,
              parse_dates={'datetime': [0, 1, 2, 3]},
              date_parser=lambda x: datetime.strptime(x, '%b %d %Y %H %M %S'))

I get a couple of errors:

TypeError: <lambda>() takes 1 positional argument but 4 were given
ValueError: time data 'Jun 29 2017 00:35:00.000' does not match format '%b %d %Y %H %M %S'

I'm confused because:

  1. I'm passing a dict to parse_dates to parse the different columns as a single date.
  2. I'm using: %b - abbreviated month name, %d - day of the month, %Y - year with century, %H - 24-hour, %M - minute, and %S - second

Anyone see what I'm doing incorrectly?

Edit:

I've tried date_parser=lambda x: datetime.strptime(x, '%b %d %Y %H:%M:%S'), which returns ValueError: unconverted data remains: .000

Edit 2:

I tried what @MaxU suggested in his update, but it was problematic because my original data is formatted like the following:

Jan   1  2017  00:00:00.000   123 456 789 111 222 333 

I'm only interested in the first 7 columns so I import my file with the following:

df = read_csv(fn, header=None, delim_whitespace=True, usecols=[0, 1, 2, 3, 4, 5, 6])

Then to create a column with datetime information from the first 4 columns I try:

df['datetime'] = to_datetime(df.ix[:, :3], format='%b %d %Y %H:%M:%S.%f')

However, this doesn't work because to_datetime expects "integer, float, string, datetime, list, tuple, 1-d array, Series" as the first argument, and df.ix[:, :3] returns a dataframe with the following format:

         0   1     2             3
0      Jan   1  2017  00:00:00.000

How do I feed every row of the first four columns to to_datetime such that I get one column of datetimes?

Edit 3:

I think I solved my second problem. I just use the following command and do everything when I read my file in (I was basically just missing %f to parse past the seconds):

df = read_csv(fileName, header=None, delim_whitespace=True,
              parse_dates={'datetime': [0, 1, 2, 3]},
              date_parser=lambda x: datetime.strptime(x, '%b %d %Y %H:%M:%S.%f'),
              usecols=[0, 1, 2, 3, 4, 5, 6])

The whole reason I wanted to parse manually instead of letting pandas handle it like @MaxU suggested was to see if manually feeding in instructions would be faster - and it is! From my tests the snippet above runs approximately 5-6 times faster than letting pandas infer parsing for you.
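
The format-string issue from the edits above can be reproduced with the standard library alone, without pandas. The sketch below is only an illustration: the variable names (`sample`, `error_message`, `parsed`) are made up for the example, and the sample string is one of the rows from the file shown at the top.

```python
from datetime import datetime

# One timestamp in the same layout as the rows in the question
sample = "Jan 20 2015 00:10:00.000"

# Without %f the trailing ".000" is left unparsed, so strptime raises ValueError
try:
    datetime.strptime(sample, "%b %d %Y %H:%M:%S")
    error_message = None
except ValueError as exc:
    error_message = str(exc)

# Adding ".%f" consumes the fractional seconds and the parse succeeds
parsed = datetime.strptime(sample, "%b %d %Y %H:%M:%S.%f")

print(error_message)
print(parsed)
```

The first format stops after the seconds, which is why the error mentions leftover data; the second one accounts for every character in the string.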

Accepted answer by MaxU

Pandas (tested with version 0.20.1) is smart enough to do it for you:

In [4]: pd.read_csv(fn, sep='\s+', parse_dates={'datetime': [0, 1, 2, 3]})
Out[4]:
             datetime
0 2015-01-20 00:10:00
1 2015-01-20 00:20:00
2 2015-01-20 00:30:00
3 2015-01-20 00:40:00

UPDATE: if all entries have the same format, you can try to do it this way:

df = pd.read_csv(fn, sep='~', names=['datetime'])
df['datetime'] = pd.to_datetime(df['datetime'], format='%b %d %Y %H:%M:%S.%f')
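
For the wider layout described in Edit 2 of the question, one possible way to feed the first four columns to to_datetime is to join them into a single string Series first. This is a minimal sketch under stated assumptions: the two sample rows and the names `raw`, `parts`, and `joined` are invented for the example, and the file is replaced by an in-memory buffer so the snippet is self-contained.

```python
import io

import pandas as pd

# Two made-up rows in the layout described in Edit 2 of the question
raw = (
    "Jan   1  2017  00:00:00.000   123 456 789 111 222 333\n"
    "Jan   1  2017  00:10:00.000   124 457 790 112 223 334\n"
)

# Split on runs of whitespace and keep only the first seven columns
df = pd.read_csv(io.StringIO(raw), header=None, sep=r"\s+", usecols=range(7))

# Day and year are read as integers, so cast to str before joining
parts = df[[0, 1, 2, 3]].astype(str)
joined = parts[0] + " " + parts[1] + " " + parts[2] + " " + parts[3]

# to_datetime accepts a Series of strings, unlike a multi-column DataFrame slice
df["datetime"] = pd.to_datetime(joined, format="%b %d %Y %H:%M:%S.%f")

print(df[["datetime", 4, 5, 6]])
```

Passing an explicit format here avoids per-value format inference, which is the same idea as the manual date_parser in the question.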

Answer by Diego Aguado

Have a go at this simpler approach:

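import pandas as pd
# header=None keeps the first line as data rather than using it as a header row
df = pd.read_csv('file.txt', header=None)
df.columns = ['date']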

df should be a dataframe with a single column. After that, try casting that column to datetime:

df['date'] = pd.to_datetime(df['date'])