Parse dates when YYYYMMDD and HH are in separate columns using pandas in Python
Original question: http://stackoverflow.com/questions/11615504/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Asked by Mauricio
I have a simple question about CSV files and parsing datetimes.
I have a CSV file that looks like this:
YYYYMMDD, HH, X
20110101, 1, 10
20110101, 2, 20
20110101, 3, 30
I would like to read it using pandas (read_csv) and get a dataframe indexed by the datetime. So far I've tried the following:
import pandas as pnd
# Raw string so that "\f" is not read as a form-feed escape
pnd.read_csv(r"..\file.csv", parse_dates=True, index_col=[0, 1])
The result I get is:
                        X
YYYYMMDD   HH
2011-01-01 2012-07-01  10
           2012-07-02  20
           2012-07-03  30
As you can see, parse_dates is converting the HH column into a different date.
Is there a simple and efficient way to properly combine the "YYYYMMDD" column with the "HH" column to get something like this?
                      X
Datetime
2011-01-01 01:00:00  10
2011-01-01 02:00:00  20
2011-01-01 03:00:00  30
Thanks in advance for the help.
Answered by Chang She
If you pass a list to index_col, it means you want to create a hierarchical index out of the columns in the list.
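To see the hierarchical index in practice, here is a minimal sketch. The csv_text variable is a hypothetical in-memory copy of the file from the question, and skipinitialspace=True is an assumption on my part to drop the blanks after the commas.

```python
import io

import pandas as pd

# Hypothetical in-memory copy of the CSV shown in the question
csv_text = """YYYYMMDD, HH, X
20110101, 1, 10
20110101, 2, 20
20110101, 3, 30
"""

# A list passed to index_col builds a two-level (hierarchical) index
df_multi = pd.read_csv(
    io.StringIO(csv_text),
    skipinitialspace=True,  # assumption: strip the blank after each comma
    index_col=[0, 1],
)

print(type(df_multi.index).__name__)
print(list(df_multi.index.names))
```

No dates are parsed here; the point is only that both columns end up as levels of one index rather than as a single datetime.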
In addition, the parse_dates keyword can be set to either True or a list/dict. If True, it tries to parse each index column as a date on its own; if given a nested list (or a dict), it combines the listed columns and parses them as a single date column.
In summary, what you want to do is:
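from datetime import datetime
import pandas as pd

# The combined columns arrive as one string, e.g. "20110101 1"
parse = lambda x: datetime.strptime(x, '%Y%m%d %H')
# Raw string so that "\f" is not read as a form-feed escape
pd.read_csv(r"..\file.csv", parse_dates=[['YYYYMMDD', 'HH']],
            index_col=0,
            date_parser=parse)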
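As far as I can tell, recent pandas releases deprecate both date_parser and the nested-list form of parse_dates, so the call above may warn or fail there. A possible alternative, sketched under that assumption, is to read the columns as they are and build the index afterwards from the date plus an hour offset. This is a different technique from the one in this answer; csv_text, skipinitialspace=True and the index name "Datetime" are my own choices for illustration.

```python
import io

import pandas as pd

# Hypothetical in-memory copy of the CSV shown in the question
csv_text = """YYYYMMDD, HH, X
20110101, 1, 10
20110101, 2, 20
20110101, 3, 30
"""

df = pd.read_csv(io.StringIO(csv_text), skipinitialspace=True)

# Parse the date part, then add the hour as a time offset
dates = pd.to_datetime(df["YYYYMMDD"].astype(str), format="%Y%m%d")
hours = pd.to_timedelta(df["HH"], unit="h")

df.index = pd.DatetimeIndex(dates + hours, name="Datetime")
df = df.drop(columns=["YYYYMMDD", "HH"])
print(df)
```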
Answered by K.-Michael Aye
I do this all the time, so I timed several approaches. The fastest I found is the following, roughly 3 times faster than Chang She's solution, at least in my case, when counting the total time for file parsing and date parsing:
First, parse the data file using pd.read_csv without parsing dates; I find that date parsing slows down the file reading quite a lot. Make sure that the columns of the CSV file are now columns in the dataframe df. Then:
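fmt = "%Y%m%d %H"  # renamed so the built-in format() is not shadowed
# astype(str) makes the concatenation work when the columns were read as integers
times = pd.to_datetime(df.YYYYMMDD.astype(str) + ' ' + df.HH.astype(str), format=fmt)
df.set_index(times, inplace=True)
# and maybe for cleanup
df = df.drop(['YYYYMMDD', 'HH'], axis=1)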
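For completeness, here is an end-to-end sketch of the two steps described above (read without date parsing, then convert). The in-memory csv_text, the dtype mapping, skipinitialspace=True, the zero-padding of the hour and the index name "Datetime" are assumptions I added so the snippet runs on its own; they are not part of the original answer.

```python
import io

import pandas as pd

# Hypothetical in-memory copy of the CSV shown in the question
csv_text = """YYYYMMDD, HH, X
20110101, 1, 10
20110101, 2, 20
20110101, 3, 30
"""

# Step 1: read the file with no date parsing; keep the date parts as strings
df = pd.read_csv(
    io.StringIO(csv_text),
    skipinitialspace=True,
    dtype={"YYYYMMDD": str, "HH": str},
)

# Step 2: join the two columns and parse them in one vectorized call
stamp = df["YYYYMMDD"] + " " + df["HH"].str.zfill(2)  # e.g. "20110101 01"
times = pd.to_datetime(stamp, format="%Y%m%d %H")

df = df.set_index(pd.DatetimeIndex(times, name="Datetime"))
df = df.drop(columns=["YYYYMMDD", "HH"])
print(df)
```

Passing an explicit format means pandas does not have to guess the layout of each string, which is presumably part of why this route is fast.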

