Original question: http://stackoverflow.com/questions/19350806/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
How to convert columns into one datetime column in pandas?
Asked by user1367204
I have a dataframe where the first 3 columns are 'MONTH', 'DAY', 'YEAR'
In each column there is an integer. Is there a Pythonic way to convert all three columns into datetimes while they are in the dataframe?
From:
M D Y Apples Oranges
5 6 1990 12 3
5 7 1990 14 4
5 8 1990 15 34
5 9 1990 23 21
into:
Datetimes Apples Oranges
1990-6-5 12 3
1990-7-5 14 4
1990-8-5 15 34
1990-9-5 23 21
Accepted answer by Jeff
In 0.13 (coming very soon), this is heavily optimized and quite fast (but still pretty fast in 0.12); both are orders of magnitude faster than looping.
In [3]: df
Out[3]:
M D Y Apples Oranges
0 5 6 1990 12 3
1 5 7 1990 14 4
2 5 8 1990 15 34
3 5 9 1990 23 21
In [4]: df.dtypes
Out[4]:
M int64
D int64
Y int64
Apples int64
Oranges int64
dtype: object
# in 0.12, use this
In [5]: pd.to_datetime((df.Y*10000+df.M*100+df.D).apply(str),format='%Y%m%d')
# in 0.13 the above or this will work
In [5]: pd.to_datetime(df.Y*10000+df.M*100+df.D,format='%Y%m%d')
Out[5]:
0 1990-05-06 00:00:00
1 1990-05-07 00:00:00
2 1990-05-08 00:00:00
3 1990-05-09 00:00:00
dtype: datetime64[ns]
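To get the frame asked for in the question, the resulting Series can be stored as a new column and the three integer columns dropped. The following is a minimal sketch, assuming a recent pandas and the column names M, D, Y used above; the Datetimes column name is taken from the question's desired output, and the intermediate variable names are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'M': [5, 5, 5, 5],
                   'D': [6, 7, 8, 9],
                   'Y': [1990, 1990, 1990, 1990],
                   'Apples': [12, 14, 15, 23],
                   'Oranges': [3, 4, 34, 21]})

# Encode each row as the integer YYYYMMDD, then parse its string form.
# Casting to str first avoids relying on how a given pandas version
# handles integer input together with a format string.
encoded = (df['Y'] * 10000 + df['M'] * 100 + df['D']).astype(str)
datetimes = pd.to_datetime(encoded, format='%Y%m%d')

# Put the new column first and drop the original integer columns.
result = df.drop(columns=['M', 'D', 'Y'])
result.insert(0, 'Datetimes', datetimes)
print(result)
```

Multiplying by 10000 and 100 pads the month and day to two digits, which is why the format '%Y%m%d' always matches.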
Answer by jezrael
In version 0.18.1 you can use to_datetime, but:
- The names of the columns have to be year, month, day, hour, minute and second.
- The minimal required columns are year, month and day.
Sample:
import pandas as pd
df = pd.DataFrame({'year': [2015, 2016],
'month': [2, 3],
'day': [4, 5],
'hour': [2, 3],
'minute': [10, 30],
'second': [21,25]})
print(df)
day hour minute month second year
0 4 2 10 2 21 2015
1 5 3 30 3 25 2016
print(pd.to_datetime(df[['year', 'month', 'day']]))
0 2015-02-04
1 2016-03-05
dtype: datetime64[ns]
print(pd.to_datetime(df[['year', 'month', 'day', 'hour']]))
0 2015-02-04 02:00:00
1 2016-03-05 03:00:00
dtype: datetime64[ns]
print(pd.to_datetime(df[['year', 'month', 'day', 'hour', 'minute']]))
0 2015-02-04 02:10:00
1 2016-03-05 03:30:00
dtype: datetime64[ns]
print(pd.to_datetime(df))
0 2015-02-04 02:10:21
1 2016-03-05 03:30:25
dtype: datetime64[ns]
Another solution is to convert to a dictionary:
print(df)
M D Y Apples Oranges
0 5 6 1990 12 3
1 5 7 1990 14 4
2 5 8 1990 15 34
3 5 9 1990 23 21
print(pd.to_datetime(dict(year=df.Y, month=df.M, day=df.D)))
0 1990-05-06
1 1990-05-07
2 1990-05-08
3 1990-05-09
dtype: datetime64[ns]
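If the frame keeps its M, D, Y names, the same to_datetime call can also be fed a renamed copy of just those three columns. This is a minimal sketch, assuming pandas 0.18.1 or later; the Datetimes column name comes from the question, and the parts variable is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'M': [5, 5], 'D': [6, 7], 'Y': [1990, 1990],
                   'Apples': [12, 14], 'Oranges': [3, 4]})

# Map the question's column names onto the names to_datetime expects.
parts = df[['Y', 'M', 'D']].rename(columns={'Y': 'year', 'M': 'month', 'D': 'day'})

# Assemble one datetime per row and store it next to the other columns.
df['Datetimes'] = pd.to_datetime(parts)
print(df[['Datetimes', 'Apples', 'Oranges']])
```

Selecting only the three date columns before the call matters, because to_datetime rejects a frame that contains columns it does not recognise as date parts.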
Answer by unutbu
Here is an alternative which uses NumPy datetime64 and timedelta64 arithmetic. It appears to be a bit faster for small DataFrames and much faster for larger DataFrames:
import numpy as np
import pandas as pd
df = pd.DataFrame({'M':[1,2,3,4], 'D':[6,7,8,9], 'Y':[1990,1991,1992,1993]})
# D M Y
# 0 6 1 1990
# 1 7 2 1991
# 2 8 3 1992
# 3 9 4 1993
y = np.array(df['Y']-1970, dtype='<M8[Y]')
m = np.array(df['M']-1, dtype='<m8[M]')
d = np.array(df['D']-1, dtype='<m8[D]')
dates2 = pd.Series(y+m+d)
# 0 1990-01-06
# 1 1991-02-07
# 2 1992-03-08
# 3 1993-04-09
# dtype: datetime64[ns]
In [214]: df = pd.concat([df]*1000)
In [215]: %timeit pd.to_datetime((df['Y']*10000+df['M']*100+df['D']).astype('int'), format='%Y%m%d')
100 loops, best of 3: 4.87 ms per loop
In [216]: %timeit pd.Series(np.array(df['Y']-1970, dtype='<M8[Y]')+np.array(df['M']-1, dtype='<m8[M]')+np.array(df['D']-1, dtype='<m8[D]'))
1000 loops, best of 3: 839 μs per loop
Here's a helper function to make this easier to use:
def combine64(years, months=1, days=1, weeks=None, hours=None, minutes=None,
              seconds=None, milliseconds=None, microseconds=None, nanoseconds=None):
    years = np.asarray(years) - 1970
    months = np.asarray(months) - 1
    days = np.asarray(days) - 1
    types = ('<M8[Y]', '<m8[M]', '<m8[D]', '<m8[W]', '<m8[h]',
             '<m8[m]', '<m8[s]', '<m8[ms]', '<m8[us]', '<m8[ns]')
    vals = (years, months, days, weeks, hours, minutes, seconds,
            milliseconds, microseconds, nanoseconds)
    return sum(np.asarray(v, dtype=t) for t, v in zip(types, vals)
               if v is not None)
In [437]: combine64(df['Y'], df['M'], df['D'])
Out[437]: array(['1990-01-06', '1991-02-07', '1992-03-08', '1993-04-09'], dtype='datetime64[D]')
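One consequence of building dates by addition is worth keeping in mind: the day offset is simply added to the first of the month, so an impossible calendar date rolls over into the next month instead of being rejected, whereas pd.to_datetime raises an error for such input by default. The following sketch illustrates this with made-up sample data (the "February 30" row is an assumption for demonstration only).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'M': [1, 2], 'D': [6, 30], 'Y': [1990, 1990]})

# Same arithmetic as above: years since 1970, then month and day offsets.
y = (df['Y'].to_numpy() - 1970).astype('M8[Y]')
m = (df['M'].to_numpy() - 1).astype('m8[M]')
d = (df['D'].to_numpy() - 1).astype('m8[D]')
dates = y + m + d

# The second row asks for "February 30" and silently lands in March.
print(dates)
```

If the input may contain invalid day or month values, validating them first (or using pd.to_datetime) is the safer choice.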
Answer by user1367204
I re-approached the problem and I think I found a solution. I initialized the csv file in the following way:
pandas_object = DataFrame(read_csv('/Path/to/csv/file', parse_dates=True, index_col = [2,0,1] ))
where index_col = [2,0,1] represents the columns of [year, month, day].
The only problem now is that I have three new index columns: one represents the year, another the month, and another the day.
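One way to end up with a single datetime index instead of three index levels is to read the file without index_col and assemble the index afterwards. This is a minimal sketch, assuming the CSV has a header row naming the columns M, D, Y; the inline csv_text stands in for the real file and is purely hypothetical.

```python
import io
import pandas as pd

# Hypothetical CSV contents standing in for '/Path/to/csv/file'.
csv_text = """M,D,Y,Apples,Oranges
5,6,1990,12,3
5,7,1990,14,4
"""

df = pd.read_csv(io.StringIO(csv_text))

# Build one datetime Series from the three integer columns.
datetimes = pd.to_datetime(dict(year=df['Y'], month=df['M'], day=df['D']))

# Use it as a single index and discard the original columns.
df = df.drop(columns=['M', 'D', 'Y']).set_index(datetimes.rename('Datetimes'))
print(df)
```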
Answer by dolly singh
An even better way is as follows:
import pandas as pd
import datetime
dataset = pd.read_csv('dataset.csv')
date=dataset.apply(lambda x: datetime.date(int(x['Yr']), x['Mo'], x['Dy']),axis=1)
date = pd.to_datetime(date)
dataset = dataset.drop(columns=['Yr', 'Mo', 'Dy'])
dataset.insert(0, 'Date', date)
dataset.head()
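Answer by A.Kot
# zero-pad month and day so the concatenated digits always match '%m%d%Y'
[pd.to_datetime(str(a).zfill(2) + str(b).zfill(2) + str(c), format='%m%d%Y')
 for a, b, c in zip(df.M, df.D, df.Y)]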
Answer by Q-man
Convert the dataframe to strings for easy string concatenation:
df=df.astype(str)
then convert to datetime, specify the format:
df.index=pd.to_datetime(df.Y+df.M.str.zfill(2)+df.D.str.zfill(2),format="%Y%m%d")
which replaces the index rather than creating a new column.
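Note that df.astype(str) turns every column into strings, including the data columns. A variation that converts only the date parts and keeps the result as a new column might look like the sketch below; the Datetimes name and the sample values are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({'M': [5, 11], 'D': [6, 2], 'Y': [1990, 1990],
                   'Apples': [12, 14]})

# Convert only the date parts to strings so Apples stays an integer column.
parts = df[['Y', 'M', 'D']].astype(str)

# Zero-pad month and day so that 1990, 11, 2 becomes '19901102'.
text = parts['Y'] + parts['M'].str.zfill(2) + parts['D'].str.zfill(2)
df['Datetimes'] = pd.to_datetime(text, format='%Y%m%d')
print(df)
```

The padding guarantees that every string has the eight digits that '%Y%m%d' describes, whatever the month and day values are.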
Answer by Dan
Let's assume you've got a dictionary foo with each column of dates in parallel. If so, here's your one-liner:
>>> from datetime import datetime
>>> import pandas as pd
>>> foo = {"M": [1,2,3], "D":[30,28,21], "Y":[1980,1981,1982]}
>>>
>>> df = pd.DataFrame({"Datetime": [datetime(y,m,d) for y,m,d in zip(foo["Y"],foo["M"],foo["D"])]})
The real guts of it are this bit:
>>> [datetime(y,m,d) for y,m,d in zip(foo["Y"],foo["M"],foo["D"])]
[datetime.datetime(1980, 1, 30, 0, 0), datetime.datetime(1981, 2, 28, 0, 0), datetime.datetime(1982, 3, 21, 0, 0)]
This is the sort of thing zip was made for. It takes parallel lists and turns them into tuples. The list comprehension then unpacks each tuple (the "for y,m,d in" bit) and feeds the values into the datetime constructor.
pandas seems happy with the datetime objects.