Pandas: create timestamp from 3 columns: Month, Day, Hour

Notice: this page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/26137946/

Time: 2020-09-13 22:32:00  Source: igfitidea

Tags: python, datetime, pandas

Asked by Julien Marrec

I'm using Python 2.7, pandas 0.14.1-2, numpy 1.8.1-1. I have to use Python 2.7 because I'm coupling it with something that doesn't work on Python 3.

I'm trying to analyze a CSV file that outputs Month, Day and Hour in separate columns, and looks something like the following:

Month  Day  Hour  Value
1      1    1     105
1      1    2     30
1      1    3     85
1      1    4     52
1      1    5     65

I basically want to create a timestamp from those columns, using "2005" as the year, and set this new timestamp column as the index. I've read a lot of similar questions (here and here), but they all rely on doing the parsing during read_csv(). I don't have a year column, so I don't think this applies to me (aside from loading the dataframe, inserting the column, writing it out, and redoing read_csv(), which seems convoluted).

After loading the dataframe, I insert a Year column at position 0:

df.insert(0, "Year", 2005)

So now I've got

Year  Month  Day  Hour  Value
2005  1      1    1     105
2005  1      1    2     30
2005  1      1    3     85
2005  1      1    4     52
2005  1      1    5     65

df.dtypes tells me that all columns are int64.

Then I've tried doing this:

df['Datetime'] = pd.to_datetime(df.Year*1000000 + df.Month*10000 + df.Day+100 + df.Hour, format="%Y%M%d%H")

But I'm getting "TypeError: 'long' object is unsliceable"

On the other hand, the following runs without errors.

df['Datetime'] = pd.to_datetime(df.Year*10000 + df.Month*100 + df.Day, format="%Y%M%d")

As 2.7 doesn't like the %Y%M%d%H, as pointed out by @EdChum, I've tried doing it in two steps: creating a datetime without hours, and adding the hours afterwards. But the output is not what I wanted:

In [1]: # Do it without hours first (otherwise doesn't work in Python 2.7)
df['Datetime'] = pd.to_datetime(df.Year*10000 + df.Month*100 + df.Day, format="%Y%M%d")

In [2]: df['Datetime']
Out [2]:
0    2005-01-01 00:01:00
1    2005-01-01 00:01:00
...
13   2005-01-01 00:01:00
14   2005-01-01 00:01:00
...
8745   2005-01-31 00:12:00
8746   2005-01-31 00:12:00
...
8758   2005-01-31 00:12:00
8759   2005-01-31 00:12:00

8758 is supposed to be 2005-12-31 for example. What is wrong with that?

Once I resolve that, I'll be able to re-add the hours:

In [3]: # Then add the hours
df['Datetime'] = df['Datetime'] + pd.to_timedelta(df['Hour'], unit="h")
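
A note on the unexpected output above, offered as a sketch rather than as part of the original thread: in strptime-style format codes, %M means minutes and %m means month. That would explain why the month values show up in the minutes position (00:01:00, 00:12:00) while every date stays in January. The first attempt also adds 100 to the day (df.Day+100) where multiplying by 100 was presumably intended. The sample rows below are made up for illustration, and the hours are assumed to lie in 0-23, which is the range %H accepts. Converting the integers to strings before parsing is a precaution; whether that alone avoids the TypeError on pandas 0.14 is not confirmed here.

```python
import pandas as pd

# Made-up sample rows standing in for the question's data
df = pd.DataFrame({
    "Year": [2005, 2005, 2005],
    "Month": [1, 6, 12],
    "Day": [1, 15, 31],
    "Hour": [1, 12, 23],
})

# Build YYYYMMDDHH as an integer: note Day * 100, not Day + 100
stamp = df.Year * 1000000 + df.Month * 10000 + df.Day * 100 + df.Hour

# %m is the month (lowercase); %M would be parsed as minutes
df["Datetime"] = pd.to_datetime(stamp.astype(str), format="%Y%m%d%H")
print(df["Datetime"])
```

With the lowercase %m, the last sample row parses as 31 December rather than 31 January.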

Answered by Joop

Letting the pandas parser do the heavy lifting (as in the first answer) is obviously the best option if you are getting the data from a CSV. If you are getting or calculating the numbers in a different way, try:

import datetime
df['DateTime'] = df[['Year', 'Month', 'Day', 'Hour']].apply(lambda s: datetime.datetime(*s), axis=1)

I find that this is still easy to read and very flexible.
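
For reference, here is a minimal end-to-end sketch of this row-wise approach. The sample rows and the helper name to_stamp are made up for illustration, and the explicit int() cast is a precaution of mine rather than something the answer requires.

```python
import datetime
import pandas as pd

# Made-up sample rows in the shape described in the question
df = pd.DataFrame({
    "Month": [1, 1, 12],
    "Day": [1, 1, 31],
    "Hour": [1, 2, 23],
    "Value": [105, 30, 65],
})
df.insert(0, "Year", 2005)  # constant year, as in the question

def to_stamp(row):
    # Hypothetical helper: cast to plain int so datetime() gets built-in integers
    return datetime.datetime(*(int(v) for v in row))

# Build one datetime per row, then use it as the index
df["DateTime"] = df[["Year", "Month", "Day", "Hour"]].apply(to_stamp, axis=1)
df = df.set_index("DateTime")
print(df[["Value"]])
```

Because apply(..., axis=1) calls the function once per row, this is likely to be slow on large frames, but the column order in the selection maps directly onto the datetime() arguments, which keeps it easy to follow.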

Answered by jfs

You could parse the input text in your question using pandas.read_csv():

#!/usr/bin/env python
from datetime import datetime
import pandas as pd

print(pd.read_csv(
    'input.txt', sep=r'\s+', parse_dates=[[0, 1, 2]],
    date_parser=lambda *columns: datetime(2005, *map(int, columns)),
    index_col=0))

Output

                     Value
Month_Day_Hour            
2005-01-01 01:00:00    105
2005-01-01 02:00:00     30
2005-01-01 03:00:00     85
2005-01-01 04:00:00     52
2005-01-01 05:00:00     65
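
As far as I know, recent pandas releases deprecate the date_parser argument and the nested-list form of parse_dates used above. A possible alternative, sketched here under the assumption of a pandas version much newer than the 0.14 in the question, is to read the columns as plain integers and pass a frame with year, month, day and hour columns to pd.to_datetime. The in-memory text stands in for input.txt, and the variable name parts is my own.

```python
import io
import pandas as pd

# In-memory stand-in for input.txt, using the sample from the question
text = """Month Day Hour Value
1 1 1 105
1 1 2 30
1 1 3 85
1 1 4 52
1 1 5 65
"""

df = pd.read_csv(io.StringIO(text), sep=r"\s+")

# to_datetime can assemble timestamps from year/month/day/hour columns
parts = df[["Month", "Day", "Hour"]].rename(columns=str.lower).assign(year=2005)
df.index = pd.DatetimeIndex(pd.to_datetime(parts))

# Drop the component columns now that the index carries the timestamp
df = df.drop(columns=["Month", "Day", "Hour"])
print(df)
```

This keeps the parsing vectorized and avoids writing the year into the file just to read it back.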