Pandas: Create a timestamp from 3 columns: Month, Day, Hour

Question

Asked by Julien Marrec

I'm using Python 2.7, pandas 0.14.1-2, numpy 1.8.1-1. I have to use Python 2.7 because I'm coupling it with something that doesn't work on Python 3.

I'm trying to analyze a CSV file that has Month, Day and Hour in separate columns, and looks something like the following:

Month  Day  Hour  Value
    1    1     1    105
    1    1     2     30
    1    1     3     85
    1    1     4     52
    1    1     5     65

I basically want to create a timestamp from those columns, use "2005" as the year, and set this new timestamp column as the index. I've read a lot of similar questions (here and here), but they all rely on doing this during read_csv(). I don't have a year column, so I don't think this applies to me (aside from loading the dataframe, inserting a column, writing it out, and redoing read_csv, which seems convoluted).

After loading the dataframe, I insert a Year column at position 0:

df.insert(0, "Year", 2005)

So now I've got

Year  Month  Day  Hour  Value
2005      1    1     1    105
2005      1    1     2     30
2005      1    1     3     85
2005      1    1     4     52
2005      1    1     5     65

df.dtypes tells me that all columns are int64.

Then I've tried doing this:

df['Datetime'] = pd.to_datetime(df.Year*1000000 + df.Month*10000 + df.Day+100 + df.Hour, format="%Y%M%d%H")

But I'm getting "TypeError: 'long' object is unsliceable"

On the other hand, the following runs without errors.

df['Datetime'] = pd.to_datetime(df.Year*10000 + df.Month*100 + df.Day, format="%Y%M%d")

As pointed out by @EdChum, Python 2.7 doesn't like the format %Y%M%d%H, so I've tried doing it in two steps: creating a datetime without hours, then adding the hours afterwards. But the output is not what I wanted:

In [1]: # Do it without hours first (otherwise doesn't work in Python 2.7)
df['Datetime'] = pd.to_datetime(df.Year*10000 + df.Month*100 + df.Day, format="%Y%M%d")

In [2]: df['Datetime']
Out [2]:
0    2005-01-01 00:01:00
1    2005-01-01 00:01:00
...
13   2005-01-01 00:01:00
14   2005-01-01 00:01:00
...
8745   2005-01-31 00:12:00
8746   2005-01-31 00:12:00
...
8758   2005-01-31 00:12:00
8759   2005-01-31 00:12:00

Row 8758, for example, is supposed to be 2005-12-31. What is going wrong here?

Once I resolve that, I'll be able to re-add the hours:

In [3]: # Then add the hours
df['Datetime'] = df['Datetime'] + pd.to_timedelta(df['Hour'], unit="h")

Answer 1

Answered by Joop

Letting the pandas parser do the heavy lifting (as in first answer) is obviously the best option if you are getting it from csv. If you are getting or calculating numbers in a different way try:

import datetime
df['DateTime'] = df[['Year', 'Month', 'Day', 'Hour']].apply(lambda s: datetime.datetime(*s), axis=1)

I find that this is still easy to read and very flexible.
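
For reference, the row-wise approach above can be written as a self-contained sketch. The sample rows are hypothetical (they only mirror the question's column layout), and the explicit int() cast is an added precaution so that datetime.datetime receives plain Python integers rather than NumPy ones; it is not part of the original answer.

```python
import datetime

import pandas as pd

# Hypothetical sample data in the question's layout (Month, Day, Hour, Value).
df = pd.DataFrame({
    "Month": [1, 1, 12],
    "Day": [1, 1, 31],
    "Hour": [1, 2, 23],
    "Value": [105, 30, 65],
})
df.insert(0, "Year", 2005)  # constant year, as in the question

# Build one datetime per row from the four integer columns.
df["DateTime"] = df[["Year", "Month", "Day", "Hour"]].apply(
    lambda row: datetime.datetime(*(int(v) for v in row)), axis=1)

# Use the new column as the index.
df = df.set_index("DateTime")
print(df)
```

Because it calls a Python function once per row, this is likely to be slower than a vectorised conversion on the 8760-row frame from the question, but it keeps the logic easy to follow.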

Answer 2

Answered by jfs

You could parse the input text in your question using pandas.read_csv():

#!/usr/bin/env python
from datetime import datetime
import pandas as pd

print(pd.read_csv(
    'input.txt', sep=r'\s+', parse_dates=[[0, 1, 2]],
    date_parser=lambda *columns: datetime(2005, *map(int, columns)),
    index_col=0))

Output

                     Value
Month_Day_Hour            
2005-01-01 01:00:00    105
2005-01-01 02:00:00     30
2005-01-01 03:00:00     85
2005-01-01 04:00:00     52
2005-01-01 05:00:00     65
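
If the data is already in a dataframe, the two-step idea from the question may also work once the format string is adjusted. In strptime-style format codes, %M means minute and %m means month, which would explain the 00:01:00 and 00:12:00 values in the question's output. The sketch below uses hypothetical sample rows, and its second route (passing a frame of year/month/day/hour columns to pd.to_datetime) assumes a pandas version considerably newer than the 0.14 mentioned in the question.

```python
import pandas as pd

# Hypothetical sample data in the question's layout.
df = pd.DataFrame({
    "Month": [1, 1, 12],
    "Day": [1, 1, 31],
    "Hour": [1, 2, 23],
    "Value": [105, 30, 65],
})

# Route 1: two steps, using %m (month) rather than %M (minute).
ymd = (2005 * 10000 + df["Month"] * 100 + df["Day"]).astype(str)
two_step = (pd.to_datetime(ymd, format="%Y%m%d")
            + pd.to_timedelta(df["Hour"], unit="h"))

# Route 2: assemble timestamps from named component columns
# (assumes a newer pandas than the question's 0.14).
parts = pd.DataFrame({
    "year": 2005,
    "month": df["Month"],
    "day": df["Day"],
    "hour": df["Hour"],
})
assembled = pd.to_datetime(parts)

# Either result can become the index; here the two-step one is used.
df.index = pd.DatetimeIndex(two_step, name="Datetime")
print(df)
```

Adding the hours as a timedelta, rather than parsing them with %H, also means an Hour value of 24 would roll over to midnight of the next day instead of failing to parse.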
