Notice: this page reproduces a popular StackOverflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/15432659/

Time: 2020-09-13 20:42:36  Source: igfitidea

How to rearrange a python pandas dataframe?

Tags: python, row, pandas, sequence, dataframe

Asked by Markus W

I have the following dataframe, read from a .csv file, with the "Date" column as the index. Each row is a day, and the columns hold the values for each hour of that day.

> Date           h1 h2  h3  h4 ... h24
> 14.03.2013    60  50  52  49 ... 73

I would like to rearrange it like this, so that there is one index column with the date/time and one column with the values in sequence:

> Date/Time            Value
> 14.03.2013 00:00:00  60
> 14.03.2013 01:00:00  50
> 14.03.2013 02:00:00  52
> 14.03.2013 03:00:00  49
> ...
> 14.03.2013 23:00:00  73

I tried doing this with two loops over the dataframe. Is there an easier way to do this in pandas?

Answered by DSM

I'm not the best at date manipulations, but maybe something like this:

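import pandas as pd
from datetime import timedelta

df = pd.read_csv("hourmelt.csv", sep=r"\s+")

# Turn the hour columns into rows: one row per (Date, hour) pair.
df = pd.melt(df, id_vars=["Date"])
df = df.rename(columns={'variable': 'hour'})
# "h1" -> 0, "h2" -> 1, ..., "h24" -> 23
df['hour'] = df['hour'].apply(lambda x: int(x.lstrip('h'))-1)

# Build a full timestamp from the day-first date string plus the hour offset.
combined = df.apply(lambda x:
                    pd.to_datetime(x['Date'], dayfirst=True) +
                    timedelta(hours=int(x['hour'])), axis=1)

df['Date'] = combined
del df['hour']

# DataFrame.sort was removed from pandas; sort_values is its replacement.
df = df.sort_values("Date")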


Some explanation follows.

Starting from

>>> import pandas as pd
>>> from datetime import datetime, timedelta
>>> 
>>> df = pd.read_csv("hourmelt.csv", sep=r"\s+")
>>> df
         Date  h1  h2  h3  h4  h24
0  14.03.2013  60  50  52  49   73
1  14.04.2013   5   6   7   8    9

We can use pd.melt to turn the hour columns into a single column holding those values:

>>> df = pd.melt(df, id_vars=["Date"])
>>> df = df.rename(columns={'variable': 'hour'})
>>> df
         Date hour  value
0  14.03.2013   h1     60
1  14.04.2013   h1      5
2  14.03.2013   h2     50
3  14.04.2013   h2      6
4  14.03.2013   h3     52
5  14.04.2013   h3      7
6  14.03.2013   h4     49
7  14.04.2013   h4      8
8  14.03.2013  h24     73
9  14.04.2013  h24      9
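
The rename step can be folded into the melt call itself. A minimal sketch, assuming a small in-memory frame in place of hourmelt.csv (the sample values and the name long_df are illustrative, not from the original answer):

```python
import pandas as pd

# Toy frame mirroring the sample data (only a few hour columns shown).
df = pd.DataFrame({
    "Date": ["14.03.2013", "14.04.2013"],
    "h1": [60, 5],
    "h2": [50, 6],
    "h24": [73, 9],
})

# var_name / value_name label the melted columns directly, so no rename is needed.
long_df = pd.melt(df, id_vars=["Date"], var_name="hour", value_name="value")
print(long_df)
```

The result has the same Date / hour / value layout as the output above.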

Get rid of those h prefixes (and subtract 1, so that h1 becomes hour 0):

>>> df['hour'] = df['hour'].apply(lambda x: int(x.lstrip('h'))-1)
>>> df
         Date  hour  value
0  14.03.2013     0     60
1  14.04.2013     0      5
2  14.03.2013     1     50
3  14.04.2013     1      6
4  14.03.2013     2     52
5  14.04.2013     2      7
6  14.03.2013     3     49
7  14.04.2013     3      8
8  14.03.2013    23     73
9  14.04.2013    23      9
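
The same conversion can probably be done without a Python-level lambda by using the vectorized string methods. A small sketch, assuming the labels always have the form "h" followed by a number (the names hours and zero_based are made up for this example):

```python
import pandas as pd

hours = pd.Series(["h1", "h2", "h3", "h4", "h24"], name="hour")

# Strip the leading "h", convert to int, and shift so that h1 maps to hour 0.
zero_based = hours.str.lstrip("h").astype(int) - 1
print(zero_based.tolist())
```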

Combine the two columns as a date:

>>> combined = df.apply(lambda x: pd.to_datetime(x['Date'], dayfirst=True) + timedelta(hours=int(x['hour'])), axis=1)
>>> combined
0    2013-03-14 00:00:00
1    2013-04-14 00:00:00
2    2013-03-14 01:00:00
3    2013-04-14 01:00:00
4    2013-03-14 02:00:00
5    2013-04-14 02:00:00
6    2013-03-14 03:00:00
7    2013-04-14 03:00:00
8    2013-03-14 23:00:00
9    2013-04-14 23:00:00
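
The row-wise apply works, but on larger frames a column-wise version is usually faster. This is a sketch of an alternative, not the answer's own method: it parses the whole Date column once and adds the hours as timedeltas. The explicit format string assumes every date looks like 14.03.2013.

```python
import pandas as pd

df = pd.DataFrame({
    "Date": ["14.03.2013", "14.04.2013", "14.03.2013", "14.04.2013"],
    "hour": [0, 0, 23, 23],
    "value": [60, 5, 73, 9],
})

# Parse all dates at once with an explicit day.month.year format,
# then add the hour offsets as timedeltas.
dates = pd.to_datetime(df["Date"], format="%d.%m.%Y")
combined = dates + pd.to_timedelta(df["hour"], unit="h")
print(combined)
```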

Reassemble and clean up:

>>> df['Date'] = combined
>>> del df['hour']
>>> df = df.sort_values("Date")
>>> df
                 Date  value
0 2013-03-14 00:00:00     60
2 2013-03-14 01:00:00     50
4 2013-03-14 02:00:00     52
6 2013-03-14 03:00:00     49
8 2013-03-14 23:00:00     73
1 2013-04-14 00:00:00      5
3 2013-04-14 01:00:00      6
5 2013-04-14 02:00:00      7
7 2013-04-14 03:00:00      8
9 2013-04-14 23:00:00      9
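
The question asks for the date/time as the index rather than as a regular column. One way to get there, sketched with a few hand-written rows shaped like the frame above (the name result and the "Date/Time" index label are choices made for this example):

```python
import pandas as pd

df = pd.DataFrame({
    "Date": pd.to_datetime(["2013-03-14 00:00", "2013-04-14 00:00",
                            "2013-03-14 01:00", "2013-04-14 01:00"]),
    "value": [60, 5, 50, 6],
})

# Sort chronologically, then move the timestamps into the index.
result = df.sort_values("Date").set_index("Date")["value"]
result.index.name = "Date/Time"
print(result)
```

Selecting the single value column yields a Series; dropping the ["value"] selection keeps a one-column DataFrame instead.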

Answered by Dale Jung

You could always grab the hourly data array and flatten it, then generate a new DatetimeIndex with hourly frequency to go with it.

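# Assumes df has a parsed DatetimeIndex and exactly 24 hourly columns.
df = df.asfreq('D')  # inserts NaN rows for any missing days
hourly_data = df.values
# Newer pandas versions prefer the lowercase alias "h" over "H".
new_ind = pd.date_range(start=df.index[0], freq="H", periods=len(df) * 24)
# Create the Series; flatten() is row-major, so each day's hours stay in order.
s = pd.Series(hourly_data.flatten(), index=new_ind)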

I'm assuming that read_csv is parsing the 'Date' column and making it the index. We change to a frequency of 'D' so that new_ind lines up correctly if you have missing days. The missing days will be filled with np.nan, which you can drop with s.dropna().
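
To see the missing-day handling in action, here is a self-contained sketch of the same idea. The data is invented (two days with 2013-03-15 missing between them), and the names hour_cols and daily are assumptions for this example. The hourly step is passed as a pd.Timedelta rather than a frequency string, so the sketch does not depend on which alias spelling the installed pandas accepts.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two days with a gap in between (2013-03-15 is missing).
hour_cols = [f"h{i}" for i in range(1, 25)]
df = pd.DataFrame(
    [np.arange(24), np.arange(24) + 100],
    index=pd.to_datetime(["2013-03-14", "2013-03-16"]),
    columns=hour_cols,
)

# Reindex to daily frequency; the missing day becomes a row of NaN.
daily = df.asfreq("D")

# One timestamp per hour, covering every day in the now gap-free index.
new_ind = pd.date_range(start=daily.index[0], periods=len(daily) * 24,
                        freq=pd.Timedelta(hours=1))

# Row-major flatten keeps each day's 24 values together, in order.
s = pd.Series(daily.to_numpy().flatten(), index=new_ind).dropna()
print(s.head())
```

Without the asfreq step, the 48 values would be laid over 48 consecutive hours, and the second day's data would land on 2013-03-15 instead of 2013-03-16.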
