Pandas - Split dataframe into multiple dataframes based on dates?
Original source: http://stackoverflow.com/questions/35907421/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Asked by Alex F
I have a dataframe with multiple columns along with a date column. The date format is 12/31/15 and I have set it as a datetime object.
I set the datetime column as the index and want to perform a regression calculation for each month of the dataframe.
I believe the methodology to do this would be to split the dataframe into multiple dataframes based on month, store into a list of dataframes, then perform regression on each dataframe in the list.
I have used groupby which successfully split the dataframe by month, but am unsure how to correctly convert each group in the groupby object into a dataframe to be able to run my regression function on it.
Does anyone know how to split a dataframe into multiple dataframes based on date, or a better approach to my problem?
Here is the code I've written so far:
import pandas as pd
import numpy as np
import statsmodels.api as sm
from patsy import dmatrices

df = pd.read_csv('data.csv')
df['date'] = pd.to_datetime(df['date'], format='%Y%m%d')
df = df.set_index('date')

# Group dataframe on index by month and year
# Groupby works, but dmatrices does not
for df_group in df.groupby(pd.TimeGrouper("M")):
    y, X = dmatrices('value1 ~ value2 + value3', data=df_group,
                     return_type='dataframe')
Answered by daedalus
If you must loop, you need to unpack the key and the dataframe when you iterate over a groupby object:
import pandas as pd
import numpy as np
import statsmodels.api as sm
from patsy import dmatrices
df = pd.read_csv('data.csv')
# Adjust format to match the file; dates written like 12/31/15 would need '%m/%d/%y'
df['date'] = pd.to_datetime(df['date'], format='%Y%m%d')
df = df.set_index('date')
Note the use of group_name here:
# Each iteration yields a (key, DataFrame) pair, so unpack both
for group_name, df_group in df.groupby(pd.Grouper(freq='M')):
    y, X = dmatrices('value1 ~ value2 + value3', data=df_group,
                     return_type='dataframe')
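The unpacking above also gives a direct way to build the collection of per-month dataframes the question asks for. The following is only a minimal sketch under stated assumptions: the sample data is made up in place of data.csv, the months are keyed with df.index.to_period("M") rather than pd.Grouper, and a plain NumPy least-squares fit (the hypothetical helper fit_ols) stands in for the patsy/statsmodels regression so the example runs on its own.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data standing in for data.csv: 90 daily rows
# covering January through March 2015.
rng = np.random.default_rng(0)
dates = pd.date_range("2015-01-01", periods=90, freq="D")
df = pd.DataFrame(
    {"value2": rng.normal(size=90), "value3": rng.normal(size=90)},
    index=dates,
)
# value1 is an exact linear function, so the fit is easy to check.
df["value1"] = 1.0 + 2.0 * df["value2"] - 0.5 * df["value3"]

# Iterating a groupby yields (key, DataFrame) pairs; keep both in a dict.
monthly_frames = {
    str(month): frame
    for month, frame in df.groupby(df.index.to_period("M"))
}


def fit_ols(frame):
    """Least-squares fit of value1 on an intercept, value2 and value3."""
    X = np.column_stack(
        [np.ones(len(frame)), frame[["value2", "value3"]].to_numpy()]
    )
    coef, *_ = np.linalg.lstsq(X, frame["value1"].to_numpy(), rcond=None)
    return pd.Series(coef, index=["Intercept", "value2", "value3"])


# One coefficient Series per month, keyed the same way as the frames.
coefs = {month: fit_ols(frame) for month, frame in monthly_frames.items()}
```

Each value in monthly_frames is an ordinary DataFrame, so it can be passed to dmatrices or any other function in place of fit_ols.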
If you want to avoid iteration, do have a look at the notebook in Paul H's gist (see his comment), but a simple example of using apply would be:
def do_regression(df_group, ret='outcome'):
    """Apply the function to each group in the data and return one result."""
    y, X = dmatrices('value1 ~ value2 + value3',
                     data=df_group,
                     return_type='dataframe')
    if ret == 'outcome':
        return y
    else:
        return X

outcome = df.groupby(pd.Grouper(freq='M')).apply(do_regression, ret='outcome')
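The same apply pattern can return a fitted result per month instead of the design matrices. This is a rough sketch, not the answer's own method: the data is invented, monthly_coefficients is a hypothetical helper, and NumPy's lstsq is swapped in for statsmodels so the block has no extra dependencies.

```python
import numpy as np
import pandas as pd

# Hypothetical data in place of data.csv; value1 is an exact linear
# function of value2 and value3 so the coefficients are easy to verify.
rng = np.random.default_rng(1)
idx = pd.date_range("2015-01-01", "2015-04-30", freq="D")
df = pd.DataFrame(
    {"value2": rng.normal(size=len(idx)), "value3": rng.normal(size=len(idx))},
    index=idx,
)
df["value1"] = 3.0 - 1.0 * df["value2"] + 0.25 * df["value3"]


def monthly_coefficients(df_group):
    """Return the least-squares coefficients for one month's rows."""
    X = np.column_stack(
        [np.ones(len(df_group)), df_group[["value2", "value3"]].to_numpy()]
    )
    beta, *_ = np.linalg.lstsq(X, df_group["value1"].to_numpy(), rcond=None)
    return pd.Series(beta, index=["Intercept", "value2", "value3"])


# apply() calls the function once per month; because each call returns a
# Series with the same labels, the results are stacked into a DataFrame
# with one row per month.
coef_table = df.groupby(df.index.to_period("M")).apply(monthly_coefficients)
```

Returning a Series with fixed labels from the applied function is what produces a tidy table here; returning whole dataframes, as do_regression does, instead gives a result indexed by both month and date.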
Answered by Pjl
This splits the data by year and saves each year's rows to a separate CSV file.
import pandas as pd
import dateutil.parser

dfile = 'rg_unificado.csv'
df = pd.read_csv(dfile, sep='|', quotechar='"', encoding='latin-1')
df['FECHA'] = df['FECHA'].apply(lambda x: dateutil.parser.parse(x))

# Offset aliases: http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases
# Convert each date to its yearly period
per = df['FECHA'].dt.to_period("Y")

# Group by that period (pass the Series itself so each key is a single Period, not a tuple)
agg = df.groupby(per)

for year, group in agg:
    # Simply save each year's rows to its own file
    datep = str(year).replace('-', '')
    filename = '%s_%s.csv' % (dfile.replace('.csv', ''), datep)
    group.to_csv(filename, sep='|', quotechar='"', encoding='latin-1', index=False, header=True)
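For a version that can be tried without the original file, here is a small sketch of the same per-year split. The rows and the valor column are made up, pd.to_datetime replaces dateutil, the year is taken with .dt.year instead of to_period, and the files go to a temporary directory; all of these are assumptions for illustration only.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical rows standing in for rg_unificado.csv.
df = pd.DataFrame(
    {
        "FECHA": ["2014-03-01", "2014-11-15", "2015-01-20", "2016-07-04"],
        "valor": [10, 20, 30, 40],
    }
)
df["FECHA"] = pd.to_datetime(df["FECHA"])

# Group on the calendar year; each group is an ordinary DataFrame.
frames_by_year = {
    int(year): group for year, group in df.groupby(df["FECHA"].dt.year)
}

# Write one pipe-separated file per year into a temporary directory.
out_dir = Path(tempfile.mkdtemp())
written = []
for year, group in frames_by_year.items():
    path = out_dir / f"rg_unificado_{year}.csv"
    group.to_csv(path, sep="|", index=False)
    written.append(path.name)
```

Keeping the groups in a dict first means the per-year dataframes stay available in memory after the files are written.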