Expanding pandas data frame with date range in columns

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/42151886/
Asked by claybot
I have a pandas dataframe with dates and strings similar to this:
Start       End         Note  Item
2016-10-22  2016-11-05  Z     A
2017-02-11  2017-02-25  W     B
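For readers who want to reproduce the examples below, a minimal sketch of how this input frame could be built (column names and values taken from the question):

import pandas as pd

df = pd.DataFrame({'Start': pd.to_datetime(['2016-10-22', '2017-02-11']),
                   'End':   pd.to_datetime(['2016-11-05', '2017-02-25']),
                   'Note':  ['Z', 'W'],
                   'Item':  ['A', 'B']})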
I need to expand/transform it to the below, filling in weeks (W-SAT) in between the Start and End columns and forward filling the data in the Note and Item columns:
Start       Note  Item
2016-10-22  Z     A
2016-10-29  Z     A
2016-11-05  Z     A
2017-02-11  W     B
2017-02-18  W     B
2017-02-25  W     B
What's the best way to do this with pandas? Some sort of multi-index apply?
Accepted answer by Ted Petrou
You can iterate over each row, create a new dataframe for it, and then concatenate them together:
import pandas as pd

pd.concat([pd.DataFrame({'Start': pd.date_range(row.Start, row.End, freq='W-SAT'),
                         'Note': row.Note,
                         'Item': row.Item}, columns=['Start', 'Note', 'Item'])
           for i, row in df.iterrows()], ignore_index=True)
       Start Note Item
0 2016-10-22    Z    A
1 2016-10-29    Z    A
2 2016-11-05    Z    A
3 2017-02-11    W    B
4 2017-02-18    W    B
5 2017-02-25    W    B
Answer by Gen
You don't need iteration at all.
df_start_end = df.melt(id_vars=['Note', 'Item'], value_name='date')
# use the question's W-SAT anchor so the resampled dates land on Saturdays
df = (df_start_end.groupby('Note')
                  .apply(lambda x: x.set_index('date').resample('W-SAT').ffill())
                  .drop(columns=['Note', 'variable'])
                  .reset_index())
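For reference, with the sample frame built above, the intermediate result of the melt step looks roughly like this (a sketch; the exact row order may differ):

  Note Item variable       date
0    Z    A    Start 2016-10-22
1    W    B    Start 2017-02-11
2    Z    A      End 2016-11-05
3    W    B      End 2017-02-25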
Answer by jwdink
If the number of unique values of df['End'] - df['Start'] is not too large, but the number of rows in your dataset is large, then the following function will be much faster than looping over your dataset:
import numpy as np
import pandas as pd

def date_expander(dataframe: pd.DataFrame,
                  start_dt_colname: str,
                  end_dt_colname: str,
                  time_unit: str,
                  new_colname: str,
                  end_inclusive: bool) -> pd.DataFrame:
    td = pd.Timedelta(1, time_unit)
    # add a timediff column:
    dataframe['_dt_diff'] = dataframe[end_dt_colname] - dataframe[start_dt_colname]
    # get the maximum timediff:
    max_diff = int((dataframe['_dt_diff'] / td).max())
    # for each possible timediff, get the intermediate time-differences:
    df_diffs = pd.concat([pd.DataFrame({'_to_add': np.arange(0, dt_diff + end_inclusive) * td}).assign(_dt_diff=dt_diff * td)
                          for dt_diff in range(max_diff + 1)])
    # join to the original dataframe
    data_expanded = dataframe.merge(df_diffs, on='_dt_diff')
    # the new dt column is just start plus the intermediate diffs:
    data_expanded[new_colname] = data_expanded[start_dt_colname] + data_expanded['_to_add']
    # remove start-end cols, as well as temp cols used for calculations:
    to_drop = [start_dt_colname, end_dt_colname, '_to_add', '_dt_diff']
    if new_colname in to_drop:
        to_drop.remove(new_colname)
    data_expanded = data_expanded.drop(columns=to_drop)
    # clean up the temp column so the caller's dataframe is not left modified:
    del dataframe['_dt_diff']
    return data_expanded
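A hypothetical call for the question's data, assuming the sample df constructed above (the parameter choices here are mine: time_unit='W' steps in 7-day increments, end_inclusive=True keeps the End date itself, and new_colname='Start' overwrites the Start column with the expanded dates):

expanded = date_expander(df, start_dt_colname='Start', end_dt_colname='End',
                         time_unit='W', new_colname='Start', end_inclusive=True)
print(expanded.sort_values('Start'))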