Expanding pandas data frame with date range in columns

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/42151886/
Asked by claybot
I have a pandas dataframe with dates and strings similar to this:
Start       End         Note  Item
2016-10-22  2016-11-05  Z     A
2017-02-11  2017-02-25  W     B
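For readers who want to reproduce the examples below, a minimal sketch of how this input frame could be built (column names and values taken from the question):

import pandas as pd

df = pd.DataFrame({'Start': pd.to_datetime(['2016-10-22', '2017-02-11']),
                   'End':   pd.to_datetime(['2016-11-05', '2017-02-25']),
                   'Note':  ['Z', 'W'],
                   'Item':  ['A', 'B']})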
I need to expand/transform it to the below, filling in weeks (W-SAT) in between the Start and End columns and forward filling the data in the Note and Item columns:
Start       Note  Item
2016-10-22  Z     A
2016-10-29  Z     A
2016-11-05  Z     A
2017-02-11  W     B
2017-02-18  W     B
2017-02-25  W     B
What's the best way to do this with pandas? Some sort of multi-index apply?
Accepted answer by Ted Petrou
You can iterate over each row, create a new dataframe for it, and then concatenate them together:
import pandas as pd

pd.concat([pd.DataFrame({'Start': pd.date_range(row.Start, row.End, freq='W-SAT'),
                         'Note': row.Note,
                         'Item': row.Item}, columns=['Start', 'Note', 'Item'])
           for i, row in df.iterrows()], ignore_index=True)
       Start Note Item
0 2016-10-22    Z    A
1 2016-10-29    Z    A
2 2016-11-05    Z    A
3 2017-02-11    W    B
4 2017-02-18    W    B
5 2017-02-25    W    B
Answer by Gen
You don't need iteration at all.
df_start_end = df.melt(id_vars=['Note', 'Item'], value_name='date')
# use the question's W-SAT anchor so the resampled dates land on Saturdays
df = (df_start_end.groupby('Note')
                  .apply(lambda x: x.set_index('date').resample('W-SAT').ffill())
                  .drop(columns=['Note', 'variable'])
                  .reset_index())
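For reference, with the sample frame built above, the intermediate result of the melt step looks roughly like this (a sketch; the exact row order may differ):

  Note Item variable       date
0    Z    A    Start 2016-10-22
1    W    B    Start 2017-02-11
2    Z    A      End 2016-11-05
3    W    B      End 2017-02-25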
Answer by jwdink
If the number of unique values of df['End'] - df['Start'] is not too large, but the number of rows in your dataset is large, then the following function will be much faster than looping over your dataset:
import numpy as np
import pandas as pd

def date_expander(dataframe: pd.DataFrame,
                  start_dt_colname: str,
                  end_dt_colname: str,
                  time_unit: str,
                  new_colname: str,
                  end_inclusive: bool) -> pd.DataFrame:
    td = pd.Timedelta(1, time_unit)
    # add a timediff column:
    dataframe['_dt_diff'] = dataframe[end_dt_colname] - dataframe[start_dt_colname]
    # get the maximum timediff:
    max_diff = int((dataframe['_dt_diff'] / td).max())
    # for each possible timediff, get the intermediate time-differences:
    df_diffs = pd.concat([pd.DataFrame({'_to_add': np.arange(0, dt_diff + end_inclusive) * td}).assign(_dt_diff=dt_diff * td)
                          for dt_diff in range(max_diff + 1)])
    # join to the original dataframe
    data_expanded = dataframe.merge(df_diffs, on='_dt_diff')
    # the new dt column is just start plus the intermediate diffs:
    data_expanded[new_colname] = data_expanded[start_dt_colname] + data_expanded['_to_add']
    # remove start-end cols, as well as temp cols used for calculations:
    to_drop = [start_dt_colname, end_dt_colname, '_to_add', '_dt_diff']
    if new_colname in to_drop:
        to_drop.remove(new_colname)
    data_expanded = data_expanded.drop(columns=to_drop)
    # clean up the temp column so the caller's dataframe is not left modified:
    del dataframe['_dt_diff']
    return data_expanded
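A hypothetical call for the question's data, assuming the sample df constructed above (the parameter choices here are mine: time_unit='W' steps in 7-day increments, end_inclusive=True keeps the End date itself, and new_colname='Start' overwrites the Start column with the expanded dates):

expanded = date_expander(df, start_dt_colname='Start', end_dt_colname='End',
                         time_unit='W', new_colname='Start', end_inclusive=True)
print(expanded.sort_values('Start'))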