Python: how to create a lagged data structure using a pandas DataFrame

Question

Asked by Mannaggia

Example

import pandas as pd

s=pd.Series([5,4,3,2,1], index=[1,2,3,4,5])
print(s)
1    5
2    4
3    3
4    2
5    1

Is there an efficient way to create a Series that contains, in each row, the lagged values (in this example, up to lag 2)?

3    [3, 4, 5]
4    [2, 3, 4]
5    [1, 2, 3]

This corresponds to s=pd.Series([[3,4,5],[2,3,4],[1,2,3]], index=[3,4,5])

How can this be done efficiently for DataFrames that contain many very long time series?

Thanks

Edited after seeing the answers

OK, in the end I implemented this function:

def buildLaggedFeatures(s, lag=2, dropna=True):
    '''
    Builds a new DataFrame to facilitate regressing over all possible lagged features
    '''
    if isinstance(s, pd.DataFrame):
        new_dict = {}
        for col_name in s:
            new_dict[col_name] = s[col_name]
            # create lagged Series
            for l in range(1, lag + 1):
                new_dict['%s_lag%d' % (col_name, l)] = s[col_name].shift(l)
        res = pd.DataFrame(new_dict, index=s.index)

    elif isinstance(s, pd.Series):
        the_range = range(lag + 1)
        res = pd.concat([s.shift(i) for i in the_range], axis=1)
        res.columns = ['lag_%d' % i for i in the_range]
    else:
        print('Only works for DataFrame or Series')
        return None
    if dropna:
        return res.dropna()
    else:
        return res

It produces the desired output and handles the naming of columns in the resulting DataFrame.

For a Series as input:

s=pd.Series([5,4,3,2,1], index=[1,2,3,4,5])
res=buildLaggedFeatures(s,lag=2,dropna=False)
   lag_0  lag_1  lag_2
1      5    NaN    NaN
2      4      5    NaN
3      3      4      5
4      2      3      4
5      1      2      3

and for a DataFrame as input:

s2=pd.DataFrame({'a':[5,4,3,2,1], 'b':[50,40,30,20,10]},index=[1,2,3,4,5])
res2=buildLaggedFeatures(s2,lag=2,dropna=True)

   a  a_lag1  a_lag2   b  b_lag1  b_lag2
3  3       4       5  30      40      50
4  2       3       4  20      30      40
5  1       2       3  10      20      30

Answer 1

Accepted answer by Andy Hayden

As mentioned, it could be worth looking into the rolling_ functions (Series.rolling and DataFrame.rolling in current pandas), which mean you won't have as many copies around.
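As a minimal sketch of that idea, assuming the end goal is a statistic over each value and its lags rather than the lag columns themselves, a rolling window can compute it directly. The choice of a mean and the name rolling_mean are illustrative assumptions, not part of the original answer.

```python
import pandas as pd

s = pd.Series([5, 4, 3, 2, 1], index=[1, 2, 3, 4, 5])

# Mean over the current value and its two lags (a window of 3 rows).
# The first two rows lack a full window, so they come out as NaN.
rolling_mean = s.rolling(window=3).mean()
```

This skips building the shifted copies entirely, but it only helps when a windowed aggregate is what is needed.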

One solution is to concat shifted Series together to make a DataFrame:

In [11]: pd.concat([s, s.shift(), s.shift(2)], axis=1)
Out[11]: 
   0   1   2
1  5 NaN NaN
2  4   5 NaN
3  3   4   5
4  2   3   4
5  1   2   3

In [12]: pd.concat([s, s.shift(), s.shift(2)], axis=1).dropna()
Out[12]: 
   0  1  2
3  3  4  5
4  2  3  4
5  1  2  3

Working on this will be more efficient than working on lists...
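For very long series, a different technique from the one in this answer may be worth trying: NumPy's sliding_window_view (available in NumPy 1.20 and later) exposes all windows as a view of the original array. The sketch below is an assumption-laden illustration; the lag_0, lag_1, lag_2 column names are hypothetical, and wrapping the view in a DataFrame may still copy the data.

```python
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

s = pd.Series([5, 4, 3, 2, 1], index=[1, 2, 3, 4, 5])
lag = 2

# Each row of `windows` holds lag + 1 consecutive values, oldest first.
windows = sliding_window_view(s.to_numpy(), window_shape=lag + 1)

# Reverse the columns so that column 0 is the current value (lag 0),
# and label rows with the index of the newest value in each window.
lagged = pd.DataFrame(
    windows[:, ::-1],
    index=s.index[lag:],
    columns=["lag_%d" % i for i in range(lag + 1)],
)
```

Rows with incomplete windows are never produced, so no dropna() step is needed.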

Answer 2

Answered by lowtech

You can do the following:

s=pd.Series([5,4,3,2,1], index=[1,2,3,4,5])
res = pd.DataFrame(index = s.index)
for l in range(3):
    res[l] = s.shift(l)
print(res.loc[3:, :].to_numpy())

It produces:

array([[ 3.,  4.,  5.],
       [ 2.,  3.,  4.],
       [ 1.,  2.,  3.]])

which I hope is very close to what you actually want.

Answer 3

Answered by ansonw

Very simple solution using pandas DataFrame:

number_lags = 3
df = pd.DataFrame(data={'vals':[5,4,3,2,1]})
for lag in range(1, number_lags + 1):
    df['lag_' + str(lag)] = df.vals.shift(lag)

# if you want a NumPy array with no null values:
df.dropna().values

Answer 4

Answered by Charlie Brummitt

I like to put the lag numbers in the columns by making the columns a MultiIndex. This way, the names of the columns are retained.

Here's an example of the result:

# Setup
indx = pd.Index([1, 2, 3, 4, 5], name='time')
s=pd.Series(
    [5, 4, 3, 2, 1],
    index=indx,
    name='population')

shift_timeseries_by_lags(pd.DataFrame(s), [0, 1, 2])

Result: a MultiIndex DataFrame with two column labels: the original one ("population") and a new one ("lag"):

Solution: As in the accepted solution, we use DataFrame.shift and then pandas.concat.

def shift_timeseries_by_lags(df, lags, lag_label='lag'):
    return pd.concat([
        shift_timeseries_and_create_multiindex_column(df, lag,
                                                      lag_label=lag_label)
        for lag in lags], axis=1)

def shift_timeseries_and_create_multiindex_column(
        dataframe, lag, lag_label='lag'):
    return (dataframe.shift(lag)
                     .pipe(append_level_to_columns_of_dataframe,
                           lag, lag_label))

I wish there were an easy way to append a list of labels to the existing columns. Here's my solution.

def append_level_to_columns_of_dataframe(
        dataframe, new_level, name_of_new_level, inplace=False):
    """Given a (possibly MultiIndex) DataFrame, append labels to the column
    labels and assign this new level a name.

    Parameters
    ----------
    dataframe : a pandas DataFrame with an Index or MultiIndex columns

    new_level : scalar, or arraylike of length equal to the number of columns
    in `dataframe`
        The labels to put on the columns. If scalar, it is broadcast into a
        list of length equal to the number of columns in `dataframe`.

    name_of_new_level : str
        The label to give the new level.

    inplace : bool, optional, default: False
        Whether to modify `dataframe` in place or to return a copy
        that is modified.

    Returns
    -------
    dataframe_with_new_columns : pandas DataFrame with MultiIndex columns
        The original `dataframe` with new columns that have the given `level`
        appended to each column label.
    """
    old_columns = dataframe.columns

    if not hasattr(new_level, '__len__') or isinstance(new_level, str):
        new_level = [new_level] * dataframe.shape[1]

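    if isinstance(dataframe.columns, pd.MultiIndex):
        # Use the per-column label arrays, not `levels` (the unique values).
        old_arrays = [old_columns.get_level_values(i)
                      for i in range(old_columns.nlevels)]
        new_columns = pd.MultiIndex.from_arrays(
            old_arrays + [new_level],
            names=(list(old_columns.names) + [name_of_new_level]))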
    elif isinstance(dataframe.columns, pd.Index):
        new_columns = pd.MultiIndex.from_arrays(
            [old_columns] + [new_level],
            names=([old_columns.name] + [name_of_new_level]))

    if inplace:
        dataframe.columns = new_columns
        return dataframe
    else:
        copy_dataframe = dataframe.copy()
        copy_dataframe.columns = new_columns
        return copy_dataframe

Update: I learned from this solution another way to put a new level in a column, which makes it unnecessary to use append_level_to_columns_of_dataframe:

def shift_timeseries_by_lags_v2(df, lags, lag_label='lag'):
    return pd.concat({
        '{lag_label}_{lag_number}'.format(lag_label=lag_label, lag_number=lag):
        df.shift(lag)
        for lag in lags},
        axis=1)

Calling shift_timeseries_by_lags_v2(pd.DataFrame(s), [0, 1, 2]) returns a DataFrame whose columns form a MultiIndex, with the keys lag_0, lag_1 and lag_2 on the outer level and "population" on the inner level.
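One benefit of keeping the lag in a column level is that a single lag can be pulled out by label. The sketch below is a variation on the dict-based concat, not the answer's exact code: it uses the integer lags as keys and passes names=['lag'] to name the new outer level, then selects one lag with DataFrame.xs. The variable names lagged and lag1 are illustrative.

```python
import pandas as pd

indx = pd.Index([1, 2, 3, 4, 5], name='time')
s = pd.Series([5, 4, 3, 2, 1], index=indx, name='population')
df = pd.DataFrame(s)

# Dict keys become the outer column level; `names` labels that level.
lagged = pd.concat({lag: df.shift(lag) for lag in [0, 1, 2]},
                   axis=1, names=['lag'])

# Select every column that belongs to lag 1, dropping the 'lag' level.
lag1 = lagged.xs(1, axis=1, level='lag')
```

Here lag1 is a one-column DataFrame still labelled "population", so the original column names survive the round trip.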

Answer 5

Answered by Ashutosh Tripathi

For a dataframe df with the lag to be applied on 'col name', you can use the shift function.

df['lag1']=df['col name'].shift(1)
df['lag2']=df['col name'].shift(2)
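When every column needs the same lags, the whole DataFrame can be shifted at once instead of one column at a time. This is a minimal sketch under assumptions: the _lagN suffix and the sample columns a and b are borrowed from the question's edit for illustration only.

```python
import pandas as pd

df = pd.DataFrame({'a': [5, 4, 3, 2, 1], 'b': [50, 40, 30, 20, 10]},
                  index=[1, 2, 3, 4, 5])
lags = [1, 2]

# Shift all columns together and tag each shifted copy with its lag.
shifted = [df.shift(l).add_suffix('_lag%d' % l) for l in lags]
lagged = pd.concat([df] + shifted, axis=1)
```

Calling lagged.dropna() afterwards removes the leading rows that have no complete history.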

Answer 6

Answered by Bj?rn Backg?rd

For multiple (many of them) lags, this could be more compact:

df=pd.DataFrame({'year': range(2000, 2010), 'gdp': [234, 253, 256, 267, 272, 273, 271, 275, 280, 282]})
df.join(pd.DataFrame({'gdp_' + str(lag): df['gdp'].shift(lag) for lag in range(1,4)}))

Answer 7

Answered by c.Parsi

Assuming you are focusing on a single column of your DataFrame, saved in s, this short snippet generates lagged copies of that column (lags 0 through 2 here; widen the range for more).

s=pd.Series([5,4,3,2,1], index=[1,2,3,4,5], name='test')
shiftdf=pd.DataFrame()
for i in range(3):
    shiftdf = pd.concat([shiftdf , s.shift(i).rename(s.name+'_'+str(i))], axis=1)

shiftdf

   test_0  test_1  test_2
1       5     NaN     NaN
2       4     5.0     NaN
3       3     4.0     5.0
4       2     3.0     4.0
5       1     2.0     3.0

Answer 8

Answered by mac13k

Here is a cool one-liner for lagged features using pd.concat:

lagged = pd.concat([s.shift(lag) for lag in range(3)], axis=1).dropna()
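If the literal structure from the question is needed (a Series whose rows are lists), the lagged frame can be converted as a last step. This is a sketch only; the name as_lists is hypothetical, and the cast to int assumes the original data are integers. As noted in the accepted answer, working on the DataFrame itself is likely to be more efficient than working on lists.

```python
import pandas as pd

s = pd.Series([5, 4, 3, 2, 1], index=[1, 2, 3, 4, 5])
lagged = pd.concat([s.shift(lag) for lag in range(3)], axis=1).dropna()

# shift() introduces NaN, which makes the lag columns float; cast back
# to int once the incomplete rows are gone, then pack each row as a list.
as_lists = pd.Series(lagged.astype(int).to_numpy().tolist(),
                     index=lagged.index)
```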
