Time Series Analysis - unevenly spaced measures - pandas + statsmodels

Note: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must follow the same license and attribute it to the original authors (not me). StackOverFlow original: http://stackoverflow.com/questions/34494780/

Date: 2020-09-14 00:26:15  Source: igfitidea

python pandas machine-learning time-series statsmodels

Asked by Robin

I have two numpy arrays, light_points and time_points, and would like to use some time series analysis methods on that data.

I then tried this:

import statsmodels.api as sm
import pandas as pd
tdf = pd.DataFrame({'time':time_points[:]})
rdf =  pd.DataFrame({'light':light_points[:]})
rdf.index = pd.DatetimeIndex(freq='w',start=0,periods=len(rdf.light))
#rdf.index = pd.DatetimeIndex(tdf['time'])

This runs, but it does not do the correct thing: the measurements are not evenly spaced in time. If I instead use time_points (the tdf['time'] column) as the index of my frame, I get an error:

rdf.index = pd.DatetimeIndex(tdf['time'])

decomp = sm.tsa.seasonal_decompose(rdf)

elif freq is None:
raise ValueError("You must specify a freq or x must be a pandas object with a timeseries index")

ValueError: You must specify a freq or x must be a pandas object with a timeseries index

I don't know how to correct this. Also, it seems that pandas' TimeSeries is deprecated.

I tried this:

rdf = pd.Series({'light':light_points[:]})
rdf.index = pd.DatetimeIndex(tdf['time'])

But it gives me a length mismatch:

ValueError: Length mismatch: Expected axis has 1 elements, new values have 122 elements

However, I don't understand where it comes from, as rdf['light'] and tdf['time'] have the same length...
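For reference, the mismatch can be reproduced in isolation. The sketch below uses placeholder random data (an assumption; the real light_points is not shown) with 122 values, matching the count in the error message. It illustrates that a dict passed to pd.Series yields one element per key, not one per measurement:

```python
import numpy as np
import pandas as pd

# Placeholder data standing in for the real measurements (assumption).
light_points = np.random.rand(122)

# A dict becomes one entry per key: this Series has a single element,
# labelled 'light', whose value is the whole array.
as_dict = pd.Series({'light': light_points})
print(len(as_dict))

# Passing the array directly gives one element per measurement.
as_array = pd.Series(light_points, name='light')
print(len(as_array))
```

Assigning a 122-element index to the first Series is what triggers "Expected axis has 1 elements, new values have 122 elements".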

Eventually, I tried defining rdf as a pandas Series:

rdf = pd.Series(light_points[:],index=pd.DatetimeIndex(time_points[:]))

And I get this:

ValueError: You must specify a freq or x must be a pandas object with a timeseries index

Then I tried replacing the index with:

 pd.TimeSeries(time_points[:])

And it gives me an error on the seasonal_decompose line:

AttributeError: 'Float64Index' object has no attribute 'inferred_freq'

How can I work with unevenly spaced data? I was thinking about creating an approximately evenly spaced time array by adding many unknown values between the existing values and using interpolation to estimate those points, but I think there could be a cleaner and easier solution.

Answered by Stefan

seasonal_decompose() requires a freq that is either provided as part of the DatetimeIndex meta information, can be inferred by pandas.Index.inferred_freq, or else is supplied by the user as an integer that gives the number of periods per cycle, e.g., 12 for monthly data (from the docstring for seasonal_mean):

def seasonal_decompose(x, model="additive", filt=None, freq=None):
    """
    Parameters
    ----------
    x : array-like
        Time series
    model : str {"additive", "multiplicative"}
        Type of seasonal component. Abbreviations are accepted.
    filt : array-like
        The filter coefficients for filtering out the seasonal component.
        The default is a symmetric moving average.
    freq : int, optional
        Frequency of the series. Must be used if x is not a pandas
        object with a timeseries index.

To illustrate, using random sample data:

import numpy as np
from datetime import datetime
length = 400
# noisy sine wave, one observation per week
x = np.sin(np.arange(length)) * 10 + np.random.randn(length)
df = pd.DataFrame(data=x, index=pd.date_range(start=datetime(2015, 1, 1), periods=length, freq='W'), columns=['value'])

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 400 entries, 2015-01-04 to 2022-08-28
Freq: W-SUN

decomp = sm.tsa.seasonal_decompose(df)
data = pd.concat([df, decomp.trend, decomp.seasonal, decomp.resid], axis=1)
data.columns = ['series', 'trend', 'seasonal', 'resid']

Data columns (total 4 columns):
series      400 non-null float64
trend       348 non-null float64
seasonal    400 non-null float64
resid       348 non-null float64
dtypes: float64(4)
memory usage: 15.6 KB

So far, so good. Now randomly drop elements from the DatetimeIndex to create unevenly spaced data:

df = df.iloc[np.unique(np.random.randint(low=0, high=length, size=int(length * .8)))]

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 222 entries, 2015-01-11 to 2022-08-21
Data columns (total 1 columns):
value    222 non-null float64
dtypes: float64(1)
memory usage: 3.5 KB

df.index.freq

None

df.index.inferred_freq

None

Running seasonal_decompose on this data 'works':

decomp = sm.tsa.seasonal_decompose(df, freq=52)

data = pd.concat([df, decomp.trend, decomp.seasonal, decomp.resid], axis=1)
data.columns = ['series', 'trend', 'seasonal', 'resid']

DatetimeIndex: 224 entries, 2015-01-04 to 2022-08-07
Data columns (total 4 columns):
series      224 non-null float64
trend       172 non-null float64
seasonal    224 non-null float64
resid       172 non-null float64
dtypes: float64(4)
memory usage: 8.8 KB
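One caveat with an integer freq: as far as I can tell, the decomposition then works on row positions, not on the dates in the index. The sketch below builds a gappy weekly index in the same spirit as above and measures how much calendar time 52 consecutive rows cover. The seed, the `rng` generator and the choice of 222 kept rows are assumptions made for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, only for reproducibility
length = 400
index = pd.date_range(start="2015-01-01", periods=length, freq="W")

# Keep 222 of the 400 weekly timestamps, chosen at random.
keep = np.sort(rng.choice(length, size=222, replace=False))
uneven_index = index[keep]

# Calendar time between the first and last row of every window of
# 52 consecutive observations. On a complete weekly grid this would
# always be exactly 51 weeks.
span = uneven_index[51:] - uneven_index[:-51]
print(span.min(), span.max())
```

With weeks missing, a window of 52 rows stretches well beyond one year, so a "cycle" of 52 rows no longer lines up with the calendar year.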

The question is how useful the result is. Even without gaps in the data that complicate the inference of seasonal patterns (see the example use of .interpolate() in the release notes), statsmodels qualifies this procedure as follows:

Notes
-----
This is a naive decomposition. More sophisticated methods should
be preferred.

The additive model is Y[t] = T[t] + S[t] + e[t]

The multiplicative model is Y[t] = T[t] * S[t] * e[t]

The seasonal component is first removed by applying a convolution
filter to the data. The average of this smoothed series for each
period is the returned seasonal component.
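The interpolation idea raised in the question can be sketched with pandas alone. This is a minimal illustration, not a recommendation: the seed, the fraction of rows kept and the variable names are assumptions, and interpolated values are invented data that will smooth the noise. It puts the uneven observations back on a weekly grid with resample() and fills the empty weeks with interpolate():

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, only for reproducibility
length = 400

# Regular weekly series, built like the sample data above.
full_index = pd.date_range(start="2015-01-01", periods=length, freq="W")
values = np.sin(np.arange(length)) * 10 + rng.standard_normal(length)
df = pd.DataFrame({"value": values}, index=full_index)

# Drop rows at random to get unevenly spaced observations.
keep = np.sort(rng.choice(length, size=int(length * 0.6), replace=False))
uneven = df.iloc[keep]

# Back onto a weekly grid: weeks without an observation become NaN,
# then they are filled by linear interpolation in time.
regular = uneven.resample("W").mean().interpolate(method="time")

print(uneven.index.inferred_freq)
print(pd.infer_freq(regular.index))
```

The resulting frame has a regular weekly index again, so it can be passed to sm.tsa.seasonal_decompose without an explicit freq. Note that in newer statsmodels releases the freq argument has been renamed to period.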