Note: this page reproduces a popular Stack Overflow question and its answer, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me): Stack Overflow
Original URL: http://stackoverflow.com/questions/34494780/
Time Series Analysis - unevenly spaced measures - pandas + statsmodels
Asked by Robin
I have two numpy arrays light_points and time_points and would like to use some time series analysis methods on those data.
I then tried this:
import statsmodels.api as sm
import pandas as pd
tdf = pd.DataFrame({'time':time_points[:]})
rdf = pd.DataFrame({'light':light_points[:]})
rdf.index = pd.DatetimeIndex(freq='w',start=0,periods=len(rdf.light))
#rdf.index = pd.DatetimeIndex(tdf['time'])
This works, but it does not do the right thing. The measurements are not evenly spaced in time, and if I simply declare the time_points pandas DataFrame as the index of my frame, I get an error:
rdf.index = pd.DatetimeIndex(tdf['time'])
decomp = sm.tsa.seasonal_decompose(rdf)
elif freq is None:
raise ValueError("You must specify a freq or x must be a pandas object with a timeseries index")
ValueError: You must specify a freq or x must be a pandas object with a timeseries index
I don't know how to correct this. Also, it seems that pandas' TimeSeries is deprecated.
I tried this:
rdf = pd.Series({'light':light_points[:]})
rdf.index = pd.DatetimeIndex(tdf['time'])
But it gives me a length mismatch:
ValueError: Length mismatch: Expected axis has 1 elements, new values have 122 elements
However, I don't understand where it comes from, since rdf['light'] and tdf['time'] have the same length...
Finally, I tried defining rdf as a pandas Series:
rdf = pd.Series(light_points[:],index=pd.DatetimeIndex(time_points[:]))
And I get this:
ValueError: You must specify a freq or x must be a pandas object with a timeseries index
Then I tried replacing the index with:
pd.TimeSeries(time_points[:])
And it gives me an error on the line that calls seasonal_decompose:
AttributeError: 'Float64Index' object has no attribute 'inferred_freq'
How can I work with unevenly spaced data? I was thinking about creating an approximately evenly spaced time array by adding many unknown values between the existing values and using interpolation to "evaluate" those points, but I think there could be a cleaner and easier solution.
Answered by Stefan
seasonal_decompose() requires a freq that is either provided as part of the DatetimeIndex meta information, can be inferred via pandas.Index.inferred_freq, or else is supplied by the user as an integer that gives the number of periods per cycle, e.g., 12 for monthly data (from the docstring for seasonal_mean):
def seasonal_decompose(x, model="additive", filt=None, freq=None):
    """
    Parameters
    ----------
    x : array-like
        Time series
    model : str {"additive", "multiplicative"}
        Type of seasonal component. Abbreviations are accepted.
    filt : array-like
        The filter coefficients for filtering out the seasonal component.
        The default is a symmetric moving average.
    freq : int, optional
        Frequency of the series. Must be used if x is not a pandas
        object with a timeseries index.
To illustrate, using random sample data:
import numpy as np
from datetime import datetime
length = 400
x = np.sin(np.arange(length)) * 10 + np.random.randn(length)
df = pd.DataFrame(data=x, index=pd.date_range(start=datetime(2015, 1, 1), periods=length, freq='W'), columns=['value'])
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 400 entries, 2015-01-04 to 2022-08-28
Freq: W-SUN
decomp = sm.tsa.seasonal_decompose(df)
data = pd.concat([df, decomp.trend, decomp.seasonal, decomp.resid], axis=1)
data.columns = ['series', 'trend', 'seasonal', 'resid']
Data columns (total 4 columns):
series 400 non-null float64
trend 348 non-null float64
seasonal 400 non-null float64
resid 348 non-null float64
dtypes: float64(4)
memory usage: 15.6 KB
So far, so good. Now randomly drop elements from the DatetimeIndex to create unevenly spaced data:
df = df.iloc[np.unique(np.random.randint(low=0, high=length, size=int(length * .8)))]
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 222 entries, 2015-01-11 to 2022-08-21
Data columns (total 1 columns):
value 222 non-null float64
dtypes: float64(1)
memory usage: 3.5 KB
df.index.freq
None
df.index.inferred_freq
None
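The same behaviour can be reproduced in isolation. The following is a minimal sketch, assuming a small weekly index and an arbitrary, hand-picked set of positions to keep (both are illustrative and not taken from the data above): a regular index carries a frequency, while a gapped one has neither a stored nor an inferable frequency.

```python
import numpy as np
import pandas as pd

# Regular weekly index: pandas stores the frequency on the index itself.
regular = pd.date_range(start="2015-01-01", periods=20, freq="W")
print(regular.freq)           # a weekly offset anchored on Sunday
print(regular.inferred_freq)  # 'W-SUN'

# Keep only some positions so the spacing becomes uneven
# (the positions are arbitrary and chosen only for illustration).
keep = np.array([0, 1, 2, 5, 6, 9, 10, 11, 15, 19])
gapped = regular[keep]
print(gapped.freq)            # None
print(gapped.inferred_freq)   # None
```

This is why seasonal_decompose() cannot work out the cycle length on its own once rows are missing.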
Running seasonal_decompose on this data 'works':
decomp = sm.tsa.seasonal_decompose(df, freq=52)
data = pd.concat([df, decomp.trend, decomp.seasonal, decomp.resid], axis=1)
data.columns = ['series', 'trend', 'seasonal', 'resid']
DatetimeIndex: 224 entries, 2015-01-04 to 2022-08-07
Data columns (total 4 columns):
series 224 non-null float64
trend 172 non-null float64
seasonal 224 non-null float64
resid 172 non-null float64
dtypes: float64(4)
memory usage: 8.8 KB
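Note that passing an integer cycle length treats the remaining rows as if they were consecutive, evenly spaced observations. One possible alternative, along the lines of the interpolation idea in the question, is to put the data back on a regular grid first. The sketch below is only an illustration under stated assumptions: the sample series, the seed, the weekly grid and the choice of time-based linear interpolation are all hypothetical choices, not a recommendation from statsmodels.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: a weekly series with some rows removed.
rng = np.random.default_rng(0)
full_index = pd.date_range(start="2015-01-04", periods=104, freq="W")
values = np.sin(np.arange(104)) * 10 + rng.standard_normal(104)
series = pd.Series(values, index=full_index, name="value")
keep = np.sort(rng.choice(104, size=80, replace=False))
gapped = series.iloc[keep]

# Put the observations back on a regular weekly grid; missing weeks become NaN.
regular = gapped.asfreq("W")

# Fill the gaps by linear interpolation in time (one possible choice among many).
filled = regular.interpolate(method="time")

print(regular.index.freq)         # the weekly frequency is set on the index again
print(int(regular.isna().sum()))  # number of weeks that had to be filled
print(int(filled.isna().sum()))   # no missing values remain
```

Here every observation already lies on the weekly grid, so asfreq() keeps all of them. For timestamps that do not fall on the grid, something like resample("W").mean() is probably a better fit, because asfreq() only keeps values at exact grid points. Either way, the filled values are estimates and will affect whatever decomposition is computed afterwards.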
The question is how useful the result is. Even without gaps in the data that complicate inference of seasonal patterns (see the example use of .interpolate() in the release notes), statsmodels qualifies this procedure as follows:
Notes
-----
This is a naive decomposition. More sophisticated methods should be preferred.

The additive model is Y[t] = T[t] + S[t] + e[t]

The multiplicative model is Y[t] = T[t] * S[t] * e[t]

The seasonal component is first removed by applying a convolution filter to the data. The average of this smoothed series for each period is the returned seasonal component.
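To make the additive model in these notes more concrete, here is a simplified sketch of a moving-average decomposition written with NumPy only. It is not the statsmodels implementation: the function name naive_additive_decompose and the synthetic test series are made up for illustration, and the code assumes evenly spaced observations with a known integer cycle length.

```python
import numpy as np

def naive_additive_decompose(y, period):
    """Simplified sketch of Y[t] = T[t] + S[t] + e[t] (hypothetical helper)."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    # Centered moving-average weights spanning one full cycle.
    if period % 2 == 0:
        weights = np.r_[0.5, np.ones(period - 1), 0.5] / period
    else:
        weights = np.ones(period) / period
    half = len(weights) // 2
    trend = np.full(n, np.nan)
    # 'valid' only returns positions where the whole window fits,
    # so the first and last `half` values of the trend stay NaN.
    trend[half:n - half] = np.convolve(y, weights, mode="valid")
    detrended = y - trend
    # Average the detrended values at each position within the cycle.
    seasonal_means = np.array(
        [np.nanmean(detrended[i::period]) for i in range(period)]
    )
    seasonal_means -= seasonal_means.mean()  # center the seasonal component
    seasonal = np.tile(seasonal_means, n // period + 1)[:n]
    resid = y - trend - seasonal
    return trend, seasonal, resid

# Synthetic example: linear trend plus a clean 12-step cycle.
t = np.arange(48)
y = 0.5 * t + 3 * np.sin(2 * np.pi * t / 12)
trend, seasonal, resid = naive_additive_decompose(y, period=12)
print(int(np.isnan(trend).sum()))  # 12: half a cycle is lost at each end
```

The NaN count equals the cycle length, which matches the output above, where trend and resid have 52 fewer non-null values than the series when the cycle length is 52. The seasonal step averages every period-th row, which is only meaningful if those rows really are one cycle apart; with gaps in the data that assumption no longer holds.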