Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep the same CC BY-SA license, cite the original URL and authors, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/31517728/

Date: 2020-09-13 23:38:51  Source: igfitidea

Python Pandas: Detecting frequency of time series

python, pandas

Asked by Jim

Assume I have loaded time series data from SQL or CSV (not created in Python); the index would be:

DatetimeIndex(['2015-03-02 00:00:00', '2015-03-02 01:00:00',
               '2015-03-02 02:00:00', '2015-03-02 03:00:00',
               '2015-03-02 04:00:00', '2015-03-02 05:00:00',
               '2015-03-02 06:00:00', '2015-03-02 07:00:00',
               '2015-03-02 08:00:00', '2015-03-02 09:00:00', 
               ...
               '2015-07-19 14:00:00', '2015-07-19 15:00:00',
               '2015-07-19 16:00:00', '2015-07-19 17:00:00',
               '2015-07-19 18:00:00', '2015-07-19 19:00:00',
               '2015-07-19 20:00:00', '2015-07-19 21:00:00',
               '2015-07-19 22:00:00', '2015-07-19 23:00:00'],
              dtype='datetime64[ns]', name=u'hour', length=3360, freq=None, tz=None)

As you can see, 'freq' is None. How can I detect the frequency of this series and set 'freq' to that frequency?

If possible, I would like this to work even when the data isn't continuous (there are plenty of breaks in the series).

I was trying to find the mode of all the differences between consecutive timestamps, but I am not sure how to convert it into a format that a Series can read.

Accepted answer by Jianxun Li

Maybe try taking the differences of the time index and using the mode (or the smallest difference) as the freq.

import pandas as pd
import numpy as np

# simulate some data
# ===================================
np.random.seed(0)
dt_rng = pd.date_range('2015-03-02 00:00:00', '2015-07-19 23:00:00', freq='H')
dt_idx = pd.DatetimeIndex(np.random.choice(dt_rng, size=2000, replace=False))
df = pd.DataFrame(np.random.randn(2000), index=dt_idx, columns=['col']).sort_index()
df

                        col
2015-03-02 01:00:00  2.0261
2015-03-02 04:00:00  1.3325
2015-03-02 05:00:00 -0.9867
2015-03-02 06:00:00 -0.0671
2015-03-02 08:00:00 -1.1131
2015-03-02 09:00:00  0.0494
2015-03-02 10:00:00 -0.8130
2015-03-02 11:00:00  1.8453
...                     ...
2015-07-19 13:00:00 -0.4228
2015-07-19 14:00:00  1.1962
2015-07-19 15:00:00  1.1430
2015-07-19 16:00:00 -1.0080
2015-07-19 18:00:00  0.4009
2015-07-19 19:00:00 -1.8434
2015-07-19 20:00:00  0.5049
2015-07-19 23:00:00 -0.5349

[2000 rows x 1 columns]

# processing
# ==================================
# the gap distribution
res = (pd.Series(df.index[1:]) - pd.Series(df.index[:-1])).value_counts()

01:00:00    1181
02:00:00     499
03:00:00     180
04:00:00      93
05:00:00      24
06:00:00      10
07:00:00       9
08:00:00       3
dtype: int64

# value_counts sorts by count, so the first entry is the mode;
# the mode can be considered as the frequency
res.index[0]  # output: Timedelta('0 days 01:00:00')
# or maybe the smallest difference
res.index.min()  # output: Timedelta('0 days 01:00:00')

# get full datetime rng
full_rng = pd.date_range(df.index[0], df.index[-1], freq=res.index[0])
full_rng

DatetimeIndex(['2015-03-02 01:00:00', '2015-03-02 02:00:00',
               '2015-03-02 03:00:00', '2015-03-02 04:00:00',
               '2015-03-02 05:00:00', '2015-03-02 06:00:00',
               '2015-03-02 07:00:00', '2015-03-02 08:00:00',
               '2015-03-02 09:00:00', '2015-03-02 10:00:00', 
               ...
               '2015-07-19 14:00:00', '2015-07-19 15:00:00',
               '2015-07-19 16:00:00', '2015-07-19 17:00:00',
               '2015-07-19 18:00:00', '2015-07-19 19:00:00',
               '2015-07-19 20:00:00', '2015-07-19 21:00:00',
               '2015-07-19 22:00:00', '2015-07-19 23:00:00'],
              dtype='datetime64[ns]', length=3359, freq='H', tz=None)
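To actually end up with an index whose 'freq' is set, one option is to reindex the data onto that full range, leaving NaN where timestamps were missing. The following is only a minimal sketch of that idea: the helper names detect_step and regularize and the small demo frame are hypothetical, and it assumes the most common gap really is the sampling step.

```python
import pandas as pd


def detect_step(index):
    """Return the most common gap between consecutive timestamps."""
    gaps = index.to_series().diff().dropna()
    return gaps.mode().iloc[0]


def regularize(df):
    """Reindex df onto an evenly spaced index built from the detected step."""
    df = df.sort_index()
    step = detect_step(df.index)
    full_rng = pd.date_range(df.index[0], df.index[-1], freq=step)
    # rows for timestamps that were missing are filled with NaN
    return df.reindex(full_rng)


# hypothetical demo data: 48 hourly stamps with six of them removed
hourly = pd.date_range('2015-03-02', periods=48, freq=pd.Timedelta(hours=1))
gappy = hourly.delete([3, 4, 10, 20, 21, 22])
demo = pd.DataFrame({'col': range(len(gappy))}, index=gappy)

regular = regularize(demo)
print(detect_step(demo.index))           # the detected step
print(regular.index.freq)                # no longer None
print(int(regular['col'].isna().sum()))  # number of rows added for the gaps
```

If the gaps should stay out of the data, keep the original frame and only use the detected step where a frequency is needed.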

Answer by Delforge

It is worth mentioning that if the data is continuous, you can use the pandas.DatetimeIndex.inferred_freq property:

dt_ix = pd.date_range('2015-03-02 00:00:00', '2015-07-19 23:00:00', freq='H')
dt_ix = pd.DatetimeIndex(dt_ix.values)  # rebuild the index so the stored freq is dropped (freq=None)
dt_ix.inferred_freq
Out[2]: 'H'

or the pandas.infer_freq function:

pd.infer_freq(dt_ix)
Out[3]: 'H'

If the data is not continuous, pandas.infer_freq will return None. Similar to what has already been proposed, another alternative is to use the pandas.Series.diff method:

split_ix = dt_ix.drop(pd.date_range('2015-05-01 00:00:00','2015-05-30 00:00:00', freq='1H'))
split_ix.to_series().diff().min()
Out[4]: Timedelta('0 days 01:00:00')
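The two approaches can be combined: try pandas.infer_freq first and fall back to the smallest difference when it returns None. This is a rough sketch, not a definitive recipe; the helper name guess_offset and the demo indexes are made up for illustration, and it assumes a sorted index with at least three timestamps (infer_freq needs that many).

```python
import pandas as pd
from pandas.tseries.frequencies import to_offset


def guess_offset(index):
    """Try pd.infer_freq first; fall back to the smallest gap if it returns None."""
    inferred = pd.infer_freq(index)
    if inferred is not None:
        return to_offset(inferred)
    # assumption: the smallest gap between neighbours is the sampling step
    smallest_gap = index.to_series().diff().min()
    return to_offset(smallest_gap)


# hypothetical demo data: one continuous hourly index and one with holes
continuous = pd.date_range('2015-03-02', periods=48, freq=pd.Timedelta(hours=1))
broken = continuous.delete([3, 4, 10])

print(guess_offset(continuous))
print(guess_offset(broken))
```

Both calls return a pandas offset object, so the result can be passed as freq to pd.date_range in either case.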

Answer by mdurant

The minimum time difference is found with

np.diff(df.index.values).min()

which is normally in units of nanoseconds. To get a frequency in Hz, assuming nanoseconds:

freq = 1e9 / np.diff(df.index.values).min().astype('int64')
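If relying on the nanosecond unit feels fragile, the same number can probably be obtained through pandas Timedelta objects, which report seconds regardless of the underlying resolution. A small sketch with a made-up index (stamps, period_s and freq_hz are illustrative names only):

```python
import pandas as pd

# hypothetical demo index: hourly stamps with two of them removed
stamps = pd.date_range('2015-03-02', periods=48, freq=pd.Timedelta(hours=1)).delete([5, 6])

# smallest gap between consecutive timestamps, as a pandas Timedelta
smallest_gap = stamps.to_series().diff().min()

# convert to seconds without assuming the underlying unit is nanoseconds
period_s = smallest_gap.total_seconds()
freq_hz = 1.0 / period_s

print(period_s, freq_hz)
```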