Splitting a pandas DataFrame into training and test sets when indexed by time

Note: this page reproduces a popular Stack Overflow question under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL, and attribute it to the original authors (not me). Original Stack Overflow question: http://stackoverflow.com/questions/27886331/

Time: 2020-09-13 22:49:54  Source: igfitidea

Tags: python, pandas

Asked by azuric

If I have a DataFrame indexed by time, how can I split it into training and test sets, with two thirds for training and one third for testing?

Do I have to create a new column of continuously increasing integers and then call set_index() on that new integer column?

Or can I do it while keeping the time index? If so, I have no idea how to do this.

Do I have to pick a date manually to act as the split point, or is there some other way?

Answered by EdChum

Just use iloc, which is an integer-based (positional) indexing method. The fact that the index has a time dtype is irrelevant when using iloc:

In [6]:

import pandas as pd

df = pd.DataFrame({'a':['1','2','3','4','5']})
split = 2 * len(df) // 3  # integer (floor) division: 10 // 3 == 3
df.iloc[:split]  # first two thirds of the rows

Out[6]:
   a
0  1
1  2
2  3
In [7]:

df.iloc[split:]  # remaining rows
Out[7]:
   a
3  4
4  5

Floor division is needed because 2 * len(df) / 3 evaluates to 3.3333 here, which is not a valid index value; slice bounds passed to iloc must be integers.
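The same positional split can be sketched on a frame that actually has a time index. The data below is hypothetical (nine daily rows in a column named value), and the sketch assumes the rows should be in chronological order, so it sorts the index first. Because iloc leaves the index alone, the split point can also be read back as a timestamp instead of being picked by hand.

```python
import numpy as np
import pandas as pd

# Hypothetical example data: nine daily observations with a DatetimeIndex.
idx = pd.date_range("2015-01-01", periods=9, freq="D")
df = pd.DataFrame({"value": np.arange(9)}, index=idx)

# Assumption: the split should be chronological, so make sure the index is sorted.
df = df.sort_index()

# Positional split point: first two thirds for training, the rest for testing.
split = 2 * len(df) // 3

train = df.iloc[:split]  # earliest observations
test = df.iloc[split:]   # latest observations

# The time index is kept, so the boundary is available as a timestamp.
cutoff = df.index[split]
```

Here train keeps its DatetimeIndex, and cutoff is the timestamp of the first test row, so no date has to be chosen manually.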

You can also use scikit-learn's cross-validation utilities, which will return train/test split indices for you.
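The answer does not name a specific scikit-learn function. As one possible sketch, the example below assumes train_test_split from sklearn.model_selection, called with shuffle=False so that row order is preserved and the test set is the final third. The frame and the column name value are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical example data: nine daily observations with a DatetimeIndex.
idx = pd.date_range("2015-01-01", periods=9, freq="D")
df = pd.DataFrame({"value": np.arange(9)}, index=idx)

# Number of test rows: whatever is left after taking two thirds for training.
n_test = len(df) - 2 * len(df) // 3

# Split row positions rather than the frame itself to get index arrays back.
# shuffle=False keeps the existing order, so no future rows leak into training.
train_idx, test_idx = train_test_split(
    np.arange(len(df)), test_size=n_test, shuffle=False
)

train = df.iloc[train_idx]
test = df.iloc[test_idx]
```

Note that the default shuffle=True would mix earlier and later rows, which is usually not what is wanted for time-ordered data.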