Python: Regression with a Date Variable Using Scikit-learn

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/16453644/

Time: 2020-08-18 22:39:58  Source: igfitidea

Regression with Date variable using Scikit-learn

Tags: python, python-2.7, numpy, pandas, scikit-learn

Asked by Nyxynyx

I have a Pandas DataFrame with a date column (e.g. 2013-04-01) of dtype datetime.date. When I include that column in X_train and try to fit the regression model, I get the error float() argument must be a string or a number. Removing the date column avoids this error.

What is the proper way to take the date into account in the regression model?

Code

data = sql.read_frame(...)
X_train = data.drop('y', axis=1)
y_train = data.y

rf = RandomForestRegressor().fit(X_train, y_train)

Error

TypeError                                 Traceback (most recent call last)
<ipython-input-35-8bf6fc450402> in <module>()
----> 2 rf = RandomForestRegressor().fit(X_train, y_train)

C:\Python27\lib\site-packages\sklearn\ensemble\forest.pyc in fit(self, X, y, sample_weight)
    292                 X.ndim != 2 or
    293                 not X.flags.fortran):
--> 294             X = array2d(X, dtype=DTYPE, order="F")
    295 
    296         n_samples, self.n_features_ = X.shape

C:\Python27\lib\site-packages\sklearn\utils\validation.pyc in array2d(X, dtype, order, copy)
     78         raise TypeError('A sparse matrix was passed, but dense data '
     79                         'is required. Use X.toarray() to convert to dense.')
---> 80     X_2d = np.asarray(np.atleast_2d(X), dtype=dtype, order=order)
     81     _assert_all_finite(X_2d)
     82     if X is X_2d and copy:

C:\Python27\lib\site-packages\numpy\core\numeric.pyc in asarray(a, dtype, order)
    318 
    319     """
--> 320     return array(a, dtype, copy=False, order=order)
    321 
    322 def asanyarray(a, dtype=None, order=None):

TypeError: float() argument must be a string or a number

Answered by Ando Saabas

You have two options. You can convert the date to an ordinal, i.e. an integer counting days in the proleptic Gregorian calendar, where January 1 of year 1 is day 1. You can do this with datetime.date's toordinal function.
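
As a minimal sketch of the ordinal approach, the snippet below builds a small stand-in for the question's DataFrame (the column names date and y and the values are assumptions for illustration) and adds an integer column that a regressor can consume:

```python
import datetime

import pandas as pd

# Hypothetical data standing in for the question's DataFrame.
df = pd.DataFrame({
    "date": [datetime.date(2013, 4, 1), datetime.date(2013, 5, 1)],
    "y": [1.0, 2.0],
})

# Replace each datetime.date with its proleptic Gregorian ordinal (an int).
df["date_ordinal"] = df["date"].map(datetime.date.toordinal)

# Drop the original object-dtype column so only numeric features remain.
X_train = df.drop(["y", "date"], axis=1)
y_train = df["y"]
```

Here X_train contains only the numeric date_ordinal column, so converting it to a float array no longer fails.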

Alternatively, you can turn the dates into categorical variables using sklearn's OneHotEncoder. It creates a new variable for each distinct date. So instead of a single column date with values ['2013-04-01', '2013-05-01'], you will have two columns: date_2013_04_01 with values [1, 0] and date_2013_05_01 with values [0, 1].
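
A rough illustration of the one-hot approach, assuming a recent scikit-learn (0.20 or later, where OneHotEncoder accepts string categories) and a made-up array of ISO date strings:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical date column as ISO strings; one row per sample.
dates = np.array([["2013-04-01"], ["2013-05-01"], ["2013-04-01"]])

encoder = OneHotEncoder()
# fit_transform returns a sparse matrix by default; densify it for display.
date_onehot = encoder.fit_transform(dates).toarray()

# encoder.categories_[0] lists the distinct dates in output-column order.
print(encoder.categories_[0])
print(date_onehot)
```

Note that, by default, transforming a date that was not present during fit raises an error; passing handle_unknown="ignore" to OneHotEncoder encodes such dates as all zeros instead.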

I would recommend the toordinal approach if you have many different dates, and the one-hot encoder if the number of distinct dates is small (say up to 10-100, depending on the size of your data and what sort of relation the date has with the output variable).

Answered by ogrisel

The best way is to explode the date into a set of categorical features encoded in boolean form using the 1-of-K encoding (e.g. as done by DictVectorizer). Here are some features that can be extracted from a date:

  • hour of the day (24 boolean features)
  • day of the week (7 boolean features)
  • day of the month (up to 31 boolean features)
  • month of the year (12 boolean features)
  • year (as many boolean features as there are distinct years in your dataset) ...

That should make it possible to identify linear dependencies on periodic events in typical human life cycles.
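
The feature list above could be sketched as follows with DictVectorizer. The timestamps and the feature names (hour, weekday, day, month, year) are assumptions for illustration; the values are passed as strings so that DictVectorizer treats them as categories and emits one boolean column per distinct value:

```python
from datetime import datetime

from sklearn.feature_extraction import DictVectorizer

# Hypothetical event timestamps.
timestamps = [
    datetime(2013, 4, 1, 9),
    datetime(2013, 4, 2, 17),
    datetime(2014, 5, 1, 9),
]


def calendar_features(ts):
    """Explode a timestamp into categorical (string-valued) features."""
    return {
        "hour": str(ts.hour),
        "weekday": str(ts.weekday()),  # Monday is "0"
        "day": str(ts.day),
        "month": str(ts.month),
        "year": str(ts.year),
    }


vec = DictVectorizer(sparse=False)
X = vec.fit_transform([calendar_features(ts) for ts in timestamps])

# One column per (feature, value) pair seen during fit, e.g. "weekday=0".
print(vec.feature_names_)
print(X)
```

Only the values that occur in the fitted data get a column, so with real data the hour feature grows to at most 24 columns, the weekday feature to 7, and so on.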

Additionally, you can extract the date as a single float: convert each date to the number of days since the min date of your training set, and divide by the number of days between the min date and the max date of the training set. That numerical feature should make it possible to identify long-term trends between the output and the event date: e.g. a linear slope in a regression problem, to better predict the evolution in forthcoming years, which cannot be encoded with the boolean categorical variable for the year feature.
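
A minimal sketch of that scaling in plain Python 3 (the helper name fit_date_scaler and the example dates are hypothetical). The min and max are taken from the training dates only and then reused for later dates, so future dates map to values above 1:

```python
from datetime import date


def fit_date_scaler(train_dates):
    """Return a function mapping a date to (days since min) / (max - min)."""
    first, last = min(train_dates), max(train_dates)
    span_days = (last - first).days
    if span_days == 0:
        raise ValueError("training dates must span more than one day")

    def scale(d):
        return (d - first).days / span_days

    return scale


train_dates = [date(2013, 1, 1), date(2013, 7, 2), date(2013, 12, 31)]
scale = fit_date_scaler(train_dates)

train_feature = [scale(d) for d in train_dates]  # values within [0, 1]
future_feature = scale(date(2014, 12, 31))       # beyond the training range
```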

Answered by Danylo Zherebetskyy

Before doing boolean encoding using the 1-of-K encoding suggested by @ogrisel, you may try enriching your data and experimenting with the features that you can extract from the datetime type, e.g. day of week, day of month, day of year, week of year, quarter, etc. See for example https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DatetimeIndex.weekofyear.html and links to other functions.
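
As a sketch of such enrichment, assuming pandas 1.1 or later and a hypothetical date column: the snippet uses the Series .dt accessor, and takes the week number from .dt.isocalendar() because the weekofyear attribute linked above is no longer available in recent pandas releases.

```python
import pandas as pd

# Hypothetical frame with a datetime64 column named "date".
df = pd.DataFrame({"date": pd.to_datetime(["2013-04-01", "2013-12-25"])})

features = pd.DataFrame({
    "day_of_week": df["date"].dt.dayofweek,   # Monday is 0
    "day_of_month": df["date"].dt.day,
    "day_of_year": df["date"].dt.dayofyear,
    # ISO 8601 week number; isocalendar() returns year, week, day columns.
    "week_of_year": df["date"].dt.isocalendar().week.astype(int),
    "quarter": df["date"].dt.quarter,
})
print(features)
```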

Answered by Sebastian N

Often it's better to keep the number of features low, and not much information is needed from the timestamp. In my case it was enough to keep the date as a day difference from the initial timestamp. This preserves the order and leaves you with only one (ordinal) feature.

df['DAY_DELTA'] = (df.TIMESTAMP - df.TIMESTAMP.min()).dt.days

Of course, this will not capture behaviour within a single day (hour-dependent effects). So you may want to go down to the time scale that best captures the changing behaviour in your data.

For Hours:

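# Note: components['hours'] is only the hours part (0-23) of each delta, not the total elapsed hours.
df['HOURS_DELTA'] = (df.TIMESTAMP - df.TIMESTAMP.min()).dt.components['hours']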

The code above adds a new column with the delta value. To remove the old TIMESTAMP column, do this afterwards:

df = df.drop('TIMESTAMP', axis=1)
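
Putting these steps together, here is a small self-contained sketch on made-up timestamps. The column names HOUR_OF_DELTA and TOTAL_HOURS_DELTA are assumptions added for illustration; they contrast the hours component of each delta with the total elapsed hours, in case the latter is what your model needs:

```python
import pandas as pd

# Hypothetical data with a datetime64 TIMESTAMP column.
df = pd.DataFrame({
    "TIMESTAMP": pd.to_datetime(
        ["2013-04-01 00:00", "2013-04-02 06:00", "2013-04-03 12:00"]
    ),
    "y": [1.0, 2.0, 3.0],
})

elapsed = df["TIMESTAMP"] - df["TIMESTAMP"].min()

# Whole days since the first timestamp.
df["DAY_DELTA"] = elapsed.dt.days
# Hours part of each delta only (0-23), ignoring whole days.
df["HOUR_OF_DELTA"] = elapsed.dt.components["hours"]
# Total elapsed hours, whole days included.
df["TOTAL_HOURS_DELTA"] = (elapsed.dt.total_seconds() // 3600).astype(int)

# Drop the raw timestamp so only numeric columns remain.
df = df.drop("TIMESTAMP", axis=1)
```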