Python: convert dates to floats for linear regression on a Pandas data frame

Question

Asked by Quetzalcoatl

It seems that for OLS linear regression to work well in Pandas, the arguments must be floats. I'm starting with a csv (called "gameAct.csv") of the form:

date, city, players, sales
2014-04-28,London,111,1091.28
2014-04-29,London,100,1100.44
2014-04-28,Paris,87,1001.33
...

I want to perform linear regression of how sales depend on date (as time moves forward, how do sales move?). The problem with my code below seems to be with dates not being float values. I would appreciate help on how to resolve this indexing problem in Pandas.

My current code (it runs, but does not give correct results):

import pandas as pd
from pandas import DataFrame, Series
import statsmodels.formula.api as sm

df = pd.read_csv('gameAct.csv')
df.columns = ['date', 'city', 'players', 'sales']
city_data = df[df['city'] == 'London']
result = sm.ols(formula = 'sales ~ date', data = city_data).fit()

As I vary the city value, I get R^2 = 1 for every city, which is wrong. I have also tried passing index_col = 0, parse_dates = True when defining the dataframe df, but without success.

I suspect there is a better way to read in such csv files to perform basic regression over dates, and also for more general time series analysis. Help, examples, and resources are appreciated!

Note, with the above code, if I convert the dates index (for a given city) to an array, the values in this array are of the form:

'\xef\xbb\xbf2014-04-28'

How does one produce an AIC analysis over all of the non-sales parameters? (e.g. the result might be that sales depend most linearly on date and city).

Answer 1

Accepted answer by Tom Q.

For this kind of regression, I usually convert the dates or timestamps to an integer number of days since the start of the data.

This does the trick nicely:

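import numpy as np
import pandas as pd
import statsmodels.formula.api as sm

df = pd.read_csv('test.csv')
df['date'] = pd.to_datetime(df['date'])
# Days elapsed since the earliest date in the data, as floats
df['date_delta'] = (df['date'] - df['date'].min()) / np.timedelta64(1, 'D')
city_data = df[df['city'] == 'London']
result = sm.ols(formula='sales ~ date_delta', data=city_data).fit()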

The advantage of this method is that you're sure of the units involved in the regression (days), whereas an automatic conversion may implicitly use other units, creating confusing coefficients in your linear model. It also allows you to combine data from multiple sales campaigns that started at different times into your regression (say you're interested in effectiveness of a campaign as a function of days into the campaign). You could also pick Jan 1st as your 0 if you're interested in measuring the day of year trend. Picking your own 0 date puts you in control of all that.
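
As a minimal sketch of the point about units and picking your own zero, the snippet below builds a small synthetic frame and fits a straight line twice, once with the first date as zero and once with a hand-picked zero. The sales figures are invented for illustration (they are not from gameAct.csv), the names origin and days_since_origin are hypothetical, and numpy.polyfit is used here only as a lightweight stand-in for sm.ols.

```python
import numpy as np
import pandas as pd

# Synthetic data for illustration only: sales grow by exactly 5 per day.
df = pd.DataFrame({
    'date': pd.date_range('2014-04-28', periods=10, freq='D'),
})
df['sales'] = 1000.0 + 5.0 * np.arange(len(df))

# Zero point = first date in the data, units = days.
df['date_delta'] = (df['date'] - df['date'].min()) / np.timedelta64(1, 'D')

# Alternative zero point chosen by hand (hypothetical): 1 January 2014.
origin = pd.Timestamp('2014-01-01')
df['days_since_origin'] = (df['date'] - origin) / np.timedelta64(1, 'D')

# np.polyfit returns [slope, intercept] for a degree-1 fit.
slope_a, intercept_a = np.polyfit(df['date_delta'], df['sales'], 1)
slope_b, intercept_b = np.polyfit(df['days_since_origin'], df['sales'], 1)

print(slope_a, intercept_a)
print(slope_b, intercept_b)
```

Both fits should report the same slope, in sales per day. Only the intercept moves, because it is the predicted sales value on whichever date you chose as day 0.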

There's also evidence that statsmodels supports timeseries from pandas. You may be able to apply this to linear models as well: http://statsmodels.sourceforge.net/stable/examples/generated/ex_dates.html

Also, a quick note: you should be able to read column names directly from the csv, as in the sample code I posted. In your example there are spaces after the commas in the first line of the csv file, which results in column names with a leading space, like ' city'. Remove the spaces and automatic csv header reading should just work.
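
If editing the file is not an option, the header can probably be cleaned up at read time instead. The sketch below is one possible approach, not the only one. It reads an in-memory copy of the sample rows from the question, with a UTF-8 byte-order mark in front (the '\xef\xbb\xbf' prefix the asker saw on the first date) and spaces after the header commas. The variable raw is a hypothetical stand-in for gameAct.csv.

```python
import io
import pandas as pd

# In-memory stand-in for gameAct.csv: UTF-8 BOM first,
# then a header line with spaces after the commas.
raw = (b'\xef\xbb\xbf'
       b'date, city, players, sales\n'
       b'2014-04-28,London,111,1091.28\n'
       b'2014-04-29,London,100,1100.44\n'
       b'2014-04-28,Paris,87,1001.33\n')

df = pd.read_csv(
    io.BytesIO(raw),
    encoding='utf-8-sig',      # drops the leading byte-order mark
    skipinitialspace=True,     # ignores spaces that follow a delimiter
    parse_dates=['date'],      # parses the column into datetime values
)

print(df.columns.tolist())
print(df.dtypes)
```

With the column names clean and the date column parsed, the manual df.columns assignment from the question should no longer be needed.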

Answer 2

Answered by Wyrmwood

I'm not sure about the specifics of statsmodels, but this post lists all the date/time conversions for Python. They aren't always one-to-one, so it's a reference I use often ;-)

Answer 3

Answered by dkorsakas

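(df.date - df.date.min()).dt.total_seconds()

If your date column has dtype datetime64[ns], subtract a reference date first: the result is a timedelta64[ns] Series, and dt.total_seconds() on it returns the elapsed time in seconds (float). Calling dt.total_seconds() directly on a datetime64[ns] column does not work, because that method is defined only for timedeltas.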
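
A small sketch of the seconds-based conversion, on a hand-made Series rather than the asker's data. It assumes the earliest date is the reference point, and the variable names are illustrative.

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(['2014-04-28', '2014-04-29', '2014-05-01']))

# Subtracting a reference date turns the datetime Series into a timedelta Series.
elapsed = dates - dates.min()

seconds = elapsed.dt.total_seconds()   # floats, in seconds
days = seconds / 86400.0               # same quantity expressed in days

print(seconds.tolist())   # [0.0, 86400.0, 259200.0]
print(days.tolist())      # [0.0, 1.0, 3.0]
```

Regressing on seconds instead of days gives the same fit, but the slope is per second, so it is 86400 times smaller than the per-day slope.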
