Original question: http://stackoverflow.com/questions/24588437/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Convert date to float for linear regression on Pandas data frame
Asked by Quetzalcoatl
It seems that for OLS linear regression to work well in Pandas, the arguments must be floats. I'm starting with a csv (called "gameAct.csv") of the form:
date, city, players, sales
2014-04-28,London,111,1091.28
2014-04-29,London,100,1100.44
2014-04-28,Paris,87,1001.33
...
I want to perform linear regression of how sales depend on date (as time moves forward, how do sales move?). The problem with my code below seems to be with dates not being float values. I would appreciate help on how to resolve this indexing problem in Pandas.
My current code (it runs, but gives wrong results):
import pandas as pd
from pandas import DataFrame, Series
import statsmodels.formula.api as sm
df = pd.read_csv('gameAct.csv')
df.columns = ['date', 'city', 'players', 'sales']
city_data = df[df['city'] == 'London']
result = sm.ols(formula = 'sales ~ date', data = city_data).fit()
As I vary the city value, I get R^2 = 1 results, which is wrong. I have also tried passing index_col = 0, parse_dates = True when defining the dataframe df, but without success.
I suspect there is a better way to read in such csv files to perform basic regression over dates, and also for more general time series analysis. Help, examples, and resources are appreciated!
Note, with the above code, if I convert the dates index (for a given city) to an array, the values in this array are of the form:
'\xef\xbb\xbf2014-04-28'
How does one produce an AIC analysis over all of the non-sales parameters? (e.g. the result might be that sales depend most linearly on date and city).
Accepted answer by Tom Q.
For this kind of regression, I usually convert the dates or timestamps to an integer number of days since the start of the data.
This does the trick nicely:
import numpy as np
import pandas as pd
import statsmodels.formula.api as sm
df = pd.read_csv('test.csv')
df['date'] = pd.to_datetime(df['date'])
# Days elapsed since the earliest date in the data, as a float
df['date_delta'] = (df['date'] - df['date'].min()) / np.timedelta64(1, 'D')
city_data = df[df['city'] == 'London']
result = sm.ols(formula='sales ~ date_delta', data=city_data).fit()
The advantage of this method is that you're sure of the units involved in the regression (days), whereas an automatic conversion may implicitly use other units, creating confusing coefficients in your linear model. It also allows you to combine data from multiple sales campaigns that started at different times into your regression (say you're interested in effectiveness of a campaign as a function of days into the campaign). You could also pick Jan 1st as your 0 if you're interested in measuring the day of year trend. Picking your own 0 date puts you in control of all that.
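The choice of zero date can be sketched as follows. This is only an illustration: the sample rows and the column names days_since_start, days_into_campaign and days_since_jan1 are hypothetical, and treating each city's first date as its campaign start is an assumption made for the example.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical sample rows in the same shape as the question's csv.
csv_text = """date,city,players,sales
2014-04-28,London,111,1091.28
2014-04-29,London,100,1100.44
2014-05-02,London,95,1120.10
2014-04-30,Paris,87,1001.33
2014-05-01,Paris,90,1010.00
"""
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])

one_day = np.timedelta64(1, "D")

# Zero point 1: the earliest date in the whole data set.
df["days_since_start"] = (df["date"] - df["date"].min()) / one_day

# Zero point 2: each city's own first date (assumed here to be its campaign start).
city_start = df.groupby("city")["date"].transform("min")
df["days_into_campaign"] = (df["date"] - city_start) / one_day

# Zero point 3: January 1st of each row's year, for a day-of-year style trend.
year_start = pd.to_datetime(df["date"].dt.year.astype(str) + "-01-01")
df["days_since_jan1"] = (df["date"] - year_start) / one_day

print(df[["city", "days_since_start", "days_into_campaign", "days_since_jan1"]])
```

Any of these columns could then stand in for date_delta in the regression formula, and the slope would be in sales per day in each case.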
There's also evidence that statsmodels supports timeseries from pandas. You may be able to apply this to linear models as well: http://statsmodels.sourceforge.net/stable/examples/generated/ex_dates.html
Also, a quick note: You should be able to read column names directly out of the csv automatically as in the sample code I posted. In your example I see there are spaces between the commas in the first line of the csv file, resulting in column names like ' date'. Remove the spaces and automatic csv header reading should just work.
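If editing the file is not convenient, the header can probably be cleaned up at read time instead. The sketch below is an assumption-laden illustration, not part of the original answer: the in-memory bytes are made up to mimic the question's file, including the '\xef\xbb\xbf' prefix, which is the UTF-8 byte-order mark. It uses the read_csv options encoding="utf-8-sig" and skipinitialspace=True.

```python
import io

import pandas as pd

# Hypothetical file contents: a UTF-8 byte-order mark followed by a header
# row that has spaces after the commas, as in the question.
raw_bytes = (
    b"\xef\xbb\xbf"
    b"date, city, players, sales\n"
    b"2014-04-28,London,111,1091.28\n"
    b"2014-04-29,London,100,1100.44\n"
)

# Default parsing keeps the space after each comma, giving names such as ' city'.
naive = pd.read_csv(io.BytesIO(raw_bytes))
print(list(naive.columns))

# encoding="utf-8-sig" drops the byte-order mark;
# skipinitialspace=True drops spaces that follow a delimiter.
clean = pd.read_csv(
    io.BytesIO(raw_bytes),
    encoding="utf-8-sig",
    skipinitialspace=True,
    parse_dates=["date"],
)
print(list(clean.columns))
print(clean.dtypes)
```

With clean column names, the explicit df.columns assignment from the question should no longer be needed.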
Answer by Wyrmwood
Answer by dkorsakas
(df.date - df.date.min()).dt.total_seconds()
If the data type of your date column is datetime64[ns], then subtracting a reference date (here the earliest one) gives a timedelta column, and dt.total_seconds() should work on that; it returns the number of seconds as a float.
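A minimal sketch of the seconds-based conversion, assuming the earliest date is used as the reference point. The sample dates are hypothetical. Note that total_seconds() is defined for timedelta values, so the subtraction comes first.

```python
import pandas as pd

# Hypothetical dates in the same format as the question's csv.
dates = pd.Series(pd.to_datetime(["2014-04-28", "2014-04-29", "2014-05-01"]))

# Subtracting a reference date yields a timedelta Series,
# which is where the .dt.total_seconds() accessor is available.
elapsed = dates - dates.min()
seconds = elapsed.dt.total_seconds()

# Convert to days if a per-day regression coefficient is easier to read.
days = seconds / 86400  # 86400 seconds per day

print(seconds.tolist())
print(days.tolist())
```

Regressing on seconds gives very small coefficients (sales per second), so dividing down to days or weeks before fitting may make the result easier to interpret.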