Note: this page reproduces a popular Stack Overflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me).
Original question on Stack Overflow: http://stackoverflow.com/questions/28218698/

How to iterate over columns of pandas dataframe to run regression

Tags: python, pandas, statsmodels

Asked by itzy

I'm sure this is simple, but as a complete newbie to Python, I'm having trouble figuring out how to iterate over variables in a pandas DataFrame and run a regression with each.

Here's what I'm doing:

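from pandas import DataFrame
import pandas_datareader.data as web  # assumed import: pandas.io.data moved to the pandas-datareader package
import statsmodels.api as sm

all_data = {}
for ticker in ['FIUIX', 'FSAIX', 'FSAVX', 'FSTMX']:
    all_data[ticker] = web.get_data_yahoo(ticker, '1/1/2010', '1/1/2015')

# dict.iteritems() exists only in Python 2; items() works in Python 3
prices = DataFrame({tic: data['Adj Close'] for tic, data in all_data.items()})
returns = prices.pct_change()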

I know I can run a regression like this:

regs = sm.OLS(returns.FIUIX,returns.FSTMX).fit()

but suppose I want to do this for each column in the dataframe. In particular, I want to regress FIUIX on FSTMX, and then FSAIX on FSTMX, and then FSAVX on FSTMX. After each regression I want to store the residuals.

I've tried various versions of the following, but I must be getting the syntax wrong:

resids = {}
for k in returns.keys():
    reg = sm.OLS(returns[k],returns.FSTMX).fit()
    resids[k] = reg.resid

I think the problem is that I don't know how to refer to the returns column by key, so returns[k] is probably wrong.

Any guidance on the best way to do this would be much appreciated. Perhaps there's a common pandas approach I'm missing.

Accepted answer by The Unfun Cat

for column in df:
    print(df[column])

Answer by JAB

You can index DataFrame columns by position. The original answer used .ix, which was deprecated in pandas 0.20 and removed in pandas 1.0; .iloc is the positional equivalent.

df1.iloc[:, 1]

This returns the column at position 1, which is the second column because positions are zero-based.

df1.iloc[0]

This returns the first row.

df1.iloc[0, 1]

This is the value at the intersection of row 0 and column 1, and so on. So you can enumerate() over returns.keys() and use the number to index the DataFrame.
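
As a rough sketch of the positional indexing described in this answer, on a small made-up frame (df1 and its column names a, b, c are illustrative assumptions):

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

# iloc takes zero-based integer positions.
second_column = df1.iloc[:, 1]   # column "b"
first_row = df1.iloc[0]          # row 0 as a Series
cell = df1.iloc[0, 1]            # value at row 0, column 1

# Pair each column name with its position, then index by position.
for position, name in enumerate(df1.keys()):
    print(position, name, df1.iloc[:, position].tolist())
```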

Answer by kdauria

A workaround is to transpose the DataFrame and iterate over the rows.

for column_name, column in df.transpose().iterrows():
    print(column_name)

Answer by mdh

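You can use items() (the original answer used iteritems(), which was deprecated in pandas 1.5 and removed in pandas 2.0):

for name, values in df.items():
    # values.iloc[0] is the first value by position, whatever the index labels are
    print('{name}: {value}'.format(name=name, value=values.iloc[0]))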

Answer by MEhsan

Using a list comprehension, you can get all the column names (the header):

[column for column in df]

Answer by Gaurav

I'm a bit late but here's how I did this. The steps:

  1. Create a list of all columns
  2. Use itertools to take combinations of x columns
  3. Append each result's R squared value to a result DataFrame, along with the list of excluded columns
  4. Sort the result DF in descending order of R squared to see which is the best fit.

This is the code I used on a DataFrame called aft_tmt. Feel free to adapt it to your use case.

import itertools

import pandas as pd
import statsmodels.formula.api as smf

# setting options to print without truncating output
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)

# This section gets the column names of the DF and removes some columns which I don't want to use as predictors.
itercols = aft_tmt.columns.tolist()
for col in ("sc97", "sc", "grc", "grc97"):
    itercols.remove(col)
print(itercols)
print(len(itercols))

# One row per fitted model. DataFrame.append was removed in pandas 2.0,
# so the results DF is built once after the loop.
rows = []

# change 9 to the number of columns you want to combine from N columns.
# Possibly run an outer loop from 0 to N/2?
for x in itertools.combinations(itercols, 9):
    lmstr = "+".join(x)
    f = smf.ols(formula="sc ~ " + lmstr, data=aft_tmt).fit()
    excluded = "+".join(y for y in itercols if y not in x)
    rows.append([f.rsquared, lmstr, excluded])

# results DF
regression_res = pd.DataFrame(rows, columns=["Rsq", "predictors", "excluded"])

# sort_values returns a new DataFrame, so keep the result
regression_res = regression_res.sort_values(by="Rsq", ascending=False)
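
Since aft_tmt is not shown, here is a smaller self-contained sketch of the same steps on made-up data. The frame, the column names x1, x2, x3, y, and the choice of two predictors per model are all assumptions for illustration.

```python
import itertools

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data: y depends on x1 and x2 only; x3 is pure noise.
rng = np.random.default_rng(1)
data = pd.DataFrame(rng.normal(size=(200, 3)), columns=["x1", "x2", "x3"])
data["y"] = 2.0 * data["x1"] - 1.0 * data["x2"] + rng.normal(scale=0.1, size=200)

predictors = ["x1", "x2", "x3"]
rows = []
for combo in itertools.combinations(predictors, 2):
    formula = "y ~ " + " + ".join(combo)
    fit = smf.ols(formula=formula, data=data).fit()
    excluded = [p for p in predictors if p not in combo]
    rows.append({
        "Rsq": fit.rsquared,
        "predictors": " + ".join(combo),
        "excluded": ", ".join(excluded),
    })

# Best-fitting combination first.
results = pd.DataFrame(rows).sort_values(by="Rsq", ascending=False)
print(results)
```

Note that the number of combinations grows quickly with the number of columns, so an exhaustive search like this is only practical for small predictor sets.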

Answer by Abhinav Gupta

This answer shows how to iterate over selected columns as well as all columns in a DF.

df.columns gives an Index containing all the column names in the DF. That isn't very helpful if you want to iterate over all the columns, since iterating over df directly already does that, but it comes in handy when you want to iterate over only the columns of your choosing.

Because df.columns supports the same slicing syntax as a Python list, we can slice it according to our needs. For example, to iterate over all columns but the first one, we can do:

for column in df.columns[1:]:
    print(df[column])

Similarly, to iterate over all the columns in reverse order, we can do:

for column in df.columns[::-1]:
    print(df[column])

We can iterate over all the columns in a lot of cool ways using this technique. Also remember that you can get the indices of all columns easily using:

for ind, column in enumerate(df.columns):
    print(ind, column)
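
The slicing patterns in this answer can be checked on a small frame like the one below. The frame and its column names are made up for illustration, and Index.drop is shown as one further way to select columns by name, which the original answer does not mention.

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2], "a": [0.1, 0.2], "b": [0.3, 0.4]})

# df.columns is a pandas Index, which supports list-style slicing.
all_but_first = list(df.columns[1:])
reversed_order = list(df.columns[::-1])

# Index.drop removes columns by name rather than by position.
without_b = list(df.columns.drop("b"))

for column in df.columns.drop("id"):
    print(column, df[column].mean())
```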

Answer by Herpes Free Engineer

Based on the accepted answer, if an index corresponding to each column is also desired:

for i, column in enumerate(df):
    print(i, df[column])

Above, df[column] is a Series, which can easily be converted into a NumPy ndarray:

import numpy as np

for i, column in enumerate(df):
    print(i, np.asarray(df[column]))
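
A small runnable sketch of the conversion, on a made-up frame. Series.to_numpy() is used here as an alternative to np.asarray; it is not part of the original answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})

arrays = {}
for i, column in enumerate(df):
    # Series.to_numpy() returns the column's values as a NumPy ndarray;
    # np.asarray(df[column]) gives the same values.
    arrays[column] = df[column].to_numpy()
    print(i, column, arrays[column])
```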