Linear regression with dummy/categorical variables (pandas)
Disclaimer: this page reproduces a popular StackOverflow question under the CC BY-SA 4.0 license. You are free to use/share it, but you must do so under the same CC BY-SA license and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/50733014/
Asked by Héctor Alonso
I have a set of data. I have used pandas to convert the columns into dummy and categorical variables respectively. Now I want to know how to run a multiple linear regression in Python (I am using statsmodels). Are there any special considerations, or do I have to indicate in my code somehow that the variables are dummy/categorical? Or is the transformation of the variables enough, so that I can just run the regression as model = sm.OLS(y, X).fit()?
My code is the following:
import pandas as pd

datos = pd.read_csv("datos_2.csv")
df = pd.DataFrame(datos)
print(df)
I get this:
Age Gender Wage Job Classification
32 Male 450000 Professor High
28 Male 500000 Administrative High
40 Female 20000 Professor Low
47 Male 70000 Assistant Medium
50 Female 345000 Professor Medium
27 Female 156000 Assistant Low
56 Male 432000 Administrative Low
43 Female 100000 Administrative Low
Then I encode 1 = Male, 0 = Female and 1: Professor, 2: Administrative, 3: Assistant, like this:
df['Sex_male']=df.Gender.map({'Female':0,'Male':1})
df['Job_index']=df.Job.map({'Professor':1,'Administrative':2,'Assistant':3})
print(df)
Getting this:
Age Gender Wage Job Classification Sex_male Job_index
32 Male 450000 Professor High 1 1
28 Male 500000 Administrative High 1 2
40 Female 20000 Professor Low 0 1
47 Male 70000 Assistant Medium 1 3
50 Female 345000 Professor Medium 0 1
27 Female 156000 Assistant Low 0 3
56 Male 432000 Administrative Low 1 2
43 Female 100000 Administrative Low 0 2
Now, if I run a multiple linear regression, for example:
import statsmodels.api as sm

y = df['Wage']
X = df[['Sex_male', 'Job_index', 'Age']]   # the encoded columns added above
X = sm.add_constant(X)                     # add an intercept term
model1 = sm.OLS(y, X).fit()
results1 = model1.summary(alpha=0.05)
print(results1)
The results display normally, but is this fine? Or do I have to indicate somehow that the variables are dummy or categorical? Please help, I am new to Python and I want to learn. Greetings from South America - Chile.
Answered by Harvey
In linear regression with categorical variables you should be careful of the Dummy Variable Trap. The Dummy Variable Trap is a scenario in which the independent variables are multicollinear - that is, two or more variables are highly correlated; in simple terms, one variable can be predicted from the others. This can produce singularity in the model, meaning your model just won't work. Read about it here.
The idea is to use dummy variable encoding with drop_first=True, which omits one column from each category after converting the categorical variable into dummy/indicator variables. You WILL NOT lose any relevant information by doing that, simply because every point in your dataset can be fully explained by the rest of the features.
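A minimal sketch of the trap and the fix, using only a toy Gender column (the name and labels mirror the question's data):

import pandas as pd

g = pd.Series(['Male', 'Male', 'Female', 'Male'], name='Gender')

# Full one-hot encoding: the dummy columns of a variable always sum to 1,
# which is exactly collinear with the model's intercept column
full = pd.get_dummies(g, dtype=int)
print(full.sum(axis=1).unique())    # [1]

# drop_first=True drops one level ('Female' here) and breaks the collinearity;
# 'Female' is still fully recoverable as Male == 0
reduced = pd.get_dummies(g, drop_first=True, dtype=int)
print(list(reduced.columns))        # ['Male']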
Here is the complete code for how you can do it with your jobs dataset.
So you have your X features:
Age, Gender, Job, Classification
And one numerical feature that you are trying to predict:
Wage
First you need to split your initial dataset into input variables and the prediction target; assuming it's a pandas DataFrame, it would look like this:
Input variables (your dataset is a bit different, but the whole code remains the same: you put every column from the dataset into X except the one that goes into Y. pd.get_dummies works fine that way - it will only convert the categorical variables and won't touch the numerical ones):
X = jobs[['Age','Gender','Job','Classification']]
Prediction:
Y = jobs['Wage']
Convert the categorical variables into dummy/indicator variables, dropping one level in each category:
X = pd.get_dummies(data=X, drop_first=True)
So now if you check the shape of X (X.shape) after encoding with drop_first=True, you will see that it has one column fewer for each of your categorical variables (here Gender, Job, and Classification) than full one-hot encoding would.
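For the question's own columns the result would look something like this (a sketch; the exact names depend on the labels in your CSV, and the dropped baseline is the alphabetically first level of each variable):

# 'Age' is numeric and passes through untouched
X = pd.get_dummies(df[['Age', 'Gender', 'Job', 'Classification']], drop_first=True)
print(list(X.columns))
# ['Age', 'Gender_Male', 'Job_Assistant', 'Job_Professor',
#  'Classification_Low', 'Classification_Medium']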
You can now continue to use these in your linear model. For a scikit-learn implementation it could look like this:
from sklearn import linear_model
from sklearn.model_selection import train_test_split

# hold out 20% of the rows for evaluation
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.20, random_state=40)

# do not use fit_intercept=False if you have dropped a column after dummy encoding
regr = linear_model.LinearRegression()
regr.fit(X_train, Y_train)
predicted = regr.predict(X_test)
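If you want a quick check of the fit on the held-out rows, LinearRegression also provides score(), which returns R²; a short follow-up sketch:

print(regr.score(X_test, Y_test))   # R^2 on the held-out 20%
print(regr.intercept_, regr.coef_)  # fitted intercept and one coefficient per column of X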
Answered by andrew_reece
You'll need to indicate that either Job or Job_index is a categorical variable; otherwise, in the case of Job_index it will be treated as a continuous variable (which just happens to take the values 1, 2, and 3), which isn't right.
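To make the difference concrete, here is a hedged sketch contrasting the two treatments on the df built in the question (same data, very different models):

import statsmodels.formula.api as smf

# Job_index as a number: a single slope, so the model is forced to assume
# the effect of Assistant (3) is exactly three times that of Professor (1)
numeric_fit = smf.ols('Wage ~ Job_index + Age', data=df).fit()

# Job_index wrapped in C(): one free coefficient per level, no ordering imposed
categorical_fit = smf.ols('Wage ~ C(Job_index) + Age', data=df).fit()

print(numeric_fit.params)
print(categorical_fit.params)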
You can use a few different kinds of notation in statsmodels; here's the formula approach, which uses C() to indicate a categorical variable:
from statsmodels.formula.api import ols
fit = ols('Wage ~ C(Sex_male) + C(Job) + Age', data=df).fit()
fit.summary()
OLS Regression Results
==============================================================================
Dep. Variable: Wage R-squared: 0.592
Model: OLS Adj. R-squared: 0.048
Method: Least Squares F-statistic: 1.089
Date: Wed, 06 Jun 2018 Prob (F-statistic): 0.492
Time: 22:35:43 Log-Likelihood: -104.59
No. Observations: 8 AIC: 219.2
Df Residuals: 3 BIC: 219.6
Df Model: 4
Covariance Type: nonrobust
=======================================================================================
coef std err t P>|t| [0.025 0.975]
---------------------------------------------------------------------------------------
Intercept 3.67e+05 3.22e+05 1.141 0.337 -6.57e+05 1.39e+06
C(Sex_male)[T.1] 2.083e+05 1.39e+05 1.498 0.231 -2.34e+05 6.51e+05
C(Job)[T.Assistant] -2.167e+05 1.77e+05 -1.223 0.309 -7.8e+05 3.47e+05
C(Job)[T.Professor] -9273.0556 1.61e+05 -0.058 0.958 -5.21e+05 5.03e+05
Age -3823.7419 6850.345 -0.558 0.616 -2.56e+04 1.8e+04
==============================================================================
Omnibus: 0.479 Durbin-Watson: 1.620
Prob(Omnibus): 0.787 Jarque-Bera (JB): 0.464
Skew: -0.108 Prob(JB): 0.793
Kurtosis: 1.839 Cond. No. 215.
==============================================================================
Note: Job and Job_index won't use the same categorical level as a baseline, so you'll see slightly different results for the dummy coefficients at each level, even though the overall model fit remains the same.
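If you do want both encodings to share a baseline, patsy's Treatment contrast lets you pick the reference level explicitly inside the formula; a sketch, assuming 'Administrative' as the baseline:

# pin the baseline level for Job instead of letting patsy pick the first sorted level
fit = ols("Wage ~ C(Sex_male) + C(Job, Treatment(reference='Administrative')) + Age",
          data=df).fit()
fit.summary()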