pandas 由于“完美分离错误”而无法运行逻辑回归

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/34668868/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-09-14 00:28:26  来源:igfitidea点击:

Unable to run logistic regression due to "perfect separation error"

pythonnumpypandasmatplotliblogistic-regression

提问by Ajay Gopalan

I'm a beginner to data analysis in Python and have been having trouble with this particular assignment. I've searched quite widely, but have not been able to identify what's wrong.

我是 Python 数据分析的初学者,并且在这个特定的任务中遇到了麻烦。我已经进行了相当广泛的搜索,但无法确定出了什么问题。

I imported a file and set it up as a dataframe. Cleaned the data within the file. However, when I try to fit my model to the data, I get a

我导入了一个文件并将其设置为数据框。清理了文件中的数据。但是,当我尝试将我的模型拟合到数据时,我得到一个

Perfect separation detected, results not available

检测到完美分离,结果不可用

Here is the code:

这是代码:

from scipy import stats
import numpy as np
import pandas as pd 
import collections
import matplotlib.pyplot as plt
import statsmodels.api as sm

loansData = pd.read_csv('https://spark-   public.s3.amazonaws.com/dataanalysis/loansData.csv')

loansData = loansData.to_csv('loansData_clean.csv', header=True, index=False)

## cleaning the file
loansData['Interest.Rate'] = loansData['Interest.Rate'].map(lambda x:  round(float(x.rstrip('%')) / 100, 4))
loanlength = loansData['Loan.Length'].map(lambda x: x.strip('months'))
loansData['FICO.Range'] = loansData['FICO.Range'].map(lambda x: x.split('-'))
loansData['FICO.Range'] = loansData['FICO.Range'].map(lambda x: int(x[0]))
loansData['FICO.Score'] = loansData['FICO.Range']

#add interest rate less than column and populate
## we only care about interest rates less than 12%
loansData['IR_TF'] = pd.Series('', index=loansData.index)
loansData['IR_TF'] = loansData['Interest.Rate'].map(lambda x: True if x < 12 else False)

#create intercept column
loansData['Intercept'] = pd.Series(1.0, index=loansData.index)

# create list of ind var col names
ind_vars = ['FICO.Score', 'Amount.Requested', 'Intercept'] 

#define logistic regression
logit = sm.Logit(loansData['IR_TF'], loansData[ind_vars])

#fit the model
result = logit.fit()

#get fitted coef
coeff = result.params

print coeff

Any help would be much appreciated!

任何帮助将非常感激!

Thx, A

谢谢,A

回答by Happy001

You have PerfectSeparationErrorbecause your loansData['IR_TF'] only has a single value True(or 1). You first converted interest rate from % to decimal, so you should define IR_TF as

你有,PerfectSeparationError因为你的贷款数据 ['IR_TF'] 只有一个值True(或 1)。您首先将利率从 % 转换为十进制,因此您应该将 IR_TF 定义为

loansData['IR_TF'] = loansData['Interest.Rate'] < 0.12 #not 12, and you don't need .map 

Then your regression will run successfully:

然后您的回归将成功运行:

Optimization terminated successfully.
         Current function value: 0.319503
         Iterations 8
FICO.Score           0.087423
Amount.Requested    -0.000174
Intercept          -60.125045
dtype: float64

Also, I noticed various places that can be made easier to read and/or gain some performance improvements in particular .mapmight not be as fast as vectorized calculations. Here are my suggestions:

此外,我注意到许多可以更容易阅读和/或获得一些性能改进的地方.map可能不如矢量化计算那么快。以下是我的建议:

from scipy import stats
import numpy as np
import pandas as pd 
import collections
import matplotlib.pyplot as plt
import statsmodels.api as sm

loansData = pd.read_csv('https://spark-public.s3.amazonaws.com/dataanalysis/loansData.csv')

## cleaning the file
loansData['Interest.Rate'] = loansData['Interest.Rate'].str.rstrip('%').astype(float).round(2) / 100.0

loanlength = loansData['Loan.Length'].str.strip('months')#.astype(int)  --> loanlength not used below

loansData['FICO.Score'] = loansData['FICO.Range'].str.split('-', expand=True)[0].astype(int)

#add interest rate less than column and populate
## we only care about interest rates less than 12%
loansData['IR_TF'] = loansData['Interest.Rate'] < 0.12

#create intercept column
loansData['Intercept'] = 1.0

# create list of ind var col names
ind_vars = ['FICO.Score', 'Amount.Requested', 'Intercept'] 

#define logistic regression
logit = sm.Logit(loansData['IR_TF'], loansData[ind_vars])

#fit the model
result = logit.fit()

#get fitted coef
coeff = result.params

#print coeff
print result.summary() #result has more information


Logit Regression Results                           
==============================================================================
Dep. Variable:                  IR_TF   No. Observations:                 2500
Model:                          Logit   Df Residuals:                     2497
Method:                           MLE   Df Model:                            2
Date:                Thu, 07 Jan 2016   Pseudo R-squ.:                  0.5243
Time:                        23:15:54   Log-Likelihood:                -798.76
converged:                       True   LL-Null:                       -1679.2
                                        LLR p-value:                     0.000
====================================================================================
                       coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------------
FICO.Score           0.0874      0.004     24.779      0.000         0.081     0.094
Amount.Requested    -0.0002    1.1e-05    -15.815      0.000        -0.000    -0.000
Intercept          -60.1250      2.420    -24.840      0.000       -64.869   -55.381
====================================================================================

By the way -- is this P2P lending data?

顺便说一句——这是P2P借贷数据吗?