pandas: Unable to run logistic regression due to "perfect separation error"
Note: this page is derived from a popular Stack Overflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): Stack Overflow
Original URL: http://stackoverflow.com/questions/34668868/
Asked by Ajay Gopalan
I'm a beginner to data analysis in Python and have been having trouble with this particular assignment. I've searched quite widely, but have not been able to identify what's wrong.
I imported a file, set it up as a DataFrame, and cleaned the data within it. However, when I try to fit my model to the data, I get this error:
Perfect separation detected, results not available
Here is the code:
from scipy import stats
import numpy as np
import pandas as pd
import collections
import matplotlib.pyplot as plt
import statsmodels.api as sm
loansData = pd.read_csv('https://spark-public.s3.amazonaws.com/dataanalysis/loansData.csv')
loansData.to_csv('loansData_clean.csv', header=True, index=False)  # to_csv returns None, so do not reassign loansData
## cleaning the file
loansData['Interest.Rate'] = loansData['Interest.Rate'].map(lambda x: round(float(x.rstrip('%')) / 100, 4))
loanlength = loansData['Loan.Length'].map(lambda x: x.strip('months'))
loansData['FICO.Range'] = loansData['FICO.Range'].map(lambda x: x.split('-'))
loansData['FICO.Range'] = loansData['FICO.Range'].map(lambda x: int(x[0]))
loansData['FICO.Score'] = loansData['FICO.Range']
#add interest rate less than column and populate
## we only care about interest rates less than 12%
loansData['IR_TF'] = pd.Series('', index=loansData.index)
loansData['IR_TF'] = loansData['Interest.Rate'].map(lambda x: True if x < 12 else False)
#create intercept column
loansData['Intercept'] = pd.Series(1.0, index=loansData.index)
# create list of ind var col names
ind_vars = ['FICO.Score', 'Amount.Requested', 'Intercept']
#define logistic regression
logit = sm.Logit(loansData['IR_TF'], loansData[ind_vars])
#fit the model
result = logit.fit()
#get fitted coef
coeff = result.params
print(coeff)
Any help would be much appreciated!
Thx, A
Answered by Happy001
You get a PerfectSeparationError because your loansData['IR_TF'] only has a single value, True (or 1). You first converted the interest rate from a percentage to a decimal, so every rate is below 12 and the comparison x < 12 is always True. You should therefore define IR_TF as:
loansData['IR_TF'] = loansData['Interest.Rate'] < 0.12 #not 12, and you don't need .map
Then your regression will run successfully:
Optimization terminated successfully.
Current function value: 0.319503
Iterations 8
FICO.Score 0.087423
Amount.Requested -0.000174
Intercept -60.125045
dtype: float64
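A quick way to confirm this diagnosis before fitting is to count the distinct values in the target column. The sketch below uses a small made-up Series of rates (the values and the names `rates`, `wrong_target` and `fixed_target` are illustrative assumptions, not taken from loansData.csv) to show how the threshold on the wrong scale collapses the target to a single class:

```python
import pandas as pd

# Hypothetical sample: interest rates already converted from percent to decimals.
rates = pd.Series([0.0890, 0.1212, 0.2198, 0.0999, 0.1171])

wrong_target = rates < 12      # threshold on the percent scale: every row is True
fixed_target = rates < 0.12    # threshold on the decimal scale

# A binary target needs both classes present before a logistic model can be fitted.
print(wrong_target.nunique())  # 1
print(fixed_target.nunique())  # 2
print(fixed_target.value_counts())
```

If `nunique()` returns 1 on the real column, the problem lies in how the target was built rather than in the model.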
Also, I noticed several places where the code can be made easier to read and/or faster. In particular, .map might not be as fast as vectorized calculations. Here are my suggestions:
from scipy import stats
import numpy as np
import pandas as pd
import collections
import matplotlib.pyplot as plt
import statsmodels.api as sm
loansData = pd.read_csv('https://spark-public.s3.amazonaws.com/dataanalysis/loansData.csv')
## cleaning the file
loansData['Interest.Rate'] = loansData['Interest.Rate'].str.rstrip('%').astype(float).round(2) / 100.0
loanlength = loansData['Loan.Length'].str.strip('months')#.astype(int) --> loanlength not used below
loansData['FICO.Score'] = loansData['FICO.Range'].str.split('-', expand=True)[0].astype(int)
#add interest rate less than column and populate
## we only care about interest rates less than 12%
loansData['IR_TF'] = loansData['Interest.Rate'] < 0.12
#create intercept column
loansData['Intercept'] = 1.0
# create list of ind var col names
ind_vars = ['FICO.Score', 'Amount.Requested', 'Intercept']
#define logistic regression
logit = sm.Logit(loansData['IR_TF'], loansData[ind_vars])
#fit the model
result = logit.fit()
#get fitted coef
coeff = result.params
#print(coeff)
print(result.summary())  # result has more information
                           Logit Regression Results
==============================================================================
Dep. Variable:                  IR_TF   No. Observations:                 2500
Model:                          Logit   Df Residuals:                     2497
Method:                           MLE   Df Model:                            2
Date:                Thu, 07 Jan 2016   Pseudo R-squ.:                  0.5243
Time:                        23:15:54   Log-Likelihood:                -798.76
converged:                       True   LL-Null:                       -1679.2
                                        LLR p-value:                     0.000
====================================================================================
                       coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------------
FICO.Score           0.0874      0.004     24.779      0.000         0.081     0.094
Amount.Requested    -0.0002    1.1e-05    -15.815      0.000        -0.000    -0.000
Intercept          -60.1250      2.420    -24.840      0.000       -64.869   -55.381
====================================================================================
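To check that the vectorized cleaning steps are a drop-in replacement for the .map versions, here is a minimal sketch on a few made-up rows. The sample values and the variable names are assumptions for illustration; only the column names and string formats follow the code above. It compares results, not speed:

```python
import numpy as np
import pandas as pd

# Hypothetical rows in the same string format as the loans file.
sample = pd.DataFrame({
    'Interest.Rate': ['8.90%', '12.12%', '21.98%'],
    'FICO.Range': ['735-739', '715-719', '690-694'],
})

# Element-wise version, as in the question.
rate_map = sample['Interest.Rate'].map(lambda x: round(float(x.rstrip('%')) / 100, 4))
fico_map = sample['FICO.Range'].map(lambda x: int(x.split('-')[0]))

# Vectorized version, as in the suggestions above.
rate_vec = sample['Interest.Rate'].str.rstrip('%').astype(float).round(2) / 100.0
fico_vec = sample['FICO.Range'].str.split('-', expand=True)[0].astype(int)

# Compare floats with a tolerance; the two routes may differ in the last bit.
print(np.allclose(rate_map, rate_vec))
print(fico_map.tolist() == fico_vec.tolist())
```

Both comparisons should agree on this sample, so the choice between the two styles comes down to readability and performance.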
By the way -- is this P2P lending data?