pandas: unable to run logistic regression due to "perfect separation error"

Question

Asked by Ajay Gopalan

I'm a beginner to data analysis in Python and have been having trouble with this particular assignment. I've searched quite widely, but have not been able to identify what's wrong.

I imported a file, set it up as a dataframe, and cleaned the data. However, when I try to fit my model to the data, I get this error:

Perfect separation detected, results not available

Here is the code:

from scipy import stats
import numpy as np
import pandas as pd 
import collections
import matplotlib.pyplot as plt
import statsmodels.api as sm

loansData = pd.read_csv('https://spark-public.s3.amazonaws.com/dataanalysis/loansData.csv')

# to_csv returns None when given a path, so do not assign its result back to loansData
loansData.to_csv('loansData_clean.csv', header=True, index=False)

## cleaning the file
loansData['Interest.Rate'] = loansData['Interest.Rate'].map(lambda x:  round(float(x.rstrip('%')) / 100, 4))
loanlength = loansData['Loan.Length'].map(lambda x: x.strip('months'))
loansData['FICO.Range'] = loansData['FICO.Range'].map(lambda x: x.split('-'))
loansData['FICO.Range'] = loansData['FICO.Range'].map(lambda x: int(x[0]))
loansData['FICO.Score'] = loansData['FICO.Range']

#add interest rate less than column and populate
## we only care about interest rates less than 12%
loansData['IR_TF'] = pd.Series('', index=loansData.index)
loansData['IR_TF'] = loansData['Interest.Rate'].map(lambda x: True if x < 12 else False)

#create intercept column
loansData['Intercept'] = pd.Series(1.0, index=loansData.index)

# create list of ind var col names
ind_vars = ['FICO.Score', 'Amount.Requested', 'Intercept'] 

#define logistic regression
logit = sm.Logit(loansData['IR_TF'], loansData[ind_vars])

#fit the model
result = logit.fit()

#get fitted coef
coeff = result.params

print(coeff)

Any help would be much appreciated!

Thx, A

Answer 1

Answered by Happy001

You get a PerfectSeparationError because your loansData['IR_TF'] contains only a single value, True (or 1). You first converted the interest rate from a percentage to a decimal, so every rate is below 12. You should define IR_TF as

loansData['IR_TF'] = loansData['Interest.Rate'] < 0.12 #not 12, and you don't need .map

Then your regression will run successfully:

Optimization terminated successfully.
         Current function value: 0.319503
         Iterations 8
FICO.Score           0.087423
Amount.Requested    -0.000174
Intercept          -60.125045
dtype: float64
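
As a quick check of this diagnosis, here is a minimal sketch that uses a handful of made-up rate strings (illustrative values, not rows from the loans file). It assumes the same percent-to-decimal cleaning as in the question and shows that comparing against 12 leaves only one class in the target, while comparing against 0.12 leaves two.

```python
import pandas as pd

# Made-up rates in the same "x.xx%" string format as the loans file (illustrative only).
rates = pd.Series(['8.90%', '12.12%', '21.98%', '9.99%'])

# Same cleaning idea as in the question: percent string -> decimal fraction.
decimal_rates = rates.str.rstrip('%').astype(float) / 100

# Every decimal rate is below 12, so this target contains a single class.
wrong_target = decimal_rates < 12

# Comparing against 0.12 produces both True and False.
right_target = decimal_rates < 0.12

# value_counts() is a convenient way to inspect the class balance before fitting.
print(wrong_target.value_counts())
print(right_target.value_counts())
```

Running value_counts() on the dependent variable before calling fit() is a cheap way to catch this kind of mistake early.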

Also, I noticed several places where the code can be made easier to read and/or faster. In particular, .map might not be as fast as vectorized calculations. Here are my suggestions:

from scipy import stats
import numpy as np
import pandas as pd 
import collections
import matplotlib.pyplot as plt
import statsmodels.api as sm

loansData = pd.read_csv('https://spark-public.s3.amazonaws.com/dataanalysis/loansData.csv')

## cleaning the file
loansData['Interest.Rate'] = loansData['Interest.Rate'].str.rstrip('%').astype(float).round(2) / 100.0

loanlength = loansData['Loan.Length'].str.strip('months')#.astype(int)  --> loanlength not used below

loansData['FICO.Score'] = loansData['FICO.Range'].str.split('-', expand=True)[0].astype(int)

#add interest rate less than column and populate
## we only care about interest rates less than 12%
loansData['IR_TF'] = loansData['Interest.Rate'] < 0.12

#create intercept column
loansData['Intercept'] = 1.0

# create list of ind var col names
ind_vars = ['FICO.Score', 'Amount.Requested', 'Intercept'] 

#define logistic regression
logit = sm.Logit(loansData['IR_TF'], loansData[ind_vars])

#fit the model
result = logit.fit()

#get fitted coef
coeff = result.params

#print(coeff)
print(result.summary()) #result has more information


Logit Regression Results                           
==============================================================================
Dep. Variable:                  IR_TF   No. Observations:                 2500
Model:                          Logit   Df Residuals:                     2497
Method:                           MLE   Df Model:                            2
Date:                Thu, 07 Jan 2016   Pseudo R-squ.:                  0.5243
Time:                        23:15:54   Log-Likelihood:                -798.76
converged:                       True   LL-Null:                       -1679.2
                                        LLR p-value:                     0.000
====================================================================================
                       coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------------
FICO.Score           0.0874      0.004     24.779      0.000         0.081     0.094
Amount.Requested    -0.0002    1.1e-05    -15.815      0.000        -0.000    -0.000
Intercept          -60.1250      2.420    -24.840      0.000       -64.869   -55.381
====================================================================================
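
To read these coefficients as probabilities, one option is to plug them into the logistic function by hand. The sketch below copies the fitted values from the output above. The function name prob_rate_below_12 and the example inputs (a FICO score of 720 and a request of 10,000) are hypothetical and chosen only for illustration.

```python
import math

# Coefficients copied from the fitted output above.
COEF = {
    'Intercept': -60.125045,
    'FICO.Score': 0.087423,
    'Amount.Requested': -0.000174,
}

def prob_rate_below_12(fico_score, amount_requested):
    """Estimated probability that the interest rate is below 12% (hypothetical helper)."""
    # Linear predictor: intercept plus each coefficient times its variable.
    z = (COEF['Intercept']
         + COEF['FICO.Score'] * fico_score
         + COEF['Amount.Requested'] * amount_requested)
    # The logistic function maps the linear predictor to a value in (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

# Example inputs are made up for illustration.
print(prob_rate_below_12(720, 10000))
```

With the fitted model in hand, result.predict on a dataframe with the same columns should give the same numbers without retyping the coefficients.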

By the way -- is this P2P lending data?
