Original question: http://stackoverflow.com/questions/38077190/
Warning: this translation is provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverflow
How to increase the model accuracy of logistic regression in Scikit python?
Asked by Aby Mathew
I am trying to predict the admit variable with predictors such as gre, gpa and rank. But the prediction accuracy is very low (0.66). The dataset is given below: https://gist.github.com/abyalias/3de80ab7fb93dcecc565cee21bd9501a
Please find the code below:
In[73]: data.head(20)
Out[73]:
admit gre gpa rank_2 rank_3 rank_4
0 0 380 3.61 0.0 1.0 0.0
1 1 660 3.67 0.0 1.0 0.0
2 1 800 4.00 0.0 0.0 0.0
3 1 640 3.19 0.0 0.0 1.0
4 0 520 2.93 0.0 0.0 1.0
5 1 760 3.00 1.0 0.0 0.0
6 1 560 2.98 0.0 0.0 0.0
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split  # (formerly sklearn.cross_validation)

y = data['admit']
x = data[data.columns[1:]]
xtrain, xtest, ytrain, ytest = train_test_split(x, y, random_state=2)
ytrain = np.ravel(ytrain)

# modelling
clf = LogisticRegression(penalty='l2')
clf.fit(xtrain, ytrain)
ypred_train = clf.predict(xtrain)
ypred_test = clf.predict(xtest)
In[38]: #checking the classification accuracy
accuracy_score(ytrain,ypred_train)
Out[38]: 0.70333333333333337
In[39]: accuracy_score(ytest,ypred_test)
Out[39]: 0.66000000000000003
In[78]: # confusion matrix
from sklearn.metrics import confusion_matrix
confusion_matrix(ytest, ypred_test)
Out[78]:
array([[62, 1],
[33, 4]])
The ones (admit = 1) are mostly being predicted wrongly. How can I increase the model accuracy?
Answered by Abhinav Arora
Since machine learning is more about experimenting with the features and the models, there is no correct answer to your question. Some of my suggestions to you would be:
1. Feature Scaling and/or Normalization - Check the scales of your gre and gpa features. They differ by 2 orders of magnitude. Therefore, your gre feature will end up dominating the others in a classifier like Logistic Regression. You can normalize all your features to the same scale before putting them into a machine learning model. This is a good guide on the various feature scaling and normalization classes available in scikit-learn.
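As a rough sketch (reusing xtrain/xtest from the question; the scaler is fitted on the training split only, to avoid leaking test-set statistics):

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
# learn mean/std from the training data, apply the same transform to the test data
xtrain_scaled = scaler.fit_transform(xtrain)
xtest_scaled = scaler.transform(xtest)

clf = LogisticRegression(penalty='l2')
clf.fit(xtrain_scaled, ytrain)
print(clf.score(xtest_scaled, ytest))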
2. Class Imbalance - Look for class imbalance in your data. Since you are working with admit/reject data, the number of rejects is likely to be significantly higher than the admits. Most classifiers in SkLearn, including LogisticRegression, have a class_weight parameter. Setting that to balanced might also work well in case of a class imbalance.
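For example (same splits as above; how much this helps depends on how imbalanced admit really is in your data):

# reweights the loss inversely proportional to class frequencies,
# so the minority class (admits) counts more during training
clf_balanced = LogisticRegression(penalty='l2', class_weight='balanced')
clf_balanced.fit(xtrain, ytrain)
print(clf_balanced.score(xtest, ytest))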
3. Optimize other scores - You can also optimize other metrics such as Log Loss and F1-Score. The F1-Score can be useful in case of class imbalance. This is a good guide that talks more about scoring.
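Something along these lines (reusing the fitted clf and the splits from the question):

from sklearn.metrics import f1_score, log_loss

ypred_test = clf.predict(xtest)
print(f1_score(ytest, ypred_test))                 # balances precision and recall on the positive class
print(log_loss(ytest, clf.predict_proba(xtest)))   # penalizes confident wrong probability estimates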
4. Hyperparameter Tuning - Grid Search - You can improve your accuracy by performing a grid search to tune the hyperparameters of your model. For example, in the case of LogisticRegression, the parameter C is a hyperparameter. Also, you should avoid using the test data during grid search; perform cross validation instead. Use your test data only to report the final numbers for your final model. Please note that grid search should be done for all models that you try, because only then will you be able to tell what is the best you can get from each model. Scikit-Learn provides the GridSearchCV class for this. This article is also a good starting point.
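A minimal sketch (the values in the C grid are illustrative, not tuned for this dataset):

from sklearn.model_selection import GridSearchCV

param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100]}
# 5-fold cross validation on the training data only; the test set stays untouched
grid = GridSearchCV(LogisticRegression(penalty='l2'), param_grid, cv=5)
grid.fit(xtrain, ytrain)
print(grid.best_params_, grid.best_score_)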
5. Explore more classifiers - Logistic Regression learns a linear decision surface that separates your classes. It is possible that your 2 classes are not linearly separable. In such a case you might need to look at other classifiers, such as Support Vector Machines, which are able to learn more complex decision boundaries. You can also start looking at tree-based classifiers such as Decision Trees, which can learn rules from your data. Think of them as a series of If-Else rules which the algorithm automatically learns from the data. Often, it is difficult to get the right Bias-Variance Tradeoff with Decision Trees, so I would recommend you look at Random Forests if you have a considerable amount of data.
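For illustration, with default hyperparameters (each of these would still need its own grid search as in point 4, and the SVM would also benefit from the scaling in point 1):

from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

for model in (SVC(), RandomForestClassifier(random_state=2)):
    model.fit(xtrain, ytrain)
    # score() reports plain accuracy on the held-out test set
    print(type(model).__name__, model.score(xtest, ytest))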
6. Error Analysis - For each of your models, go back and look at the cases where they are failing. You might end up finding that some of your models work well on one part of the parameter space while others work better on other parts. If this is the case, then ensemble techniques such as the VotingClassifier often give the best results. Models that win Kaggle competitions are often ensemble models.
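A rough sketch of such an ensemble (the estimator list here is just illustrative):

from sklearn.ensemble import VotingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# majority vote over heterogeneous models; voting='soft' would average
# predicted probabilities instead of counting hard votes
ensemble = VotingClassifier(estimators=[
    ('lr', LogisticRegression(penalty='l2')),
    ('rf', RandomForestClassifier(random_state=2)),
], voting='hard')
ensemble.fit(xtrain, ytrain)
print(ensemble.score(xtest, ytest))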
7. More Features - If all of this fails, then that means you should start looking for more features.
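One way to derive extra features from the columns you already have is polynomial/interaction terms, though whether this helps here is an open question:

from sklearn.preprocessing import PolynomialFeatures

# adds squared terms and pairwise interactions such as gre*gpa
poly = PolynomialFeatures(degree=2, include_bias=False)
xtrain_poly = poly.fit_transform(xtrain)
xtest_poly = poly.transform(xtest)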
Hope that helps!