Python 逻辑回归中的微调参数
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 
原文地址: http://stackoverflow.com/questions/21816346/
Warning: these are provided under cc-by-sa 4.0 license.  You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
Fine-tuning parameters in Logistic Regression
提问by Simon Kiely
I am running a logistic regression with a tf-idf being ran on a text column. This is the only column I use in my logistic regression. How can I ensure the parameters for this are tuned as well as possible?
我正在运行逻辑回归,在文本列上运行 tf-idf。这是我在逻辑回归中使用的唯一列。我如何确保尽可能好地调整此参数?
I would like to be able to run through a set of steps which would ultimately allow me say that my Logistic Regression classifier is running as well as it possibly can.
我希望能够运行一组步骤,最终让我可以说我的逻辑回归分类器正在尽可能地运行。
from sklearn import metrics,preprocessing,cross_validation
from sklearn.feature_extraction.text import TfidfVectorizer
import sklearn.linear_model as lm
import pandas as p
loadData = lambda f: np.genfromtxt(open(f, 'r'), delimiter=' ')
print "loading data.."
traindata = list(np.array(p.read_table('train.tsv'))[:, 2])
testdata = list(np.array(p.read_table('test.tsv'))[:, 2])
y = np.array(p.read_table('train.tsv'))[:, -1]
tfv = TfidfVectorizer(min_df=3, max_features=None, strip_accents='unicode',
                      analyzer='word', token_pattern=r'\w{1,}', 
                      ngram_range=(1, 2), use_idf=1, smooth_idf=1, 
                      sublinear_tf=1)
rd = lm.LogisticRegression(penalty='l2', dual=True, tol=0.0001, 
                           C=1, fit_intercept=True, intercept_scaling=1.0, 
                           class_weight=None, random_state=None)
X_all = traindata + testdata
lentrain = len(traindata)
print "fitting pipeline"
tfv.fit(X_all)
print "transforming data"
X_all = tfv.transform(X_all)
X = X_all[:lentrain]
X_test = X_all[lentrain:]
print "20 Fold CV Score: ", np.mean(cross_validation.cross_val_score(rd, X, y, cv=20, scoring='roc_auc'))
print "training on full data"
rd.fit(X, y)
pred = rd.predict_proba(X_test)[:, 1]
testfile = p.read_csv('test.tsv', sep="\t", na_values=['?'], index_col=1)
pred_df = p.DataFrame(pred, index=testfile.index, columns=['label'])
pred_df.to_csv('benchmark.csv')
print "submission file created.."
采纳答案by lennon310
You can use grid search to find out the best Cvalue for you. Basically smaller Cspecify stronger regularization. 
您可以使用网格搜索来找出最C适合您的值。基本上较小的C指定更强的正则化。
>>> param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000] }
>>> clf = GridSearchCV(LogisticRegression(penalty='l2'), param_grid)
GridSearchCV(cv=None,
             estimator=LogisticRegression(C=1.0, intercept_scaling=1,   
               dual=False, fit_intercept=True, penalty='l2', tol=0.0001),
             param_grid={'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]})
See the GridSearchCv documentfor more details on your application.
有关您的应用程序的更多详细信息,请参阅GridSearchCv 文档。
回答by viplov
Grid search is a brutal way of finding the optimal parameters because it train and test every possible combination. best way is using bayesian optimization which learns for past evaluation score and takes less computation time.
网格搜索是一种寻找最佳参数的残酷方法,因为它训练和测试每一种可能的组合。最好的方法是使用贝叶斯优化,它学习过去的评估分数并减少计算时间。

