Python: Fine-tuning parameters in logistic regression

Question

Asked by Simon Kiely

I am running a logistic regression on tf-idf features computed from a text column. This is the only column I use in the model. How can I make sure its parameters are tuned as well as possible?

I would like to be able to run through a set of steps that would ultimately allow me to say that my logistic regression classifier is performing as well as it possibly can.

import numpy as np
import pandas as p
import sklearn.linear_model as lm
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score

print("loading data..")
# Column 2 holds the text; the last column of train.tsv holds the label.
traindata = list(np.array(p.read_table('train.tsv'))[:, 2])
testdata = list(np.array(p.read_table('test.tsv'))[:, 2])
# Cast to int so scikit-learn sees class labels (assumes integer labels).
y = np.array(p.read_table('train.tsv'))[:, -1].astype(int)

tfv = TfidfVectorizer(min_df=3, max_features=None, strip_accents='unicode',
                      analyzer='word', token_pattern=r'\w{1,}',
                      ngram_range=(1, 2), use_idf=True, smooth_idf=True,
                      sublinear_tf=True)

# dual=True is only supported by the liblinear solver, so name it explicitly.
rd = lm.LogisticRegression(penalty='l2', dual=True, tol=0.0001,
                           C=1, fit_intercept=True, intercept_scaling=1.0,
                           class_weight=None, random_state=None,
                           solver='liblinear')

# The vectorizer is fitted on the train and test text together.
X_all = traindata + testdata
lentrain = len(traindata)

print("fitting pipeline")
tfv.fit(X_all)
print("transforming data")
X_all = tfv.transform(X_all)

X = X_all[:lentrain]
X_test = X_all[lentrain:]

print("20 Fold CV Score: ", np.mean(cross_val_score(rd, X, y, cv=20, scoring='roc_auc')))

print("training on full data")
rd.fit(X, y)
pred = rd.predict_proba(X_test)[:, 1]
testfile = p.read_csv('test.tsv', sep="\t", na_values=['?'], index_col=1)
pred_df = p.DataFrame(pred, index=testfile.index, columns=['label'])
pred_df.to_csv('benchmark.csv')
print("submission file created..")

Answer 1

Accepted answer by lennon310

You can use grid search to find the best value of C for your data. C is the inverse of the regularization strength, so smaller values of C specify stronger regularization.

>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.model_selection import GridSearchCV
>>> param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]}
>>> clf = GridSearchCV(LogisticRegression(penalty='l2'), param_grid)
>>> # After clf.fit(X, y), the selected value is available as clf.best_params_['C'].

See the GridSearchCV documentation for more details on applying it to your problem.
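
As a rough sketch of how this could extend to the tf-idf settings as well, the vectorizer and the classifier can be wrapped in a Pipeline so that both are tuned in one search. The toy corpus, the step names "tfidf" and "logreg", and the particular grid values below are assumptions made for illustration; they are not taken from the question's data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus standing in for the text column of train.tsv.
texts = [
    "great recipe easy to cook", "tasty dinner recipe idea",
    "simple recipe for fresh bread", "quick recipe with fresh pasta",
    "cook a tasty soup tonight", "fresh salad recipe for lunch",
    "stock market news today", "breaking news market crash",
    "election news and politics", "today in world politics news",
    "market report and politics", "daily news stock update",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# Putting the vectorizer inside the pipeline means it is refitted on the
# training part of every CV split, so no text from the held-out fold leaks in.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(sublinear_tf=True)),
    ("logreg", LogisticRegression(solver="liblinear")),
])

# Parameters are addressed as <step name>__<parameter name>.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "logreg__C": [0.01, 0.1, 1, 10, 100],
}

search = GridSearchCV(pipeline, param_grid, cv=3, scoring="roc_auc")
search.fit(texts, labels)

print("best parameters:", search.best_params_)
print("best CV AUC: %.3f" % search.best_score_)
```

With real data, the same search object can be used directly for prediction after fitting, since GridSearchCV refits the best pipeline on the full training set by default.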

有关您的应用程序的更多详细信息，请参阅GridSearchCv 文档。

Answer 2

Answer by viplov

Grid search is a brute-force way of finding the optimal parameters because it trains and tests every possible combination. A better way is to use Bayesian optimization, which learns from past evaluation scores and usually takes less computation time.
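
To make the idea concrete, here is a minimal hand-rolled sketch of Bayesian optimization over log10(C), assuming a Gaussian process surrogate and an expected-improvement rule for picking the next value to try. It is only an illustration built from scikit-learn and SciPy pieces, not the API of any dedicated optimization library; the synthetic data, the helper names cv_auc and expected_improvement, the search range, and the iteration count are all assumptions.

```python
import numpy as np
from scipy.stats import norm
from sklearn.datasets import make_classification
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data standing in for the tf-idf matrix (an assumption).
X, y = make_classification(n_samples=300, n_features=20,
                           n_informative=5, random_state=0)

def cv_auc(log10_c):
    """Mean 5-fold ROC AUC of logistic regression at C = 10**log10_c."""
    model = LogisticRegression(C=10.0 ** log10_c, solver="liblinear")
    return cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()

def expected_improvement(candidates, gp, best_score):
    """Expected improvement over best_score at each candidate point."""
    mean, std = gp.predict(candidates.reshape(-1, 1), return_std=True)
    std = np.maximum(std, 1e-9)  # guard against division by zero
    z = (mean - best_score) / std
    return (mean - best_score) * norm.cdf(z) + std * norm.pdf(z)

rng = np.random.default_rng(0)
low, high = -4.0, 4.0                     # search C between 1e-4 and 1e4
candidates = np.linspace(low, high, 200)  # points the surrogate is scored on

# Start with a few random evaluations of the expensive objective.
tried = list(rng.uniform(low, high, size=3))
scores = [cv_auc(c) for c in tried]

for _ in range(10):
    # Fit the surrogate to every (log10 C, score) pair seen so far.
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), alpha=1e-4,
                                  normalize_y=True, random_state=0)
    gp.fit(np.array(tried).reshape(-1, 1), scores)
    # Evaluate next wherever the surrogate expects the largest gain.
    ei = expected_improvement(candidates, gp, max(scores))
    next_c = candidates[int(np.argmax(ei))]
    tried.append(next_c)
    scores.append(cv_auc(next_c))

best = int(np.argmax(scores))
best_c = 10.0 ** tried[best]
print("best C: %.4g, CV AUC: %.3f" % (best_c, scores[best]))
print("evaluations used:", len(scores))
```

The loop spends 13 cross-validated fits in total, and each new value of C is chosen using the scores of all earlier ones, which is the sense in which the method learns from past evaluations. Whether that beats a small grid depends on how expensive each fit is and how many parameters are being tuned.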
