Python: How to use k-fold cross-validation in scikit with a naive Bayes classifier and NLTK

Question

Asked by user2284345

I have a small corpus and I want to calculate the accuracy of a naive Bayes classifier using 10-fold cross-validation. How can I do it?

Answer 1

Answered by Jared

Your options are to either set this up yourself or use something like NLTK-Trainer, since NLTK doesn't directly support cross-validation for machine learning algorithms.

I'd recommend just using another module to do this for you, but if you really want to write your own code, you could do something like the following.

Supposing you want 10-fold cross-validation, you would partition your training set into 10 subsets, train on 9/10, test on the remaining 1/10, and repeat this 10 times so that each subset serves as the test set exactly once.

Assuming your training set is in a list named training, a simple way to accomplish this would be:

num_folds = 10
subset_size = len(training) // num_folds  # integer division (plain / yields a float on Python 3)
accuracies = []
for i in range(num_folds):
    start, end = i * subset_size, (i + 1) * subset_size
    testing_this_round = training[start:end]
    training_this_round = training[:start] + training[end:]
    # train using training_this_round
    # evaluate against testing_this_round
    # save the accuracy in accuracies

# find mean accuracy over all rounds
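
To make the skeleton above concrete, here is a minimal, self-contained sketch of the full loop. The names cross_validate, train_fn, and accuracy_fn are hypothetical helpers, and the majority-class baseline is only a stand-in so the example runs without NLTK. With NLTK you would presumably pass nltk.NaiveBayesClassifier.train and nltk.classify.accuracy instead.

```python
from collections import Counter


def cross_validate(data, num_folds, train_fn, accuracy_fn):
    """Return one accuracy per fold for a list of (features, label) pairs."""
    subset_size = len(data) // num_folds  # leftover items always stay in training
    accuracies = []
    for i in range(num_folds):
        start, end = i * subset_size, (i + 1) * subset_size
        testing_this_round = data[start:end]
        training_this_round = data[:start] + data[end:]
        model = train_fn(training_this_round)
        accuracies.append(accuracy_fn(model, testing_this_round))
    return accuracies


# Stand-in "classifier" (an assumption for illustration): always predict
# the most common label seen in training.
def train_majority(labeled):
    return Counter(label for _, label in labeled).most_common(1)[0][0]


def majority_accuracy(predicted_label, labeled):
    return sum(label == predicted_label for _, label in labeled) / len(labeled)


# Toy corpus: 20 labeled items with alternating 'pos' and 'neg' labels.
training = [({"id": n}, "pos" if n % 2 == 0 else "neg") for n in range(20)]
accuracies = cross_validate(training, 10, train_majority, majority_accuracy)
mean_accuracy = sum(accuracies) / len(accuracies)
print(mean_accuracy)
```

Passing the training and scoring functions as arguments keeps the fold logic independent of the classifier, so the same loop can be reused when the model changes.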

Answer 2

Answered by user2284345

I used both libraries, NLTK for naive Bayes and sklearn for cross-validation, as follows:

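import nltk
from sklearn.model_selection import KFold  # sklearn.cross_validation was removed in scikit-learn 0.20

# extract_features and documents are assumed to be defined elsewhere
training_set = nltk.classify.apply_features(extract_features, documents)
cv = KFold(n_splits=10, shuffle=False)

accuracies = []
for train_idx, test_idx in cv.split(range(len(training_set))):
    # Select rows by index; a first:last slice would pull test rows into the training data
    train_data = [training_set[i] for i in train_idx]
    test_data = [training_set[i] for i in test_idx]
    classifier = nltk.NaiveBayesClassifier.train(train_data)
    accuracies.append(nltk.classify.util.accuracy(classifier, test_data))
    print('accuracy:', accuracies[-1])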

At the end, I calculated the average accuracy over all folds.
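
One detail worth checking when fold indices are used with plain Python sequences: for a middle fold the training indices are not contiguous, so a slice from the first to the last training index also pulls in the held-out items. The sketch below illustrates this with hand-written index lists, which are an assumption standing in for what a k-fold splitter would produce for 6 items and 3 folds.

```python
# Hypothetical indices for the middle fold of a 3-fold split over 6 items.
data = ["a", "b", "c", "d", "e", "f"]
train_idx = [0, 1, 4, 5]
test_idx = [2, 3]

# Slicing from the first to the last training index is wrong here:
# it includes the test items 'c' and 'd' and drops the last item 'f'.
sliced = data[train_idx[0]:train_idx[-1]]

# Selecting element by element keeps exactly the training items.
selected = [data[i] for i in train_idx]
held_out = [data[i] for i in test_idx]

print(sliced)
print(selected)
print(held_out)
```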

Answer 3

Answered by user3236650

A modification of the second answer that shuffles the data before splitting it into folds:

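from sklearn.model_selection import KFold  # replaces the removed sklearn.cross_validation module
cv = KFold(n_splits=10, shuffle=True, random_state=None)  # iterate with cv.split(range(len(training_set)))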
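
Shuffling matters when the corpus is ordered, for example when all documents of one class are followed by all documents of another. Here is a minimal sketch of the effect using only the standard library. The toy labels, the fold count, and the seed are assumptions chosen for illustration.

```python
import random

# Toy labels sorted by class, as often happens when a corpus is built per category.
labels = ["neg"] * 10 + ["pos"] * 10
num_folds = 4
fold_size = len(labels) // num_folds


def fold_label_sets(items):
    """Return the set of distinct labels found in each consecutive test fold."""
    return [set(items[i * fold_size:(i + 1) * fold_size]) for i in range(num_folds)]


# Without shuffling, every test fold here contains a single class.
unshuffled = fold_label_sets(labels)

# Shuffle a copy with a fixed seed so the split is reproducible.
shuffled_labels = labels[:]
random.Random(0).shuffle(shuffled_labels)
shuffled = fold_label_sets(shuffled_labels)

print(unshuffled)
print(shuffled)
```

With single-class test folds, each per-fold accuracy measures performance on one class only, which is why a shuffled split is usually the safer default for an ordered corpus.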

Answer 4

Answered by Victor

Inspired by Jared's answer, here is a version using a generator:

def k_fold_generator(X, y, k_fold):
    subset_size = len(X) // k_fold  # integer division, works on Python 2 and 3
    for k in range(k_fold):
        X_train = X[:k * subset_size] + X[(k + 1) * subset_size:]
        X_valid = X[k * subset_size:][:subset_size]
        y_train = y[:k * subset_size] + y[(k + 1) * subset_size:]
        y_valid = y[k * subset_size:][:subset_size]

        yield X_train, y_train, X_valid, y_valid

I am assuming that your data set X has N data points (= 4 in the example below) and D features (= 2 in the example below). The associated N labels are stored in y.

X = [[1, 2], [3, 4], [5, 6], [7, 8]]
y = [0, 0, 1, 1]
k_fold = 2

for X_train, y_train, X_valid, y_valid in k_fold_generator(X, y, k_fold):
    # Train using X_train and y_train
    # Evaluate using X_valid and y_valid
    pass  # placeholder so the loop body is valid Python
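
Note that when len(X) is not a multiple of k_fold, the generator above never places the leftover points at the end of X in a validation fold. A possible variant, sketched below under the hypothetical name k_fold_generator_balanced, spreads the remainder over the first folds so that every point is validated exactly once. It assumes X and y are plain Python lists.

```python
def k_fold_generator_balanced(X, y, k_fold):
    """Yield (X_train, y_train, X_valid, y_valid); fold sizes differ by at most one."""
    base, extra = divmod(len(X), k_fold)
    start = 0
    for k in range(k_fold):
        # The first `extra` folds each take one additional point.
        stop = start + base + (1 if k < extra else 0)
        X_train = X[:start] + X[stop:]
        y_train = y[:start] + y[stop:]
        yield X_train, y_train, X[start:stop], y[start:stop]
        start = stop


X = [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]]
y = [0, 0, 1, 1, 1]

valid_sizes = []
validated = []
for X_train, y_train, X_valid, y_valid in k_fold_generator_balanced(X, y, 2):
    valid_sizes.append(len(X_valid))
    validated.extend(X_valid)

print(valid_sizes)
print(validated == X)
```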

Answer 5

Answered by Salvador Dali

Actually, there is no need for the long loop iterations provided in the most upvoted answer. The choice of classifier is also irrelevant (it can be any classifier).

scikit-learn provides cross_val_score, which does all the looping under the hood.

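from sklearn.model_selection import KFold, cross_val_score  # sklearn.cross_validation was removed in 0.20
from sklearn.naive_bayes import MultinomialNB

k_fold = KFold(n_splits=10, shuffle=True, random_state=0)
clf = MultinomialNB()  # example choice only; any scikit-learn classifier works here
print(cross_val_score(clf, X, y, cv=k_fold, n_jobs=1))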
