Making SVM run faster in Python
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must likewise follow CC BY-SA, cite the original address, and attribute it to the original authors (not me): StackOverflow.
Original source: http://stackoverflow.com/questions/31681373/
Asked by Abhishek Bhatia
Using the code below for SVM in Python:
from sklearn import datasets
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC
iris = datasets.load_iris()
X, y = iris.data, iris.target
clf = OneVsRestClassifier(SVC(kernel='linear', probability=True, class_weight='auto'))
clf.fit(X, y)
proba = clf.predict_proba(X)
But it is taking a huge amount of time.
Actual Data Dimensions:
train-set (1422392,29)
test-set (233081,29)
How can I speed it up (in parallel or some other way)? Please help. I have already tried PCA and downsampling.
I have 6 classes. Edit: I found http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html but I want probability estimates, and it seems SGDClassifier does not provide them for the SVM (hinge) loss.
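For what it's worth, SGDClassifier does expose predict_proba if you switch away from the default hinge loss: it is available for loss='log' and loss='modified_huber'. A minimal sketch (reusing X and y from the code above; parameters are illustrative, not tuned):

from sklearn.linear_model import SGDClassifier

# loss='log' trains logistic regression with SGD and, unlike the
# default hinge loss, supports predict_proba. (Recent sklearn
# versions spell this loss 'log_loss'.)
sgd = SGDClassifier(loss='log', n_jobs=-1)
sgd.fit(X, y)
proba = sgd.predict_proba(X)  # per-class probability estimates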
Edit:
from sklearn import datasets
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.linear_model import SGDClassifier
from sklearn import grid_search
import joblib
import multiprocessing
import numpy as np
import math

def new_func(a):  # converts an array element x to 1/(1 + e^(-x))
    a = 1 / (1 + math.exp(-a))
    return a

if __name__ == '__main__':
    iris = datasets.load_iris()
    cores = multiprocessing.cpu_count() - 2
    X, y = iris.data, iris.target  # load dataset

    C_range = 10.0 ** np.arange(-4, 4)  # C value range
    param_grid = dict(estimator__C=C_range.tolist())

    svr = OneVsRestClassifier(LinearSVC(class_weight='auto'), n_jobs=cores)  # LinearSVC: faster
    #svr = OneVsRestClassifier(SVC(kernel='linear', probability=True,       # SVC: slow
    #                              class_weight='auto'), n_jobs=cores)

    clf = grid_search.GridSearchCV(svr, param_grid, n_jobs=cores, verbose=2)  # grid search
    clf.fit(X, y)  # train the SVM model

    decisions = clf.decision_function(X)  # decision function values
    #prob = clf.predict_proba(X)  # only for SVC; outputs probabilities
    print decisions[:5, :]

    vecfunc = np.vectorize(new_func)
    prob = vecfunc(decisions)  # map decision values through 1/(1 + e^(-x))
    print prob[:5, :]
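As an aside, the element-by-element np.vectorize round-trip above is unnecessary: NumPy can apply the sigmoid to the whole decision matrix in one vectorized call, which is much faster on large arrays. A sketch of the equivalent, assuming decisions as computed above:

import numpy as np

# same transformation as new_func, applied to the whole array at once
prob = 1.0 / (1.0 + np.exp(-decisions))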
Edit 2: The answer by user3914041 yields very poor probability estimates.
Accepted answer by Alexander Bauer
If you want to stick with SVC as much as possible and train on the full dataset, you can use ensembles of SVCs that are trained on subsets of the data to reduce the number of records per classifier (which apparently has quadratic influence on complexity). Scikit supports that with the BaggingClassifier wrapper. That should give you similar (if not better) accuracy compared to a single classifier, with much less training time. The training of the individual classifiers can also be set to run in parallel using the n_jobs parameter.
Alternatively, I would also consider using a Random Forest classifier - it supports multi-class classification natively, it is fast and gives pretty good probability estimates when min_samples_leaf is set appropriately.
I did a quick test on the iris dataset blown up 100 times, using an ensemble of 10 SVCs, each one trained on 10% of the data. It is more than 10 times faster than a single classifier. These are the numbers I got on my laptop:
Single SVC: 45s
Ensemble SVC: 3s
Random Forest Classifier: 0.5s
Below is the code I used to produce those numbers:
import time
import numpy as np
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn import datasets
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

iris = datasets.load_iris()
X, y = iris.data, iris.target
X = np.repeat(X, 100, axis=0)  # blow the dataset up 100x
y = np.repeat(y, 100, axis=0)

# Baseline: a single SVC trained on the full dataset
start = time.time()
clf = OneVsRestClassifier(SVC(kernel='linear', probability=True, class_weight='auto'))
clf.fit(X, y)
end = time.time()
print "Single SVC", end - start, clf.score(X, y)
proba = clf.predict_proba(X)

# Bagging ensemble: 10 SVCs, each trained on 10% of the data
n_estimators = 10
start = time.time()
clf = OneVsRestClassifier(BaggingClassifier(SVC(kernel='linear', probability=True, class_weight='auto'), max_samples=1.0 / n_estimators, n_estimators=n_estimators))
clf.fit(X, y)
end = time.time()
print "Bagging SVC", end - start, clf.score(X, y)
proba = clf.predict_proba(X)

# Random forest: min_samples_leaf keeps the probability estimates sane
start = time.time()
clf = RandomForestClassifier(min_samples_leaf=20)
clf.fit(X, y)
end = time.time()
print "Random Forest", end - start, clf.score(X, y)
proba = clf.predict_proba(X)
If you want to make sure that each record is used only once for training in the BaggingClassifier, you can set the bootstrap parameter to False.
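For instance, a sketch of the ensemble with sampling without replacement (parameters illustrative, matching the benchmark above):

from sklearn.ensemble import BaggingClassifier
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

n_estimators = 10
clf = OneVsRestClassifier(BaggingClassifier(
    SVC(kernel='linear', probability=True, class_weight='auto'),
    max_samples=1.0 / n_estimators,  # each SVC sees ~10% of the data
    n_estimators=n_estimators,
    bootstrap=False,                 # sample without replacement
    n_jobs=-1))                      # train the individual SVCs in parallel
clf.fit(X, y)
proba = clf.predict_proba(X)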
Answer by ldirer
SVM classifiers don't scale so easily. From the docs, on the complexity of sklearn.svm.SVC:
The fit time complexity is more than quadratic with the number of samples which makes it hard to scale to dataset with more than a couple of 10000 samples.
In scikit-learn you have svm.LinearSVC, which scales better. Apparently it should be able to handle your data.
Alternatively you could just go with another classifier. If you want probability estimates, I'd suggest logistic regression. Logistic regression also has the advantage of not needing probability calibration to output 'proper' probabilities.
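A minimal sketch of that alternative (reusing X and y from the question):

from sklearn.linear_model import LogisticRegression

# A linear model like LinearSVC, but predict_proba outputs
# probabilities directly, with no separate calibration step.
lr = LogisticRegression()
lr.fit(X, y)
proba = lr.predict_proba(X)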
Edit:
I did not know about LinearSVC's complexity; I finally found the information in the user guide:
Also note that for the linear case, the algorithm used in LinearSVC by the liblinear implementation is much more efficient than its libsvm-based SVC counterpart and can scale almost linearly to millions of samples and/or features.
To get probabilities out of a LinearSVC, check out this link. It is just a couple of links away from the probability calibration guide I linked above and contains a way to estimate probabilities. Namely:
prob_pos = clf.decision_function(X_test)
# min-max scale the decision values into [0, 1]
prob_pos = (prob_pos - prob_pos.min()) / (prob_pos.max() - prob_pos.min())
Note that the estimates will probably be poor without calibration, as illustrated in the link.
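If you do want calibrated probabilities from a LinearSVC, one option (a sketch, assuming X and y as in the question) is scikit-learn's CalibratedClassifierCV wrapper, which fits a sigmoid calibrator on cross-validation folds:

from sklearn.calibration import CalibratedClassifierCV
from sklearn.svm import LinearSVC

# Platt-style sigmoid calibration on 3 CV folds; the wrapper then
# exposes predict_proba for the underlying LinearSVC.
clf = CalibratedClassifierCV(LinearSVC(), method='sigmoid', cv=3)
clf.fit(X, y)
proba = clf.predict_proba(X)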
Answer by serv-inc
It was briefly mentioned in the top answer; here is the code. The quickest way to do this is via the n_jobs parameter: replace the line
clf = OneVsRestClassifier(SVC(kernel='linear', probability=True, class_weight='auto'))
with
clf = OneVsRestClassifier(SVC(kernel='linear', probability=True, class_weight='auto'), n_jobs=-1)
This will use all available CPUs on your computer, while still doing the same computation as before.
Answer by Andreas Mueller
You can use the kernel_approximation module to scale up SVMs to a large number of samples like this.
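One common pattern (a sketch with illustrative parameters, not tuned for the question's data) combines an approximate kernel feature map with a fast linear classifier, so the non-linearity of a kernel SVM is approximated while training stays roughly linear in the number of samples:

from sklearn.kernel_approximation import Nystroem
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline

# Nystroem computes an approximate RBF kernel feature map; a linear
# classifier trained on those features approximates a kernel SVM.
clf = make_pipeline(
    Nystroem(kernel='rbf', gamma=0.2, n_components=300),
    SGDClassifier())
clf.fit(X, y)
pred = clf.predict(X)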
Answer by Yuhang Lin
Some answers mentioned using class_weight='auto'. For sklearn versions higher than 0.17, use class_weight='balanced' instead:
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html