Python: Why is scikit-learn SVM.SVC() extremely slow?

Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me), citing the original StackOverflow URL: http://stackoverflow.com/questions/40077432/

Why is scikit-learn SVM.SVC() extremely slow?

Tags: python, scikit-learn, svm

Asked by C. Gary

I tried to use an SVM classifier to train on data with about 100k samples, but I found it to be extremely slow, and even after two hours there was no response. When the dataset has around 1k samples, I get the result immediately. I also tried SGDClassifier and naïve Bayes, which are quite fast and gave me results within a couple of minutes. Could you explain this phenomenon?

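A minimal sketch of the kind of setup being described, using synthetic data (the sample counts and estimator defaults are illustrative, not the asker's actual code):

    from sklearn.datasets import make_classification
    from sklearn.linear_model import SGDClassifier
    from sklearn.svm import SVC

    # Synthetic stand-in for the ~100k-sample dataset from the question
    X, y = make_classification(n_samples=100_000, n_features=20, random_state=0)

    SGDClassifier().fit(X, y)  # linear model: finishes in seconds to minutes
    SVC().fit(X, y)            # kernelized SVM: can run for hours at this scale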

Answered by sascha

General remarks about SVM-learning

SVM-training with nonlinear kernels, which is the default in sklearn's SVC, is complexity-wise approximately O(n_samples^2 * n_features) (link to a question where one of sklearn's devs gives this approximation). This applies to the SMO algorithm used within libsvm, which is the core solver in sklearn for this type of problem.

This changes considerably when no kernels are used and one uses sklearn.svm.LinearSVC (based on liblinear) or sklearn.linear_model.SGDClassifier.

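As a quick sketch on synthetic data with default settings, both linear alternatives are drop-in estimators with far better scaling in the number of samples:

    from sklearn.datasets import make_classification
    from sklearn.linear_model import SGDClassifier
    from sklearn.svm import LinearSVC

    X, y = make_classification(n_samples=100_000, n_features=20, random_state=0)

    # liblinear-based linear SVM: no kernel, near-linear scaling in n_samples
    LinearSVC().fit(X, y)

    # stochastic gradient descent on the hinge loss (its default), i.e. a linear SVM
    SGDClassifier(loss="hinge").fit(X, y)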

Returning to the kernelized case, we can do some math to approximate the time difference between 1k and 100k samples:

1k samples:   1,000^2   =      1,000,000 steps = Time X
100k samples: 100,000^2 = 10,000,000,000 steps = Time X * 10,000 !!!
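
The same back-of-the-envelope arithmetic as a quick check:

    # Quadratic scaling: 100x more samples => 100^2 = 10,000x more work
    steps_1k = 1_000 ** 2          # 1,000,000
    steps_100k = 100_000 ** 2      # 10,000,000,000
    print(steps_100k // steps_1k)  # 10000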

This is only an approximation, and the real factor can be somewhat worse or somewhat better (e.g. when setting the cache size, trading off memory for speed gains)!

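For instance, the kernel cache mentioned above is exposed directly on the estimator; a sketch (the 1000 MB value here is just an example, the default is 200 MB):

    from sklearn.svm import SVC

    # A larger kernel cache (in MB) trades memory for speed on bigger datasets;
    # the actual gain depends on the data.
    clf = SVC(cache_size=1000)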

Scikit-learn specific remarks

The situation can also be much more complex because of all the nice stuff scikit-learn does for us behind the scenes. The above is valid for the classic 2-class SVM. If you are trying to learn multi-class data, scikit-learn will automatically use OneVsRest or OneVsAll approaches to do this (as the core SVM algorithm does not support it). Read up on scikit-learn's docs to understand this part.

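To make that wrapping visible, one can apply sklearn's multi-class meta-estimator explicitly; a sketch with toy data (scikit-learn does this for you automatically, so the wrapper here is only illustrative):

    from sklearn.datasets import make_classification
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.svm import SVC

    # 3-class toy problem (n_informative raised so 3 classes are representable)
    X, y = make_classification(n_samples=1_000, n_features=20,
                               n_informative=4, n_classes=3, random_state=0)

    # One binary SVC per class, each trained on the full dataset, so the
    # (already quadratic) training cost is multiplied by the number of classes
    clf = OneVsRestClassifier(SVC(kernel="rbf")).fit(X, y)
    print(len(clf.estimators_))  # 3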

The same warning applies to generating probabilities: SVMs do not naturally produce probabilities for their final predictions. To get them (activated by a parameter), scikit-learn uses a heavy cross-validation procedure called Platt scaling, which also takes a lot of time!

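A sketch of the cost difference: probability=True triggers the extra internal cross-validation and Platt scaling at fit time, so fitting gets noticeably slower:

    from sklearn.datasets import make_classification
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=5_000, n_features=20, random_state=0)

    fast = SVC().fit(X, y)                  # no probability model fitted
    fast.decision_function(X[:5])           # margins are still available

    slow = SVC(probability=True).fit(X, y)  # adds internal CV + Platt scaling
    slow.predict_proba(X[:5])               # calibrated class probabilities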

Scikit-learn documentation

Because sklearn has some of the best docs around, there is often a good section within these docs explaining exactly this kind of behaviour (link):

[screenshot of the linked scikit-learn documentation]
