
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same CC BY-SA license and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/18165213/

Date: 2020-08-19 10:01:13  Source: igfitidea

How much time does it take to train an SVM classifier?

Tags: python, numpy, machine-learning, svm

Asked by Il'ya Zhenin

I wrote the following code and tested it on a small dataset:

from sklearn import svm
from sklearn.multiclass import OneVsRestClassifier

classif = OneVsRestClassifier(svm.SVC(kernel='rbf'))
classif.fit(X, y)

where X and y are numpy arrays (X is a 30000x784 matrix, y a 30000x1 vector). On small datasets the algorithm works well and gives me the right results.

But I started my program about 10 hours ago... and it is still running.

I want to know how long it will take, or whether it is stuck somehow. (Laptop specs: 4 GB memory, Core i5-480M)

Answered by lejlot

SVM training can take arbitrarily long; it depends on dozens of parameters:

  • the C parameter - the greater the misclassification penalty, the slower the process
  • the kernel - the more complicated the kernel, the slower the process (rbf is the most complex of the predefined ones)
  • data size/dimensionality - again, the same rule
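As a hypothetical illustration of the first two points, the sketch below (using a small synthetic dataset, not the asker's data; all sizes are made up) times SVC fits with a linear and an rbf kernel so you can compare them on your own machine:

```python
# Sketch: measure how kernel choice affects SVC training time.
# The dataset here is random and purely illustrative.
import time
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = rng.rand(2000, 50)           # small synthetic dataset
y = (X[:, 0] > 0.5).astype(int)  # simple binary labels

timings = {}
for kernel in ("linear", "rbf"):
    clf = SVC(kernel=kernel, C=1.0)
    start = time.perf_counter()
    clf.fit(X, y)
    timings[kernel] = time.perf_counter() - start

print(timings)
```

The same loop can be repeated over several values of C to see the penalty effect; larger C values generally take longer to converge.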

In general, the basic SMO algorithm is O(n^3), so with 30,000 data points it has to run a number of operations proportional to 30,000^3 = 27,000,000,000,000, which is a really huge number. What are your options?
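A quick sanity check of that back-of-the-envelope count:

```python
# 30,000 data points cubed, per the O(n^3) estimate above
n = 30_000
print(n ** 3)  # 27000000000000, i.e. 2.7e13
```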

  • change the kernel to a linear one - 784 features is quite a lot, and rbf may be redundant
  • reduce the features' dimensionality (PCA?)
  • lower the C parameter
  • train the model on a subset of your data to find good parameters, then train on the whole dataset on some cluster/supercomputer
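A hypothetical sketch combining these suggestions (subsample, reduce dimensionality with PCA, and switch to a linear SVM) could look like the following; the data and all sizes are illustrative stand-ins, not the asker's dataset:

```python
# Sketch of the suggested speedups on random stand-in data.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.svm import LinearSVC

rng = np.random.RandomState(0)
X = rng.rand(6000, 784)            # stand-in for the 30000x784 matrix
y = rng.randint(0, 10, size=6000)  # stand-in labels

# 1) tune on a small subset first
subset = rng.choice(len(X), size=2000, replace=False)
X_sub, y_sub = X[subset], y[subset]

# 2) reduce 784 features to 50 principal components
X_red = PCA(n_components=50).fit_transform(X_sub)

# 3) a linear SVM scales far better with n than kernelized SMO
clf = LinearSVC(C=0.1).fit(X_red, y_sub)
print(X_red.shape, clf.coef_.shape)
```

Once good parameters are found on the subset, the same pipeline can be refit on the full data, where the linear model remains tractable.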