How much time does it take to train an SVM classifier in Python?

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverflow.

Original URL: http://stackoverflow.com/questions/18165213/
Asked by Il'ya Zhenin
I wrote the following code and tested it on small data:
from sklearn import svm
from sklearn.multiclass import OneVsRestClassifier

classif = OneVsRestClassifier(svm.SVC(kernel='rbf'))
classif.fit(X, y)
Here X and y are numpy arrays (X is a 30000x784 matrix, y is 30000x1). On small data the algorithm works well and gives me the right results.
But I started my program about 10 hours ago... and it is still running.
I want to know how long it will take, or whether it is stuck somehow. (Laptop specs: 4 GB memory, Core i5-480M.)
Answered by lejlot
SVM training can take arbitrarily long; it depends on dozens of parameters:

- the C parameter - the greater the misclassification penalty, the slower the process
- the kernel - the more complicated the kernel, the slower the process (rbf is the most complex of the predefined ones)
- data size/dimensionality - again, the same rule
In general, the basic SMO algorithm is O(n^3), so in the case of 30 000 data points it has to run a number of operations proportional to 27 000 000 000 000, which is a really huge number. What are your options?
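To make the cubic scaling concrete, here is a quick back-of-the-envelope check in plain Python (nothing assumed beyond the O(n^3) estimate above):

```python
# Cubic scaling: going from 1 000 to 30 000 samples multiplies
# the work by 30^3 = 27 000, not by 30.
n_small, n_large = 1_000, 30_000

# number of operations (up to a constant factor) for the full dataset
print(n_large ** 3)                  # 27000000000000

# ratio of large-run work to small-run work
print(n_large ** 3 // n_small ** 3)  # 27000
```

So even if a 1 000-sample run finishes in a second, the full 30 000-sample run could plausibly take many hours on the same hardware.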
- change the kernel to a linear one - 784 features is quite a lot, so rbf can be redundant
- reduce the features' dimensionality (PCA?)
- lower the C parameter
- train the model on a subset of your data to find good parameters, and then train on the whole set on some cluster/supercomputer
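The suggestions above can be sketched as follows. This is a minimal illustration, assuming scikit-learn is installed; the dataset here is synthetic stand-in data (the asker's real X and y are not available), and the specific values of `n_components` and `C` are arbitrary choices for demonstration, not tuned recommendations:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.svm import LinearSVC

rng = np.random.RandomState(0)
X = rng.randn(1000, 784)           # stand-in for the 30000x784 matrix
y = rng.randint(0, 10, size=1000)  # stand-in labels

# 1) reduce dimensionality first (784 -> 50 components)
X_reduced = PCA(n_components=50, random_state=0).fit_transform(X)

# 2) use a linear model instead of an rbf SVC; LinearSVC is
#    one-vs-rest by default, so no OneVsRestClassifier wrapper needed
clf = LinearSVC(C=0.1, max_iter=5000)  # lower C -> usually faster convergence

# 3) try parameters on a small subset first...
clf.fit(X_reduced[:200], y[:200])

# 4) ...then train on the full (reduced) data with the chosen C
clf.fit(X_reduced, y)
print(clf.predict(X_reduced[:5]).shape)  # (5,)
```

On random labels the model will not learn anything meaningful, but the same pipeline applied to the real 30000x784 data should train in minutes rather than many hours, since LinearSVC scales roughly linearly with the number of samples.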

