Original URL: http://stackoverflow.com/questions/11707805/
Warning: these answers are provided under the CC BY-SA 4.0 license. You are free to use/share them, but you must attribute them to the original authors (not me): StackOverFlow
Scikit and Pandas: Fitting Large Data
Asked by Ji Park
How do I use scikit-learn to train a model on a large csv data (~75MB) without running into memory problems?
I'm using IPython notebook as the programming environment, and pandas+sklearn packages to analyze data from kaggle's digit recognizer tutorial.
The data is available on the webpage, along with a link to my code and the error message.
KNeighborsClassifier is used for the prediction.
Problem:
"MemoryError" occurs when loading large dataset using read_csv function. To bypass this problem temporarily, I have to restart the kernel, which then read_csv function successfully loads the file, but the same error occurs when I run the same cell again.
When read_csv loads the file successfully, and after I make changes to the dataframe, I can pass the features and labels to KNeighborsClassifier's fit() function. At this point, a similar memory error occurs.
I tried the following:
Iterating through the CSV file in chunks and fitting the data chunk by chunk, but the problem is that the predictive model is overwritten by each successive chunk of data, as in the sketch below.
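For reference, a minimal sketch of that chunked attempt (the file name train.csv and the 'label' column name are assumptions based on the Kaggle digit-recognizer data):

```python
# Chunked reading as described above: each fit() call discards the model
# trained on the previous chunk, so only the last chunk is ever learned.
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

clf = KNeighborsClassifier()
for chunk in pd.read_csv('train.csv', chunksize=10000):
    X = chunk.drop('label', axis=1).values
    y = chunk['label'].values
    clf.fit(X, y)  # overwrites whatever the previous chunk taught the model
```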
What do you think I can do to successfully train my model without running into memory problems?
Accepted answer by ogrisel
Note: when you load the data with pandas it will create a DataFrame object where each column has a homogeneous datatype across all rows, but two different columns can have distinct datatypes (e.g. integers, dates, strings).
When you pass a DataFrame instance to a scikit-learn model, it will first allocate a homogeneous 2D numpy array with dtype np.float32 or np.float64 (depending on the implementation of the model). At this point you will have 2 copies of your dataset in memory.
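To make the duplication concrete, here is a hedged sketch (file and column names assumed from the digit-recognizer data) that performs the dtype conversion explicitly and then releases the DataFrame copy before fitting:

```python
# Convert to the dtype scikit-learn would allocate anyway, then release
# the DataFrame so only one copy of the data stays in memory for fit().
import numpy as np
import pandas as pd

df = pd.read_csv('train.csv')                            # first copy: DataFrame
X = df.drop('label', axis=1).values.astype(np.float32)   # second copy: 2D array
y = df['label'].values.copy()   # copy, so the DataFrame's block can be freed
del df                                                   # drop the first copy
```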
To avoid this you could write / reuse a CSV parser that directly allocates the data in the internal format / dtype expected by the scikit-learn model. You can try numpy.loadtxt for instance (have a look at the docstring for the parameters).
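A short sketch of that route (assuming the Kaggle layout with a header row and the label in column 0):

```python
# Parse the CSV straight into a float32 array, skipping the pandas
# DataFrame and therefore the second in-memory copy entirely.
import numpy as np

data = np.loadtxt('train.csv', dtype=np.float32, delimiter=',', skiprows=1)
X, y = data[:, 1:], data[:, 0]   # pixels, labels
```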
Also, if your data is very sparse (many zero values) it will be better to use a scipy.sparse datastructure and a scikit-learn model that can deal with such an input format (check the docstrings to know). However, the CSV format itself is not very well suited to sparse data, and I am not sure there exists a direct CSV-to-scipy.sparse parser.
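For illustration, a sketch of the scipy.sparse conversion (note it still has to materialize the dense array first, which is exactly the limitation mentioned above):

```python
# Store only the non-zero pixels; MNIST-style images are mostly zeros,
# so the CSR representation is far smaller than the dense array.
import numpy as np
from scipy import sparse

dense = np.loadtxt('train.csv', dtype=np.float32, delimiter=',', skiprows=1)
X_sparse = sparse.csr_matrix(dense[:, 1:])
```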
Edit: for reference, KNeighborsClassifier allocates a temporary distances array with shape (n_samples_predict, n_samples_train), which is very wasteful when only (n_samples_predict, n_neighbors) is needed instead. This issue can be tracked here:
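One workaround (my sketch, not part of the original answer) is to predict in small batches, so the temporary distance array stays at shape (batch_size, n_samples_train):

```python
# Hypothetical helper: batch the predictions to bound the size of the
# temporary distance matrix KNeighborsClassifier allocates internally.
import numpy as np

def predict_in_batches(clf, X, batch_size=1000):
    preds = [clf.predict(X[i:i + batch_size])
             for i in range(0, len(X), batch_size)]
    return np.concatenate(preds)
```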

