Disclaimer: this page was originally published as a Chinese-English parallel translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/11707805/

Time: 2020-09-13 15:46:50  Source: igfitidea

Scikit and Pandas: Fitting Large Data

memory pandas machine-learning scikit-learn classification

Asked by Ji Park

How do I use scikit-learn to train a model on a large CSV dataset (~75 MB) without running into memory problems?

I'm using IPython notebook as the programming environment, and the pandas and sklearn packages to analyze data from Kaggle's digit recognizer tutorial.

The data is available on the webpage, link to my code, and here is the error message:

KNeighborsClassifier is used for the prediction.

Problem:

A "MemoryError" occurs when loading the large dataset with the read_csv function. To work around this temporarily, I have to restart the kernel, after which read_csv loads the file successfully, but the same error occurs when I run the same cell again.

When the read_csv function loads the file successfully, and after I make changes to the dataframe, I can pass the features and labels to the KNeighborsClassifier's fit() function. At this point, a similar memory error occurs.

I tried the following:

Iterating through the CSV file in chunks and fitting the data chunk by chunk, but the problem is that the predictive model is overwritten each time a new chunk is fitted.
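
To illustrate the overwriting problem: calling fit() on each chunk retrains the estimator from scratch, and KNeighborsClassifier has no incremental mode. The sketch below swaps in a different estimator, SGDClassifier, which does offer partial_fit(), so it is not the approach from the question but one way chunked training can accumulate. The column names, the synthetic data, the chunk size, and the set of class labels are all assumptions made for illustration.

```python
import io

import numpy as np
import pandas as pd
from sklearn.linear_model import SGDClassifier

# Synthetic stand-in for the real CSV (assumed layout: a "label" column
# followed by pixel columns). The real file would be read from disk.
rng = np.random.default_rng(0)
n_rows = 200
frame = pd.DataFrame(
    rng.integers(0, 256, size=(n_rows, 4)),
    columns=[f"pixel{i}" for i in range(4)],
)
frame.insert(0, "label", rng.integers(0, 3, size=n_rows))
csv_buffer = io.StringIO(frame.to_csv(index=False))

# partial_fit needs the full set of classes up front, because a single
# chunk is not guaranteed to contain every label.
all_classes = np.array([0, 1, 2])
model = SGDClassifier(random_state=0)

rows_seen = 0
for chunk in pd.read_csv(csv_buffer, chunksize=50):
    y = chunk["label"].to_numpy()
    # Scale pixel values to [0, 1] and use float32 to keep chunks small.
    X = chunk.drop(columns="label").to_numpy(dtype=np.float32) / 255.0
    # Unlike fit(), partial_fit() updates the existing model in place.
    model.partial_fit(X, y, classes=all_classes)
    rows_seen += len(chunk)

print(rows_seen, model.coef_.shape)
```

Only one chunk is held in memory at a time, at the cost of using a linear model instead of nearest neighbours.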

What do you think I can do to successfully train my model without running into memory problems?

Accepted answer by ogrisel

Note: when you load the data with pandas, it creates a DataFrame object in which each column has a homogeneous datatype across all rows, but two different columns can have distinct datatypes (e.g. integer, dates, strings).
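
A minimal sketch of the per-column typing described above. The column names and values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical three-column frame: each column is homogeneous,
# but the columns differ from one another.
df = pd.DataFrame(
    {
        "count": np.array([1, 2, 3], dtype=np.int64),
        "when": pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03"]),
        "name": ["a", "b", "c"],
    }
)

# One dtype per column.
column_dtypes = df.dtypes
print(column_dtypes)
```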

When you pass a DataFrame instance to a scikit-learn model, it will first allocate a homogeneous 2D numpy array with dtype np.float32 or np.float64 (depending on the implementation of the model). At this point you will have two copies of your dataset in memory.
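
The doubling can be approximated outside of scikit-learn. The sketch below assumes a frame of integer pixel columns (the 1000 x 784 shape is illustrative) and mimics the conversion with np.asarray. What a given estimator does internally may differ.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the digit data: 1000 rows of 784 integer pixels.
df = pd.DataFrame(np.zeros((1000, 784), dtype=np.int64))
frame_bytes = int(df.memory_usage(index=False).sum())

# Approximation of the conversion step: int64 columns cannot be viewed as
# float64, so this allocates a second full-size array.
X = np.asarray(df, dtype=np.float64)
array_bytes = X.nbytes

# Requesting float32 halves the size of the second copy.
X32 = df.to_numpy(dtype=np.float32)

print(frame_bytes, array_bytes, X32.nbytes)
```

While both df and X are alive, the process holds roughly frame_bytes + array_bytes for the same data.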

To avoid this, you could write or reuse a CSV parser that directly allocates the data in the internal format and dtype expected by the scikit-learn model. You can try numpy.loadtxt, for instance (have a look at the docstring for the parameters).
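
As a rough sketch of the numpy.loadtxt route: the snippet below parses a tiny in-memory CSV straight into a float32 array, skipping the DataFrame step. The header names and the assumption that the label sits in the first column are illustrative. A real file path can be passed in place of the StringIO object.

```python
import io

import numpy as np

# Tiny stand-in for the real file (assumed layout: header row, label first).
csv_text = "label,pixel0,pixel1,pixel2\n1,0,255,12\n0,7,0,3\n"

# Parse directly into the target dtype; skiprows=1 drops the header line.
data = np.loadtxt(
    io.StringIO(csv_text), delimiter=",", skiprows=1, dtype=np.float32
)

y = data[:, 0].astype(np.int64)  # labels as integers
X = data[:, 1:]                  # features: a view into data, not a copy

print(X.shape, X.dtype, y)
```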

Also, if your data is very sparse (many zero values), it is better to use a scipy.sparse data structure and a scikit-learn model that can handle that input format (check the docstrings to find out). However, the CSV format itself is not well suited to sparse data, and I am not sure a direct CSV-to-scipy.sparse parser exists.
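
To give a feel for the savings, here is a small sketch that compares a mostly-zero dense array with its CSR representation. The shape and sparsity pattern are made up. Note that it builds the sparse matrix from a dense array for brevity, which would defeat the purpose on data that does not fit in memory.

```python
import numpy as np
from scipy import sparse

# Hypothetical mostly-zero feature matrix: 100 non-zero entries out of 784,000.
dense = np.zeros((1000, 784), dtype=np.float32)
dense[::10, 5] = 1.0

# CSR stores only the non-zero values plus their index arrays.
X_sparse = sparse.csr_matrix(dense)
sparse_bytes = (
    X_sparse.data.nbytes + X_sparse.indices.nbytes + X_sparse.indptr.nbytes
)

print(X_sparse.nnz, dense.nbytes, sparse_bytes)
```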

Edit: for reference, KNeighborsClassifier allocates a temporary distance array with shape (n_samples_predict, n_samples_train), which is very wasteful when only (n_samples_predict, n_neighbors) is needed instead. This issue can be tracked here:

https://github.com/scikit-learn/scikit-learn/issues/325
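
A back-of-the-envelope calculation shows why the temporary distance array matters. The sample counts below (42,000 training rows, 28,000 rows to predict), the float64 element size, and n_neighbors = 5 are assumptions for illustration, not figures taken from the question.

```python
def array_bytes(n_rows, n_cols, itemsize=8):
    """Size in bytes of a dense 2D array (itemsize=8 assumes float64)."""
    return n_rows * n_cols * itemsize


n_samples_train = 42_000    # assumed size of the training set
n_samples_predict = 28_000  # assumed number of rows to predict
n_neighbors = 5             # assumed value of k

# Full pairwise distance array versus what is needed per query row.
full_bytes = array_bytes(n_samples_predict, n_samples_train)
needed_bytes = array_bytes(n_samples_predict, n_neighbors)

print(f"{full_bytes / 1e9:.2f} GB vs {needed_bytes / 1e6:.2f} MB")
```

Under these assumptions the full array is several thousand times larger than what is needed. Predicting in smaller batches of rows would bound the first dimension of the temporary array.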
