Original URL: http://stackoverflow.com/questions/11707805/
Warning: these answers are provided under the CC BY-SA 4.0 license. You are free to use/share them, but you must attribute them to the original authors (not me): StackOverFlow
Scikit and Pandas: Fitting Large Data
Asked by Ji Park
How do I use scikit-learn to train a model on a large csv data (~75MB) without running into memory problems?
I'm using IPython notebook as the programming environment, and pandas+sklearn packages to analyze data from kaggle's digit recognizer tutorial.
The data is available on the webpage, along with a link to my code and the error message.
KNeighborsClassifier is used for the prediction.
Problem:
"MemoryError" occurs when loading large dataset using read_csv function. To bypass this problem temporarily, I have to restart the kernel, which then read_csv function successfully loads the file, but the same error occurs when I run the same cell again.
When read_csv loads the file successfully, and after I make changes to the dataframe, I can pass the features and labels to KNeighborsClassifier's fit() function. At this point, a similar memory error occurs.
I tried the following:
Iterating through the CSV file in chunks and fitting the data chunk by chunk, but the problem is that the predictive model is overwritten by each successive chunk of data, as in the sketch below.
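For reference, a minimal sketch of that chunked attempt (the file name train.csv and the 'label' column name are assumptions based on the Kaggle digit-recognizer data):

```python
# Chunked reading as described above: each fit() call discards the model
# trained on the previous chunk, so only the last chunk is ever learned.
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

clf = KNeighborsClassifier()
for chunk in pd.read_csv('train.csv', chunksize=10000):
    X = chunk.drop('label', axis=1).values
    y = chunk['label'].values
    clf.fit(X, y)  # overwrites whatever the previous chunk taught the model
```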
What do you think I can do to successfully train my model without running into memory problems?
Accepted answer by ogrisel
Note: when you load the data with pandas it will create a DataFrame object where each column has a homogeneous datatype across all rows, but two different columns can have distinct datatypes (e.g. integers, dates, strings).
When you pass a DataFrame instance to a scikit-learn model, it will first allocate a homogeneous 2D numpy array with dtype np.float32 or np.float64 (depending on the implementation of the model). At this point you will have 2 copies of your dataset in memory.
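To make the duplication concrete, here is a hedged sketch (file and column names assumed from the digit-recognizer data) that performs the dtype conversion explicitly and then releases the DataFrame copy before fitting:

```python
# Convert to the dtype scikit-learn would allocate anyway, then release
# the DataFrame so only one copy of the data stays in memory for fit().
import numpy as np
import pandas as pd

df = pd.read_csv('train.csv')                            # first copy: DataFrame
X = df.drop('label', axis=1).values.astype(np.float32)   # second copy: 2D array
y = df['label'].values.copy()   # copy, so the DataFrame's block can be freed
del df                                                   # drop the first copy
```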
To avoid this you could write / reuse a CSV parser that directly allocates the data in the internal format / dtype expected by the scikit-learn model. You can try numpy.loadtxt for instance (have a look at the docstring for the parameters).
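A short sketch of that route (assuming the Kaggle layout with a header row and the label in column 0):

```python
# Parse the CSV straight into a float32 array, skipping the pandas
# DataFrame and therefore the second in-memory copy entirely.
import numpy as np

data = np.loadtxt('train.csv', dtype=np.float32, delimiter=',', skiprows=1)
X, y = data[:, 1:], data[:, 0]   # pixels, labels
```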
Also, if your data is very sparse (many zero values) it will be better to use a scipy.sparse datastructure and a scikit-learn model that can deal with such an input format (check the docstrings to know). However, the CSV format itself is not very well suited to sparse data, and I am not sure there exists a direct CSV-to-scipy.sparse parser.
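For illustration, a sketch of the scipy.sparse conversion (note it still has to materialize the dense array first, which is exactly the limitation mentioned above):

```python
# Store only the non-zero pixels; MNIST-style images are mostly zeros,
# so the CSR representation is far smaller than the dense array.
import numpy as np
from scipy import sparse

dense = np.loadtxt('train.csv', dtype=np.float32, delimiter=',', skiprows=1)
X_sparse = sparse.csr_matrix(dense[:, 1:])
```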
Edit: for reference, KNeighborsClassifier allocates a temporary distances array with shape (n_samples_predict, n_samples_train), which is very wasteful when only (n_samples_predict, n_neighbors) is needed instead. This issue can be tracked here:
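One workaround (my sketch, not part of the original answer) is to predict in small batches, so the temporary distance array stays at shape (batch_size, n_samples_train):

```python
# Hypothetical helper: batch the predictions to bound the size of the
# temporary distance matrix KNeighborsClassifier allocates internally.
import numpy as np

def predict_in_batches(clf, X, batch_size=1000):
    preds = [clf.predict(X[i:i + batch_size])
             for i in range(0, len(X), batch_size)]
    return np.concatenate(preds)
```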

