Processing a very very big data set in Python - memory error

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license, cite the original URL, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/14551451/

Processing a very very big data set in python - memory error

Tags: python, numpy, python-2.7, data-analysis

Asked by maheshakya

I'm trying to process data obtained from a CSV file using the csv module in Python. There are about 50 columns and 401125 rows in it. I used the following code chunk to put that data into a list:

import csv

csv_file_object = csv.reader(open(r'some_path\Train.csv','rb'))
header = csv_file_object.next()   # first row of the file holds the column names
data = []
for row in csv_file_object:       # append every remaining row to a plain Python list
    data.append(row)

I can get the length of this list using len(data) and it returns 401125. I can even get each individual record by indexing into the list. But when I try to get the size of the list by calling np.size(data) (I imported numpy as np), I get the following stack trace:

MemoryError                               Traceback (most recent call last)
 in ()
----> 1 np.size(data)

C:\Python27\lib\site-packages\numpy\core\fromnumeric.pyc in size(a, axis)
   2198         return a.size
   2199     except AttributeError:
-> 2200         return asarray(a).size
   2201     else:
   2202         try:

C:\Python27\lib\site-packages\numpy\core\numeric.pyc in asarray(a, dtype, order)
    233
    234     """
--> 235     return array(a, dtype, copy=False, order=order)
    236
    237 def asanyarray(a, dtype=None, order=None):

MemoryError:

I can't even divide that list into multiple parts using list indices or convert it into a numpy array. It gives the same memory error.

How can I deal with this kind of big data sample? Is there any other way to process large data sets like this one?

I'm using IPython Notebook on Windows 7 Professional.

Accepted answer by Dougal

As noted by @DSM in the comments, the reason you're getting a memory error is that calling np.size on a list will copy the data into an array first and then get the size.

If you don't need to work with it as a numpy array, just don't call np.size. If you do want numpy-like indexing options and so on, you have a few options.
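
If all you need is the element count, you can compute it from the list itself, without the array copy that np.size triggers (a small sketch, assuming data is the list built in the question):

# Count elements directly from the list of rows; nothing is copied
# into a numpy array, so there is no MemoryError.
n_rows = len(data)                    # 401125 in the question
n_cols = len(data[0]) if data else 0  # about 50 columns
total_elements = n_rows * n_cols      # what np.size(data) would have reported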

You could use pandas, which is meant for handling big not-necessarily-numerical datasets and has some great helpers and stuff for doing so.
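
A minimal sketch of that approach, assuming pandas is installed and using the same file path as in the question:

import pandas as pd

# read_csv infers a dtype per column, so numeric columns are stored far more
# compactly than in a list of lists of strings
df = pd.read_csv(r'some_path\Train.csv')

print(df.shape)   # (number of rows, number of columns)
print(df.dtypes)  # inferred data type of each column

If even the resulting DataFrame is too large for memory, read_csv also accepts a chunksize argument, which lets you iterate over the file in smaller pieces.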

If you don't want to do that, you could define a numpy structured array and populate it line by line in the first place, rather than making a list and copying into it. Something like:

import csv
import numpy as np

# one (name, dtype) pair per column; the '...' stands for the remaining columns
fields = [('name1', str), ('name2', float), ...]
# num_rows is the number of data rows (401125 in the question)
data = np.zeros((num_rows,), dtype=fields)

csv_file_object = csv.reader(open(r'some_path\Train.csv','rb'))
header = csv_file_object.next()
for i, row in enumerate(csv_file_object):
    data[i] = row                 # fill the structured array row by row

You could also define fields based on header so you don't have to manually type out all 50 column names, though you'd have to do something about specifying the data types for each.
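
For instance, a rough sketch that builds fields from header, assuming (purely for illustration) that every column can be stored as a float:

# one (name, dtype) pair per column, taken from the CSV header row;
# swap float for a more suitable type on non-numeric columns
fields = [(name, float) for name in header]
data = np.zeros((num_rows,), dtype=fields)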
