Pandas read_csv() 1.2GB file out of memory on VM with 140GB RAM

Note: the question and answers below are taken from StackOverflow and are provided under the CC BY-SA 4.0 license. You are free to use and share them, but you must do so under the same license and attribute the original authors (not me). Original: http://stackoverflow.com/questions/40454362/

python, pandas

Asked by David Frank

I am trying to read a 1.2 GB CSV file which contains 25K records, each consisting of an id and a large string.

However, at around 10K rows, I get this error:

pandas.io.common.CParserError: Error tokenizing data. C error: out of memory

This seems weird, since the VM has 140 GB of RAM, and at 10K rows the memory usage is only around 1%.

This is the command I use:

pd.read_csv('file.csv', header=None, names=['id', 'text', 'code'])

I also ran the following dummy program, which successfully filled up my memory to close to 100%.

# Dummy program: keeps appending ever-longer strings until memory fills up
strings = ["hello"]
while True:
    strings.append("hello" + strings[-1])

Answered by kilojoules

This sounds like a job for chunksize. It splits the input into chunks so that only part of the file needs to be held in memory at a time.

df = pd.DataFrame()
# Read the file in 1000-row chunks and append each chunk to the result
for chunk in pd.read_csv('Check1_900.csv', header=None, names=['id', 'text', 'code'], chunksize=1000):
    df = pd.concat([df, chunk], ignore_index=True)
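
If concatenating inside the loop gets slow for many chunks, a common variant (a minimal sketch, reusing the file and column names from the answer above) is to collect the chunks in a list and concatenate once at the end:

import pandas as pd

# Collect the chunks first, then build the DataFrame in a single concat
chunks = []
for chunk in pd.read_csv('Check1_900.csv', header=None,
                         names=['id', 'text', 'code'], chunksize=1000):
    chunks.append(chunk)
df = pd.concat(chunks, ignore_index=True)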

Answered by user8871302

This error can also be caused by an invalid CSV file, rather than by actually running out of memory.

I got this error with a file that was much smaller than my available RAM, and it turned out that one line had an opening double quote without a closing double quote.

In this case, you can check the data, or you can change the quoting behavior of the parser, for example by passing quoting=3 to pd.read_csv.

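A minimal sketch of that workaround, assuming the file name from the question (quoting=3 is csv.QUOTE_NONE):

import csv
import pandas as pd

# QUOTE_NONE tells the parser to treat quote characters as ordinary text,
# so an unbalanced double quote no longer swallows the rest of the file
df = pd.read_csv('file.csv', quoting=csv.QUOTE_NONE)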

Answered by Vel

This is weird.

Actually I ran into the same situation.

df_train = pd.read_csv('./train_set.csv')

But then I tried a lot of things to get around this error, and it worked. Like this:

# Specify explicit dtypes so pandas does not have to infer them
dtypes = {'id': 'int8',
          'article': str,
          'word_seg': str,
          'class': 'int8'}
df_train = pd.read_csv('./train_set.csv', dtype=dtypes)
df_test = pd.read_csv('./test_set.csv', dtype=dtypes)

Or this:

ChunkSize = 10000
i = 1
for chunk in pd.read_csv('./train_set.csv', chunksize=ChunkSize):  # read and merge in chunks
    df_train = chunk if i == 1 else pd.concat([df_train, chunk])
    print('-->Read Chunk...', i)
    i += 1

BUT!!!!! Suddenly the original version works fine as well!

It feels like I did some useless work, and I still have no idea what really went wrong.

I don't know what to say.

Answered by Konark Modi

You can use df.info(memory_usage="deep") to find out the memory usage of the data loaded into the DataFrame.

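For example, a minimal sketch (the file name is taken from the question, just for illustration):

import pandas as pd

df = pd.read_csv('file.csv')
df.info(memory_usage="deep")           # total size, including object (string) contents
print(df.memory_usage(deep=True))      # per-column breakdown in bytes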

A few things to reduce memory:

  1. Only load the columns you need in the processing, via the usecols parameter.
  2. Set dtypes for these columns.
  3. If the dtype is object/string for some columns, you can try using dtype="category". In my experience it reduced the memory usage drastically. A combined sketch is shown after this list.
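
A minimal sketch combining these three points, using hypothetical column names on a file that has a header row:

import pandas as pd

# Load only the needed columns, with explicit dtypes; a low-cardinality string
# column stored as 'category' usually shrinks memory usage considerably
df = pd.read_csv('file.csv',
                 usecols=['id', 'text', 'code'],
                 dtype={'id': 'int32', 'text': str, 'code': 'category'})
df.info(memory_usage="deep")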