Get CParserError. Does pandas pose a limit to the maximum size of a value in a cell?
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me), with a link to the original:
http://stackoverflow.com/questions/32766438/
Asked by zyxue
I have been trying to use pandas to analyze some genomics data. When reading a csv, I get the CParserError: Error tokenizing data. C error: out of memory error, and I have narrowed it down to the particular line that causes it, which is 43452. As shown below, the error doesn't happen until the parser goes beyond line 43452.
I have also pasted the relevant lines from the less output with the long sequences truncated; the second column (seq_len) shows the length of each sequence. As you can see, some of the sequences are fairly long, with a few million characters (i.e. bases in genomics). I wonder if the error is a result of too big a value in the csv. Does pandas pose a limit to the length of a value in a cell? If so, how big is it?
BTW, data.csv.gz is about 9G in size when decompressed, with fewer than 2 million lines. My system has over 100G of memory, so I think physical memory is unlikely to be the cause.
Successful read at Line 43451
In [1]: import pandas as pd
In [2]: df = pd.read_csv('data.csv.gz',
compression='gzip', header=None,
names=['accession', 'seq_len', 'tax_id', 'seq'],
nrows=43451)
Failed read at Line 43452
In [1]: import pandas as pd
In [2]: df = pd.read_csv('data.csv.gz',
compression='gzip', header=None,
names=['accession', 'seq_len', 'tax_id', 'seq'],
nrows=43452)
---------------------------------------------------------------------------
CParserError Traceback (most recent call last)
<ipython-input-1-036af96287f7> in <module>()
----> 1 import pandas as pd; df = pd.read_csv('filtered_gb_concatenated.csv.gz', compression='gzip', header=None, names=['accession', 'seq_len', 'tax_id', 'seq'], nrows=43452)
/path/to/venv/lib/python2.7/site-packages/pandas/io/parsers.pyc in parser_f(filepath_or_buffer, sep, dialect, compression, doublequote, escapechar, quotechar, quoting, skipinitialspace, lineterminator, header, index_col, names, prefix, skiprows, skipfooter, skip_footer, na_values, na_fvalues, true_values, false_values, delimiter, converters, dtype, usecols, engine, delim_whitespace, as_recarray, na_filter, compact_ints, use_unsigned, low_memory, buffer_lines, warn_bad_lines, error_bad_lines, keep_default_na, thousands, comment, decimal, parse_dates, keep_date_col, dayfirst, date_parser, memory_map, float_precision, nrows, iterator, chunksize, verbose, encoding, squeeze, mangle_dupe_cols, tupleize_cols, infer_datetime_format, skip_blank_lines)
472 skip_blank_lines=skip_blank_lines)
473
--> 474 return _read(filepath_or_buffer, kwds)
475
476 parser_f.__name__ = name
/path/to/venv/lib/python2.7/site-packages/pandas/io/parsers.pyc in _read(filepath_or_buffer, kwds)
254 " together yet.")
255 elif nrows is not None:
--> 256 return parser.read(nrows)
257 elif chunksize or iterator:
258 return parser
/path/to/venv/lib/python2.7/site-packages/pandas/io/parsers.pyc in read(self, nrows)
719 raise ValueError('skip_footer not supported for iteration')
720
--> 721 ret = self._engine.read(nrows)
722
723 if self.options.get('as_recarray'):
/path/to/venv/lib/python2.7/site-packages/pandas/io/parsers.pyc in read(self, nrows)
1168
1169 try:
-> 1170 data = self._reader.read(nrows)
1171 except StopIteration:
1172 if nrows is None:
pandas/parser.pyx in pandas.parser.TextReader.read (pandas/parser.c:7544)()
pandas/parser.pyx in pandas.parser.TextReader._read_low_memory (pandas/parser.c:7952)()
pandas/parser.pyx in pandas.parser.TextReader._read_rows (pandas/parser.c:8401)()
pandas/parser.pyx in pandas.parser.TextReader._tokenize_rows (pandas/parser.c:8275)()
pandas/parser.pyx in pandas.parser.raise_parser_error (pandas/parser.c:20691)()
CParserError: Error tokenizing data. C error: out of memory
Lines 43450-43455 of the less -N -S output, with the long seq values truncated. The first column is the line number; after it comes the csv content, separated by commas. The column names are ['accession', 'seq_len', 'tax_id', 'seq']
43450 FP929055.1,3341681,657313,AAAGAACCTTGATAACTGAACAATAGACAACAACAACCCTTGAAAATTTCTTTAAGAGAA....
43451 FP929058.1,3096657,657310,TTCGCGTGGCGACGTCCTACTCTCACAAAGGGAAACCCTTCACTACAATCGGCGCTAAGA....
43452 FP929059.1,2836123,717961,GTTCCTCATCGTTTTTTAAGCTCTTCTCCGTACCCTCGACTGCCTTCTTTCTCACTGTTC....
43453 FP929060.1,3108859,245012,GGGGTATTCATACATACCCTCAAAACCACACATTGAAACTTCCGTTCTTCCTTCTTCCTC....
43454 FP929061.1,3114788,649756,TAACAACAACAGCAACGGTGTAGCTGATGAAGGAGACATATTTGGATGATGAATACTTAA....
43455 FP929063.1,34221,29290,CCTGTCTATGGGATTTGGCAGCGCAATGCAGGAAAACTACGTCCTAAGTGTGGAGATCGATGC....
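One quick sanity check, added here as a sketch rather than part of the original post: parse the suspect line outside pandas with Python's built-in csv module. Note that the csv module has its own field size limit (131072 characters by default), which must be raised before it will accept multi-megabase fields; Python 3 syntax is assumed, whereas the original post ran on Python 2.7.
import csv
import gzip

# The csv module's default field size limit (131072) is far too small for
# multi-million-character sequences, so raise it first.
csv.field_size_limit(10**9)

# 'rt' opens the gzip stream in text mode (Python 3).
with gzip.open('data.csv.gz', 'rt') as fh:
    for lineno, row in enumerate(csv.reader(fh), start=1):
        if lineno == 43452:
            accession, seq_len, tax_id, seq = row
            # If this prints and len(seq) matches seq_len, the field itself
            # is well-formed and the failure lies in the C tokenizer rather
            # than in the data.
            print(accession, seq_len, len(seq))
            break
If the line parses cleanly here, the data itself is fine and the limit being hit is internal to pandas' C parser.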
Answered by asalic
Well, the last line says it all: there isn't enough memory to tokenize a chunk of the data. I'm not sure how the compressed-block reading works or how much data it loads into memory at once, but it's clear that you will have to control the size of the chunks somehow. I found a solution here:
and here:
out-of-memory-error-when-reading-csv-file-in-chunk
Please try reading the normal (decompressed) file line by line and see if it works.
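As a concrete illustration of the chunked approach, here is a minimal sketch (my addition, not part of the original answer). It reads the decompressed file, since some older pandas versions do not support chunked reads of gzip input; the chunk size of 100 rows is an arbitrary starting point, kept small because a single row here can carry a multi-megabase sequence; and process() is a hypothetical placeholder for your own per-chunk analysis.
import pandas as pd

# Read the csv in fixed-size row chunks so the parser never has to hold
# the whole ~9G file in memory at once. Each chunk is a regular DataFrame.
reader = pd.read_csv('data.csv', header=None,
                     names=['accession', 'seq_len', 'tax_id', 'seq'],
                     chunksize=100)

for chunk in reader:
    # Work on each chunk independently; concatenating them all with
    # pd.concat would pull the full dataset back into memory at once.
    process(chunk)  # process() is a hypothetical per-chunk analysis step
If the C tokenizer still fails on a single oversized field even within a small chunk, passing engine='python' to read_csv is another option worth trying; it is slower but avoids the C tokenizer entirely.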
