Original question: http://stackoverflow.com/questions/32766438/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me):
Stack Overflow
Getting CParserError: does pandas impose a limit on the maximum size of a value in a cell?
Asked by zyxue
I have been trying to use pandas to analyze some genomics data. When reading a CSV file, I get the error "CParserError: Error tokenizing data. C error: out of memory", and I have narrowed it down to the particular line that causes it, which is line 43452. As shown below, the error doesn't occur until the parser reaches line 43452.
I have also pasted the relevant lines from the less output, with the long sequences truncated; the second column (seq_len) shows the length of each sequence. As you can see, some of the sequences are fairly long, at a few million characters (i.e. bases in genomics). I wonder if the error is the result of a value in the CSV being too big. Does pandas impose a limit on the length of a value in a cell? If so, how big is it?
BTW, data.csv.gz is about 9 GB when decompressed and has fewer than 2 million lines. My system has over 100 GB of memory, so I think physical memory is unlikely to be the cause.
Successful read at Line 43451
In [1]: import pandas as pd
In [2]: df = pd.read_csv('data.csv.gz',
                         compression='gzip', header=None,
                         names=['accession', 'seq_len', 'tax_id', 'seq'],
                         nrows=43451)
Failed read at Line 43452
In [1]: import pandas as pd
In [2]: df = pd.read_csv('data.csv.gz',
                         compression='gzip', header=None,
                         names=['accession', 'seq_len', 'tax_id', 'seq'],
                         nrows=43452)
---------------------------------------------------------------------------
CParserError Traceback (most recent call last)
<ipython-input-1-036af96287f7> in <module>()
----> 1 import pandas as pd; df = pd.read_csv('filtered_gb_concatenated.csv.gz', compression='gzip', header=None, names=['accession', 'seq_len', 'tax_id', 'seq'], nrows=43452)
/path/to/venv/lib/python2.7/site-packages/pandas/io/parsers.pyc in parser_f(filepath_or_buffer, sep, dialect, compression, doublequote, escapechar, quotechar, quoting, skipinitialspace, lineterminator, header, index_col, names, prefix, skiprows, skipfooter, skip_footer, na_values, na_fvalues, true_values, false_values, delimiter, converters, dtype, usecols, engine, delim_whitespace, as_recarray, na_filter, compact_ints, use_unsigned, low_memory, buffer_lines, warn_bad_lines, error_bad_lines, keep_default_na, thousands, comment, decimal, parse_dates, keep_date_col, dayfirst, date_parser, memory_map, float_precision, nrows, iterator, chunksize, verbose, encoding, squeeze, mangle_dupe_cols, tupleize_cols, infer_datetime_format, skip_blank_lines)
472 skip_blank_lines=skip_blank_lines)
473
--> 474 return _read(filepath_or_buffer, kwds)
475
476 parser_f.__name__ = name
/path/to/venv/lib/python2.7/site-packages/pandas/io/parsers.pyc in _read(filepath_or_buffer, kwds)
254 " together yet.")
255 elif nrows is not None:
--> 256 return parser.read(nrows)
257 elif chunksize or iterator:
258 return parser
/path/to/venv/lib/python2.7/site-packages/pandas/io/parsers.pyc in read(self, nrows)
719 raise ValueError('skip_footer not supported for iteration')
720
--> 721 ret = self._engine.read(nrows)
722
723 if self.options.get('as_recarray'):
/path/to/venv/lib/python2.7/site-packages/pandas/io/parsers.pyc in read(self, nrows)
1168
1169 try:
-> 1170 data = self._reader.read(nrows)
1171 except StopIteration:
1172 if nrows is None:
pandas/parser.pyx in pandas.parser.TextReader.read (pandas/parser.c:7544)()
pandas/parser.pyx in pandas.parser.TextReader._read_low_memory (pandas/parser.c:7952)()
pandas/parser.pyx in pandas.parser.TextReader._read_rows (pandas/parser.c:8401)()
pandas/parser.pyx in pandas.parser.TextReader._tokenize_rows (pandas/parser.c:8275)()
pandas/parser.pyx in pandas.parser.raise_parser_error (pandas/parser.c:20691)()
CParserError: Error tokenizing data. C error: out of memory
Lines 43450-43455 of the less -N -S output, with the long sequences truncated. The first column is the line number, followed by the comma-separated CSV content. The column names are ['accession', 'seq_len', 'tax_id', 'seq']
43450 FP929055.1,3341681,657313,AAAGAACCTTGATAACTGAACAATAGACAACAACAACCCTTGAAAATTTCTTTAAGAGAA....
43451 FP929058.1,3096657,657310,TTCGCGTGGCGACGTCCTACTCTCACAAAGGGAAACCCTTCACTACAATCGGCGCTAAGA....
43452 FP929059.1,2836123,717961,GTTCCTCATCGTTTTTTAAGCTCTTCTCCGTACCCTCGACTGCCTTCTTTCTCACTGTTC....
43453 FP929060.1,3108859,245012,GGGGTATTCATACATACCCTCAAAACCACACATTGAAACTTCCGTTCTTCCTTCTTCCTC....
43454 FP929061.1,3114788,649756,TAACAACAACAGCAACGGTGTAGCTGATGAAGGAGACATATTTGGATGATGAATACTTAA....
43455 FP929063.1,34221,29290,CCTGTCTATGGGATTTGGCAGCGCAATGCAGGAAAACTACGTCCTAAGTGTGGAGATCGATGC....
Answered by asalic
Well, the last line says it all: the parser doesn't have enough memory to tokenize a chunk of data. I'm not sure how the archive block reading works or how much data it loads into memory, but it's clear that you will have to somehow control the size of the chunks. I found a solution here:
and here:
out-of-memory-error-when-reading-csv-file-in-chunk
读取 csv-file-in-chunk 时出现内存不足错误
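Controlling the chunk size could look roughly like the sketch below. It relies on the chunksize parameter of pd.read_csv, which returns an iterator of DataFrames instead of one large frame. The helper names (iter_chunks, summarize) and the default of 1000 rows per chunk are assumptions made for illustration, not something taken from the linked posts, and whether this avoids the tokenizer error on this particular file is untested.

```python
import pandas as pd

# Column names taken from the question.
COLUMNS = ["accession", "seq_len", "tax_id", "seq"]


def iter_chunks(path, rows_per_chunk=1000):
    """Yield DataFrames of at most rows_per_chunk rows from a gzipped CSV."""
    reader = pd.read_csv(
        path,
        compression="gzip",
        header=None,
        names=COLUMNS,
        chunksize=rows_per_chunk,  # makes read_csv return an iterator
    )
    for chunk in reader:
        yield chunk


def summarize(path, rows_per_chunk=1000):
    """Return (total_rows, longest_sequence_length), one chunk at a time."""
    total_rows = 0
    longest = 0
    for chunk in iter_chunks(path, rows_per_chunk):
        total_rows += len(chunk)
        # Length of the longest 'seq' string seen in this chunk.
        longest = max(longest, int(chunk["seq"].str.len().max()))
    return total_rows, longest
```

Calling summarize('data.csv.gz', rows_per_chunk=100) would process the file 100 rows at a time, so only one chunk is held in memory at once.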
Please try reading the uncompressed file line by line and see if that works.
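One way to try the line-by-line read without pandas is sketched below, using only the standard library. It reports the length of every field per line, which also makes it possible to check whether the line that fails really carries an unusually long value. The function names are hypothetical, and splitting on a bare comma assumes the file has no quoted fields containing commas, which seems to hold for the sample lines in the question but is not confirmed. For an already decompressed file, open(path) can be used in place of gzip.open(path, "rt").

```python
import gzip


def field_lengths(path, delimiter=","):
    """Yield (line_number, [len(field), ...]) for each line of a gzipped CSV.

    Only one line is held in memory at a time. Splitting on the bare
    delimiter assumes there are no quoted fields containing the delimiter.
    """
    with gzip.open(path, "rt") as handle:
        for line_number, line in enumerate(handle, start=1):
            fields = line.rstrip("\r\n").split(delimiter)
            yield line_number, [len(field) for field in fields]


def longest_field(path):
    """Return (line_number, length) of the longest single field in the file."""
    best = (0, 0)
    for line_number, lengths in field_lengths(path):
        longest_here = max(lengths)
        if longest_here > best[1]:
            best = (line_number, longest_here)
    return best
```

If this loop gets past line 43452 without trouble, the file itself is readable, and the problem is more likely in how the parser buffers the data than in the size of any single value.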