Can python fastparquet module read in compressed parquet file?

Disclaimer: this page is a translation of a popular StackOverflow question and answer, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same CC BY-SA license and attribute it to the original authors (not me) on StackOverflow.

Original URL: http://stackoverflow.com/questions/42234944/
Asked by user2322784
Our parquet files are stored in an AWS S3 bucket and are compressed with SNAPPY. I was able to use the Python fastparquet module to read in the uncompressed version of a parquet file, but not the compressed version.
This is the code I am using for the uncompressed file:
import s3fs
from fastparquet import ParquetFile

s3 = s3fs.S3FileSystem(key='XESF', secret='dsfkljsf')
myopen = s3.open
pf = ParquetFile('sample/py_test_snappy/part-r-12423423942834.parquet', open_with=myopen)
df = pf.to_pandas()
This returns no error, but when I try to read in the snappy-compressed version of the file:
pf = ParquetFile('sample/py_test_snappy/part-r-12423423942834.snappy.parquet', open_with=myopen)
I get an error from to_pandas():
df = pf.to_pandas()
Error message
KeyErrorTraceback (most recent call last)
 in ()
----> 1 df=pf.to_pandas()

/opt/conda/lib/python3.5/site-packages/fastparquet/api.py in to_pandas(self, columns, categories, filters, index)
    293                          for (name, v) in views.items()}
    294                 self.read_row_group(rg, columns, categories, infile=f,
--> 295                                     index=index, assign=parts)
    296                 start += rg.num_rows
    297             else:

/opt/conda/lib/python3.5/site-packages/fastparquet/api.py in read_row_group(self, rg, columns, categories, infile, index, assign)
    151         core.read_row_group(
    152                 infile, rg, columns, categories, self.helper, self.cats,
--> 153                 self.selfmade, index=index, assign=assign)
    154         if ret:
    155             return df

/opt/conda/lib/python3.5/site-packages/fastparquet/core.py in read_row_group(file, rg, columns, categories, schema_helper, cats, selfmade, index, assign)
    300         raise RuntimeError('Going with pre-allocation!')
    301     read_row_group_arrays(file, rg, columns, categories, schema_helper,
--> 302                           cats, selfmade, assign=assign)
    303
    304     for cat in cats:

/opt/conda/lib/python3.5/site-packages/fastparquet/core.py in read_row_group_arrays(file, rg, columns, categories, schema_helper, cats, selfmade, assign)
    289         read_col(column, schema_helper, file, use_cat=use,
    290                  selfmade=selfmade, assign=out[name],
--> 291                  catdef=out[name+'-catdef'] if use else None)
    292
    293

/opt/conda/lib/python3.5/site-packages/fastparquet/core.py in read_col(column, schema_helper, infile, use_cat, grab_dict, selfmade, assign, catdef)
    196     dic = None
    197     if ph.type == parquet_thrift.PageType.DICTIONARY_PAGE:
--> 198         dic = np.array(read_dictionary_page(infile, schema_helper, ph, cmd))
    199         ph = read_thrift(infile, parquet_thrift.PageHeader)
    200         dic = convert(dic, se)

/opt/conda/lib/python3.5/site-packages/fastparquet/core.py in read_dictionary_page(file_obj, schema_helper, page_header, column_metadata)
    152     Consumes data using the plain encoding and returns an array of values.
    153     """
--> 154     raw_bytes = _read_page(file_obj, page_header, column_metadata)
    155     if column_metadata.type == parquet_thrift.Type.BYTE_ARRAY:
    156         # no faster way to read variable-length-strings?

/opt/conda/lib/python3.5/site-packages/fastparquet/core.py in _read_page(file_obj, page_header, column_metadata)
     28     """Read the data page from the given file-object and convert it to raw, uncompressed bytes (if necessary)."""
     29     raw_bytes = file_obj.read(page_header.compressed_page_size)
---> 30     raw_bytes = decompress_data(raw_bytes, column_metadata.codec)
     31
     32     assert len(raw_bytes) == page_header.uncompressed_page_size, \

/opt/conda/lib/python3.5/site-packages/fastparquet/compression.py in decompress_data(data, algorithm)
     48 def decompress_data(data, algorithm='gzip'):
     49     if isinstance(algorithm, int):
---> 50         algorithm = rev_map[algorithm]
     51     if algorithm.upper() not in decompressions:
     52         raise RuntimeError("Decompression '%s' not available.  Options: %s" %

KeyError: 1
Answered by mdurant
The error likely indicates that the library for decompressing SNAPPY was not found on your system - although clearly the error message could be clearer!
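For context, codec id 1 is SNAPPY in the Parquet format, so the KeyError: 1 raised by the rev_map lookup above most likely means fastparquet never registered a SNAPPY decompressor. A minimal check, assuming the decompressions mapping seen in the traceback is exposed as fastparquet.compression.decompressions:

from fastparquet import compression

# Codec names this fastparquet installation can decompress;
# 'SNAPPY' should appear here once python-snappy is importable.
print(list(compression.decompressions))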
Depending on your system, the following lines may solve this for you:
conda install python-snappy
or
pip install python-snappy
If you are on Windows, the build chain may not work, and you may need to install from here.
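Once python-snappy is installed, no code change should be needed; the call from the question ought to work on the SNAPPY-compressed file. A minimal sketch, reusing the placeholder credentials and path from the question:

import s3fs
from fastparquet import ParquetFile

s3 = s3fs.S3FileSystem(key='XESF', secret='dsfkljsf')  # placeholder credentials from the question
pf = ParquetFile('sample/py_test_snappy/part-r-12423423942834.snappy.parquet',
                 open_with=s3.open)
# With python-snappy available, the SNAPPY data pages can be decompressed.
df = pf.to_pandas()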