Can python fastparquet module read in compressed parquet file?

Disclaimer: this page is a translation of a popular StackOverflow question and answer, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same CC BY-SA license and attribute it to the original authors (not me) on StackOverflow.

Original URL: http://stackoverflow.com/questions/42234944/
Asked by user2322784
Our parquet files are stored in an AWS S3 bucket and are compressed with SNAPPY. I was able to use the Python fastparquet module to read in the uncompressed version of a parquet file, but not the compressed version.
This is the code I am using for the uncompressed file:
import s3fs
from fastparquet import ParquetFile

s3 = s3fs.S3FileSystem(key='XESF', secret='dsfkljsf')
myopen = s3.open
pf = ParquetFile('sample/py_test_snappy/part-r-12423423942834.parquet', open_with=myopen)
df = pf.to_pandas()
This returns no error, but when I try to read in the snappy-compressed version of the file:
pf = ParquetFile('sample/py_test_snappy/part-r-12423423942834.snappy.parquet', open_with=myopen)
I get an error from to_pandas():
df = pf.to_pandas()
Error message
KeyErrorTraceback (most recent call last)
 in ()
----> 1 df=pf.to_pandas()

/opt/conda/lib/python3.5/site-packages/fastparquet/api.py in to_pandas(self, columns, categories, filters, index)
    293                          for (name, v) in views.items()}
    294                 self.read_row_group(rg, columns, categories, infile=f,
--> 295                                     index=index, assign=parts)
    296                 start += rg.num_rows
    297             else:

/opt/conda/lib/python3.5/site-packages/fastparquet/api.py in read_row_group(self, rg, columns, categories, infile, index, assign)
    151         core.read_row_group(
    152                 infile, rg, columns, categories, self.helper, self.cats,
--> 153                 self.selfmade, index=index, assign=assign)
    154         if ret:
    155             return df

/opt/conda/lib/python3.5/site-packages/fastparquet/core.py in read_row_group(file, rg, columns, categories, schema_helper, cats, selfmade, index, assign)
    300         raise RuntimeError('Going with pre-allocation!')
    301     read_row_group_arrays(file, rg, columns, categories, schema_helper,
--> 302                           cats, selfmade, assign=assign)
    303
    304     for cat in cats:

/opt/conda/lib/python3.5/site-packages/fastparquet/core.py in read_row_group_arrays(file, rg, columns, categories, schema_helper, cats, selfmade, assign)
    289         read_col(column, schema_helper, file, use_cat=use,
    290                  selfmade=selfmade, assign=out[name],
--> 291                  catdef=out[name+'-catdef'] if use else None)
    292
    293

/opt/conda/lib/python3.5/site-packages/fastparquet/core.py in read_col(column, schema_helper, infile, use_cat, grab_dict, selfmade, assign, catdef)
    196     dic = None
    197     if ph.type == parquet_thrift.PageType.DICTIONARY_PAGE:
--> 198         dic = np.array(read_dictionary_page(infile, schema_helper, ph, cmd))
    199         ph = read_thrift(infile, parquet_thrift.PageHeader)
    200         dic = convert(dic, se)

/opt/conda/lib/python3.5/site-packages/fastparquet/core.py in read_dictionary_page(file_obj, schema_helper, page_header, column_metadata)
    152     Consumes data using the plain encoding and returns an array of values.
    153     """
--> 154     raw_bytes = _read_page(file_obj, page_header, column_metadata)
    155     if column_metadata.type == parquet_thrift.Type.BYTE_ARRAY:
    156         # no faster way to read variable-length-strings?

/opt/conda/lib/python3.5/site-packages/fastparquet/core.py in _read_page(file_obj, page_header, column_metadata)
     28     """Read the data page from the given file-object and convert it to raw, uncompressed bytes (if necessary)."""
     29     raw_bytes = file_obj.read(page_header.compressed_page_size)
---> 30     raw_bytes = decompress_data(raw_bytes, column_metadata.codec)
     31
     32     assert len(raw_bytes) == page_header.uncompressed_page_size, \

/opt/conda/lib/python3.5/site-packages/fastparquet/compression.py in decompress_data(data, algorithm)
     48 def decompress_data(data, algorithm='gzip'):
     49     if isinstance(algorithm, int):
---> 50         algorithm = rev_map[algorithm]
     51     if algorithm.upper() not in decompressions:
     52         raise RuntimeError("Decompression '%s' not available.  Options: %s" %

KeyError: 1
Answered by mdurant
The error likely indicates that the library for decompressing SNAPPY was not found on your system - although clearly the error message could be clearer!
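For context, codec id 1 is SNAPPY in the Parquet format, so the KeyError: 1 raised by the rev_map lookup above most likely means fastparquet never registered a SNAPPY decompressor. A minimal check, assuming the decompressions mapping seen in the traceback is exposed as fastparquet.compression.decompressions:

from fastparquet import compression

# Codec names this fastparquet installation can decompress;
# 'SNAPPY' should appear here once python-snappy is importable.
print(list(compression.decompressions))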
Depending on your system, the following lines may solve this for you:
conda install python-snappy
or
pip install python-snappy
If you are on Windows, the build chain may not work, and you may need to install from here.
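Once python-snappy is installed, no code change should be needed; the call from the question ought to work on the SNAPPY-compressed file. A minimal sketch, reusing the placeholder credentials and path from the question:

import s3fs
from fastparquet import ParquetFile

s3 = s3fs.S3FileSystem(key='XESF', secret='dsfkljsf')  # placeholder credentials from the question
pf = ParquetFile('sample/py_test_snappy/part-r-12423423942834.snappy.parquet',
                 open_with=s3.open)
# With python-snappy available, the SNAPPY data pages can be decompressed.
df = pf.to_pandas()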