Python: What are the differences between feather and parquet?

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/48083405/

Date: 2020-08-19 18:31:44 · Source: igfitidea

What are the differences between feather and parquet?

Tags: python, pandas, parquet, feather, pyarrow

Asked by Darkonaut

Both are columnar (disk) storage formats for use in data analysis systems. Both are integrated within Apache Arrow (the pyarrow package for Python) and are designed to correspond with Arrow as a columnar in-memory analytics layer.


How do both formats differ?


Should you always prefer feather over parquet when working with pandas, whenever possible?


What are the use cases where feather is more suitable than parquet, and the other way round?




Appendix


I found some hints here https://github.com/wesm/feather/issues/188, but given the young age of this project, it's possibly a bit out of date.


This is not a serious speed test, because I'm just dumping and loading a whole DataFrame, but it should give you some impression if you've never heard of these formats before:


# IPython session
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.feather as feather
import pyarrow.parquet as pq
import fastparquet as fp


df = pd.DataFrame({'one': [-1, np.nan, 2.5],
                   'two': ['foo', 'bar', 'baz'],
                   'three': [True, False, True]})

print("pandas df to disk ####################################################")
print('example_feather:')
%timeit feather.write_feather(df, 'example_feather')
# 2.62 ms ± 35.8 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
print('example_parquet:')
%timeit pq.write_table(pa.Table.from_pandas(df), 'example.parquet')
# 3.19 ms ± 51 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
print()

print("for comparison:")
print('example_pickle:')
%timeit df.to_pickle('example_pickle')
# 2.75 ms ± 18.8 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
print('example_fp_parquet:')
%timeit fp.write('example_fp_parquet', df)
# 7.06 ms ± 205 μs per loop (mean ± std. dev. of 7 runs, 1 loop each)
print('example_hdf:')
%timeit df.to_hdf('example_hdf', 'key_to_store', mode='w', table=True)
# 24.6 ms ± 4.45 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
print()

print("pandas df from disk ##################################################")
print('example_feather:')
%timeit feather.read_feather('example_feather')
# 969 μs ± 1.8 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
print('example_parquet:')
%timeit pq.read_table('example.parquet').to_pandas()
# 1.9 ms ± 5.5 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

print("for comparison:")
print('example_pickle:')
%timeit pd.read_pickle('example_pickle')
# 1.07 ms ± 6.21 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
print('example_fp_parquet:')
%timeit fp.ParquetFile('example_fp_parquet').to_pandas()
# 4.53 ms ± 260 μs per loop (mean ± std. dev. of 7 runs, 1 loop each)
print('example_hdf:')
%timeit pd.read_hdf('example_hdf')
# 10 ms ± 43.4 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# pandas version: 0.22.0
# fastparquet version: 0.1.3
# numpy version: 1.13.3
# pyarrow version: 0.8.0
# sys.version: 3.6.3
# example Dataframe taken from https://arrow.apache.org/docs/python/parquet.html

Answered by Wes McKinney

  • Parquet format is designed for long-term storage, where Arrow is more intended for short term or ephemeral storage (Arrow may be more suitable for long-term storage after the 1.0.0 release happens, since the binary format will be stable then)

  • Parquet is more expensive to write than Feather as it features more layers of encoding and compression. Feather is unmodified raw columnar Arrow memory. We will probably add simple compression to Feather in the future.

  • Due to dictionary encoding, RLE encoding, and data page compression, Parquet files will often be much smaller than Feather files

  • Parquet is a standard storage format for analytics that's supported by many different systems: Spark, Hive, Impala, various AWS services, in future by BigQuery, etc. So if you are doing analytics, Parquet is a good option as a reference storage format for query by multiple systems


The benchmarks you showed are going to be very noisy, since the data you read and wrote is very small. You should try compressing at least 100MB or upwards of 1GB of data to get more informative benchmarks; see e.g. http://wesmckinney.com/blog/python-parquet-multithreading/


Hope this helps
