What is the fastest way to upload a big csv file in notebook to work with python pandas?
原文地址: http://stackoverflow.com/questions/37010212/
Warning: these are provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me):
StackOverFlow
Asked by hernanavella
I'm trying to upload a csv file, which is 250MB. Basically 4 million rows and 6 columns of time series data (1min). The usual procedure is:
location = r'C:\Users\Name\Folder_1\Folder_2\file.csv'
df = pd.read_csv(location)
This procedure takes about 20 minutes!!! Very preliminarily, I have explored the following options:
I wonder if anybody has compared these options (or more) and whether there is a clear winner. If nobody answers, I will post my results in the future. I just don't have time right now.
Answered by MaxU
Here are the results of my read and write comparison for the DF (shape: 4000000 x 6, size in memory: 183.1 MB, size of the uncompressed CSV: 492 MB).
Comparison for the following storage formats: CSV, CSV.gzip, Pickle, HDF5 (various compression settings):
read_s write_s size_ratio_to_CSV
storage
CSV 17.900 69.00 1.000
CSV.gzip 18.900 186.00 0.047
Pickle 0.173 1.77 0.374
HDF_fixed 0.196 2.03 0.435
HDF_tab 0.230 2.60 0.437
HDF_tab_zlib_c5 0.845 5.44 0.035
HDF_tab_zlib_c9 0.860 5.95 0.035
HDF_tab_bzip2_c5 2.500 36.50 0.011
HDF_tab_bzip2_c9 2.500 36.50 0.011
read_s - reading time in seconds
write_s - writing/saving time in seconds
size_ratio_to_CSV - file size ratio in relation to the uncompressed CSV file
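For reference, here is a minimal sketch (my own illustration, not the actual harness used here) of how a read/write comparison like the table above could be assembled with time.perf_counter instead of IPython's %timeit; the file names and the small stand-in frame are assumptions, and the HDF cases need PyTables installed:

import time
import pandas as pd

# small stand-in for the 4,000,000 x 6 datetime frame benchmarked in this answer
df = pd.DataFrame({col: pd.date_range('1970-01-01', periods=100_000, freq='min')
                   for col in 'abcdef'})

def timed(func):
    # run func once and return the elapsed wall-clock time in seconds
    start = time.perf_counter()
    func()
    return time.perf_counter() - start

cases = {
    'CSV':       (lambda: df.to_csv('bench.csv'),
                  lambda: pd.read_csv('bench.csv')),
    'Pickle':    (lambda: df.to_pickle('bench.pickle'),
                  lambda: pd.read_pickle('bench.pickle')),
    'HDF_fixed': (lambda: df.to_hdf('bench_fixed.h5', key='df'),
                  lambda: pd.read_hdf('bench_fixed.h5', key='df')),
    'HDF_tab':   (lambda: df.to_hdf('bench_tab.h5', key='df', format='table'),
                  lambda: pd.read_hdf('bench_tab.h5', key='df')),
}

rows = []
for name, (write, read) in cases.items():
    rows.append({'storage': name, 'write_s': timed(write), 'read_s': timed(read)})

print(pd.DataFrame(rows).set_index('storage'))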
RAW DATA:
CSV:
In [68]: %timeit df.to_csv(fcsv)
1 loop, best of 3: 1min 9s per loop
In [74]: %timeit pd.read_csv(fcsv)
1 loop, best of 3: 17.9 s per loop
CSV.gzip:
In [70]: %timeit df.to_csv(fcsv_gz, compression='gzip')
1 loop, best of 3: 3min 6s per loop
In [75]: %timeit pd.read_csv(fcsv_gz)
1 loop, best of 3: 18.9 s per loop
Pickle:
In [66]: %timeit df.to_pickle(fpckl)
1 loop, best of 3: 1.77 s per loop
In [72]: %timeit pd.read_pickle(fpckl)
10 loops, best of 3: 173 ms per loop
HDF (format='fixed') [Default]:
In [67]: %timeit df.to_hdf(fh5, 'df')
1 loop, best of 3: 2.03 s per loop
In [73]: %timeit pd.read_hdf(fh5, 'df')
10 loops, best of 3: 196 ms per loop
HDF (format='table'):
In [37]: %timeit df.to_hdf(r'D:\temp\.data\37010212_tab.h5', 'df', format='t')
1 loop, best of 3: 2.6 s per loop
In [38]: %timeit pd.read_hdf(r'D:\temp\.data\37010212_tab.h5', 'df')
1 loop, best of 3: 230 ms per loop
HDF (format='table', complib='zlib', complevel=5):
In [40]: %timeit df.to_hdf(r'D:\temp\.data\37010212_tab_compress_zlib5.h5', 'df', format='t', complevel=5, complib='zlib')
1 loop, best of 3: 5.44 s per loop
In [41]: %timeit pd.read_hdf(r'D:\temp\.data\37010212_tab_compress_zlib5.h5', 'df')
1 loop, best of 3: 854 ms per loop
HDF (format='table', complib='zlib', complevel=9):
In [36]: %timeit df.to_hdf(r'D:\temp\.data\37010212_tab_compress_zlib9.h5', 'df', format='t', complevel=9, complib='zlib')
1 loop, best of 3: 5.95 s per loop
In [39]: %timeit pd.read_hdf(r'D:\temp\.data\37010212_tab_compress_zlib9.h5', 'df')
1 loop, best of 3: 860 ms per loop
HDF (format='table', complib='bzip2', complevel=5):
In [42]: %timeit df.to_hdf(r'D:\temp\.data\37010212_tab_compress_bzip2_l5.h5', 'df', format='t', complevel=5, complib='bzip2')
1 loop, best of 3: 36.5 s per loop
In [43]: %timeit pd.read_hdf(r'D:\temp\.data\37010212_tab_compress_bzip2_l5.h5', 'df')
1 loop, best of 3: 2.5 s per loop
HDF (format='table', complib='bzip2', complevel=9):
In [42]: %timeit df.to_hdf(r'D:\temp\.data\37010212_tab_compress_bzip2_l9.h5', 'df', format='t', complevel=9, complib='bzip2')
1 loop, best of 3: 36.5 s per loop
In [43]: %timeit pd.read_hdf(r'D:\temp\.data\37010212_tab_compress_bzip2_l9.h5', 'df')
1 loop, best of 3: 2.5 s per loop
PS: I can't test feather on my Windows notebook.
DF info:
In [49]: df.shape
Out[49]: (4000000, 6)
In [50]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4000000 entries, 0 to 3999999
Data columns (total 6 columns):
a datetime64[ns]
b datetime64[ns]
c datetime64[ns]
d datetime64[ns]
e datetime64[ns]
f datetime64[ns]
dtypes: datetime64[ns](6)
memory usage: 183.1 MB
In [41]: df.head()
Out[41]:
a b c \
0 1970-01-01 00:00:00 1970-01-01 00:01:00 1970-01-01 00:02:00
1 1970-01-01 00:01:00 1970-01-01 00:02:00 1970-01-01 00:03:00
2 1970-01-01 00:02:00 1970-01-01 00:03:00 1970-01-01 00:04:00
3 1970-01-01 00:03:00 1970-01-01 00:04:00 1970-01-01 00:05:00
4 1970-01-01 00:04:00 1970-01-01 00:05:00 1970-01-01 00:06:00
d e f
0 1970-01-01 00:03:00 1970-01-01 00:04:00 1970-01-01 00:05:00
1 1970-01-01 00:04:00 1970-01-01 00:05:00 1970-01-01 00:06:00
2 1970-01-01 00:05:00 1970-01-01 00:06:00 1970-01-01 00:07:00
3 1970-01-01 00:06:00 1970-01-01 00:07:00 1970-01-01 00:08:00
4 1970-01-01 00:07:00 1970-01-01 00:08:00 1970-01-01 00:09:00
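One detail worth noting if you reproduce this: pd.read_csv brings the six timestamp columns back as plain strings unless you tell it to parse them, so restoring the dtypes shown above on a re-read would look something like this (a sketch; index_col=0 assumes the default index was written out, which to_csv does unless index=False is passed):

import pandas as pd

df = pd.read_csv('37010212.csv', index_col=0, parse_dates=list('abcdef'))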
File sizes:
{ .data } ? ls -lh 37010212.* /d/temp/.data
-rw-r--r-- 1 Max None 492M May 3 22:21 37010212.csv
-rw-r--r-- 1 Max None 23M May 3 22:19 37010212.csv.gz
-rw-r--r-- 1 Max None 214M May 3 22:02 37010212.h5
-rw-r--r-- 1 Max None 184M May 3 22:02 37010212.pickle
-rw-r--r-- 1 Max None 215M May 4 10:39 37010212_tab.h5
-rw-r--r-- 1 Max None 5.4M May 4 10:46 37010212_tab_compress_bzip2_l5.h5
-rw-r--r-- 1 Max None 5.4M May 4 10:51 37010212_tab_compress_bzip2_l9.h5
-rw-r--r-- 1 Max None 17M May 4 10:42 37010212_tab_compress_zlib5.h5
-rw-r--r-- 1 Max None 17M May 4 10:36 37010212_tab_compress_zlib9.h5
Conclusion:
Pickle and HDF5 are much faster, but HDF5 is more convenient - you can store multiple tables/frames inside one file, you can read your data conditionally (look at the where parameter in read_hdf()), and you can also store your data compressed (zlib is faster, bzip2 provides a better compression ratio), etc.
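A short sketch of those HDF5 conveniences (the file name, keys and the where condition below are my own illustration; format='table' plus data_columns is what makes the column queryable, and PyTables must be installed):

import pandas as pd

df = pd.DataFrame({col: pd.date_range('1970-01-01', periods=10_000, freq='min')
                   for col in 'abcdef'})

# several frames can live in one HDF5 file under different keys
df.to_hdf('data.h5', key='minute_bars', format='table',
          data_columns=['a'], complib='zlib', complevel=5)
df.head(100).to_hdf('data.h5', key='sample', format='table')

# conditional read: only the rows where column 'a' falls in the first hour
first_hour = pd.read_hdf('data.h5', key='minute_bars',
                         where="a < '1970-01-01 01:00:00'")
print(first_hour.shape)   # (60, 6)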
PS: if you can build/use feather-format, it should be even faster compared to HDF5 and Pickle.
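For completeness, the feather round trip would look like this (my own sketch; it needs pyarrow / feather-format installed, and the file name is illustrative):

import pandas as pd

df = pd.DataFrame({col: pd.date_range('1970-01-01', periods=10_000, freq='min')
                   for col in 'abcdef'})

df.to_feather('bench.feather')          # requires pyarrow
df2 = pd.read_feather('bench.feather')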
PPS: don't use Pickle for big data frames, as you may end up with a "SystemError: error return without exception set" error message. It's also described here and here.
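Putting this conclusion into practice for the original question, a minimal read-the-CSV-once sketch (the cache file name and the HDF key are my own assumptions, not part of the answer):

import os
import pandas as pd

csv_path = r'C:\Users\Name\Folder_1\Folder_2\file.csv'
cache_path = 'file_cache.h5'          # hypothetical cache file next to the notebook

if os.path.exists(cache_path):
    # fast path on every later run, much quicker than re-parsing the CSV per the benchmark above
    df = pd.read_hdf(cache_path, key='df')
else:
    # slow path, paid only once
    df = pd.read_csv(csv_path)
    df.to_hdf(cache_path, key='df', mode='w')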