Notice: this page is derived from a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use or share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/37010212/

What is the fastest way to upload a big csv file in notebook to work with python pandas?

Tags: python, csv, pandas, dataframe

Asked by hernanavella

I'm trying to load a 250 MB csv file: roughly 4 million rows and 6 columns of time series data (1-minute frequency). The usual procedure is:

import pandas as pd

location = r'C:\Users\Name\Folder_1\Folder_2\file.csv'
df = pd.read_csv(location)

This procedure takes about 20 minutes! So far I have only explored a few alternative options, and only in a very preliminary way.

I wonder if anybody has compared these options (or others) and whether there is a clear winner. If nobody answers, I will post my own results in the future. I just don't have time right now.

Answered by MaxU

Here are the results of my read and write comparison for the DF (shape: 4000000 x 6, size in memory: 183.1 MB, size of uncompressed CSV: 492 MB).

Comparison of the following storage formats (CSV, CSV.gzip, Pickle, HDF5 [various compression settings]):

                  read_s  write_s  size_ratio_to_CSV
storage
CSV               17.900    69.00              1.000
CSV.gzip          18.900   186.00              0.047
Pickle             0.173     1.77              0.374
HDF_fixed          0.196     2.03              0.435
HDF_tab            0.230     2.60              0.437
HDF_tab_zlib_c5    0.845     5.44              0.035
HDF_tab_zlib_c9    0.860     5.95              0.035
HDF_tab_bzip2_c5   2.500    36.50              0.011
HDF_tab_bzip2_c9   2.500    36.50              0.011
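
The timings in this table come from IPython's %timeit. Outside a notebook, a rough equivalent can be sketched with time.perf_counter. The frame size (100,000 rows of random floats), the helper name time_call, and the temporary file names are illustrative assumptions rather than part of the original benchmark, so the absolute numbers will differ from those above.

```python
import os
import tempfile
import time

import numpy as np
import pandas as pd


def time_call(func, *args, **kwargs):
    """Hypothetical helper: return (result, elapsed_seconds) for one call."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


# Small illustrative frame; the original benchmark used 4,000,000 rows.
n_rows = 100_000
df = pd.DataFrame(np.random.rand(n_rows, 6), columns=list("abcdef"))

timings = {}
with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "sample.csv")
    pkl_path = os.path.join(tmp, "sample.pickle")

    # Write and read each format once, recording the wall-clock time.
    _, timings["csv_write"] = time_call(df.to_csv, csv_path, index=False)
    df_csv, timings["csv_read"] = time_call(pd.read_csv, csv_path)
    _, timings["pickle_write"] = time_call(df.to_pickle, pkl_path)
    df_pkl, timings["pickle_read"] = time_call(pd.read_pickle, pkl_path)

    # Size of the pickle file relative to the uncompressed CSV.
    size_ratio = os.path.getsize(pkl_path) / os.path.getsize(csv_path)

for name, seconds in timings.items():
    print(f"{name:>12}: {seconds:.3f} s")
print(f"pickle size ratio to CSV: {size_ratio:.3f}")
```

A single call per operation is noisy; repeating each call a few times and keeping the best time, as %timeit does, gives steadier numbers.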

[Chart: reading time per storage format]

[Chart: writing/saving time per storage format]

[Chart: file size ratio relative to the uncompressed CSV file]

RAW DATA:

CSV:

In [68]: %timeit df.to_csv(fcsv)
1 loop, best of 3: 1min 9s per loop

In [74]: %timeit pd.read_csv(fcsv)
1 loop, best of 3: 17.9 s per loop

CSV.gzip:

In [70]: %timeit df.to_csv(fcsv_gz, compression='gzip')
1 loop, best of 3: 3min 6s per loop

In [75]: %timeit pd.read_csv(fcsv_gz)
1 loop, best of 3: 18.9 s per loop

Pickle:

In [66]: %timeit df.to_pickle(fpckl)
1 loop, best of 3: 1.77 s per loop

In [72]: %timeit pd.read_pickle(fpckl)
10 loops, best of 3: 173 ms per loop

HDF (format='fixed') [Default]:

In [67]: %timeit df.to_hdf(fh5, 'df')
1 loop, best of 3: 2.03 s per loop

In [73]: %timeit pd.read_hdf(fh5, 'df')
10 loops, best of 3: 196 ms per loop

HDF (format='table'):

In [37]: %timeit df.to_hdf(r'D:\temp\.data\37010212_tab.h5', 'df', format='t')
1 loop, best of 3: 2.6 s per loop

In [38]: %timeit pd.read_hdf(r'D:\temp\.data\37010212_tab.h5', 'df')
1 loop, best of 3: 230 ms per loop

HDF (format='table', complib='zlib', complevel=5):

In [40]: %timeit df.to_hdf(r'D:\temp\.data\37010212_tab_compress_zlib5.h5', 'df', format='t', complevel=5, complib='zlib')
1 loop, best of 3: 5.44 s per loop

In [41]: %timeit pd.read_hdf(r'D:\temp\.data\37010212_tab_compress_zlib5.h5', 'df')
1 loop, best of 3: 854 ms per loop

HDF (format='table', complib='zlib', complevel=9):

In [36]: %timeit df.to_hdf(r'D:\temp\.data\37010212_tab_compress_zlib9.h5', 'df', format='t', complevel=9, complib='zlib')
1 loop, best of 3: 5.95 s per loop

In [39]: %timeit pd.read_hdf(r'D:\temp\.data\37010212_tab_compress_zlib9.h5', 'df')
1 loop, best of 3: 860 ms per loop

HDF (format='table', complib='bzip2', complevel=5):

In [42]: %timeit df.to_hdf(r'D:\temp\.data\37010212_tab_compress_bzip2_l5.h5', 'df', format='t', complevel=5, complib='bzip2')
1 loop, best of 3: 36.5 s per loop

In [43]: %timeit pd.read_hdf(r'D:\temp\.data\37010212_tab_compress_bzip2_l5.h5', 'df')
1 loop, best of 3: 2.5 s per loop

HDF (format='table', complib='bzip2', complevel=9):

In [42]: %timeit df.to_hdf(r'D:\temp\.data\37010212_tab_compress_bzip2_l9.h5', 'df', format='t', complevel=9, complib='bzip2')
1 loop, best of 3: 36.5 s per loop

In [43]: %timeit pd.read_hdf(r'D:\temp\.data\37010212_tab_compress_bzip2_l9.h5', 'df')
1 loop, best of 3: 2.5 s per loop

PS: I can't test feather on my Windows notebook.

DF info:

In [49]: df.shape
Out[49]: (4000000, 6)

In [50]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4000000 entries, 0 to 3999999
Data columns (total 6 columns):
a    datetime64[ns]
b    datetime64[ns]
c    datetime64[ns]
d    datetime64[ns]
e    datetime64[ns]
f    datetime64[ns]
dtypes: datetime64[ns](6)
memory usage: 183.1 MB

In [41]: df.head()
Out[41]:
                    a                   b                   c  \
0 1970-01-01 00:00:00 1970-01-01 00:01:00 1970-01-01 00:02:00
1 1970-01-01 00:01:00 1970-01-01 00:02:00 1970-01-01 00:03:00
2 1970-01-01 00:02:00 1970-01-01 00:03:00 1970-01-01 00:04:00
3 1970-01-01 00:03:00 1970-01-01 00:04:00 1970-01-01 00:05:00
4 1970-01-01 00:04:00 1970-01-01 00:05:00 1970-01-01 00:06:00

                    d                   e                   f
0 1970-01-01 00:03:00 1970-01-01 00:04:00 1970-01-01 00:05:00
1 1970-01-01 00:04:00 1970-01-01 00:05:00 1970-01-01 00:06:00
2 1970-01-01 00:05:00 1970-01-01 00:06:00 1970-01-01 00:07:00
3 1970-01-01 00:06:00 1970-01-01 00:07:00 1970-01-01 00:08:00
4 1970-01-01 00:07:00 1970-01-01 00:08:00 1970-01-01 00:09:00
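
The answer does not show how this frame was built. One possible way to produce a frame with the same column names and the same first rows is sketched below; the construction itself and the reduced row count (1,000 instead of 4,000,000) are assumptions made for illustration.

```python
import pandas as pd

n_rows = 1_000  # assumption: reduced size; the benchmark used 4_000_000 rows

# One timestamp per minute, starting at the Unix epoch.
base = pd.date_range("1970-01-01", periods=n_rows, freq="min")

# Column number k (a=0, b=1, ...) is the base series shifted forward by k minutes.
df = pd.DataFrame(
    {name: base + pd.Timedelta(minutes=k) for k, name in enumerate("abcdef")}
)

print(df.shape)
print(df.head())
```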

File sizes:

$ ls -lh 37010212.*    # run in /d/temp/.data
-rw-r--r-- 1 Max None 492M May  3 22:21 37010212.csv
-rw-r--r-- 1 Max None  23M May  3 22:19 37010212.csv.gz
-rw-r--r-- 1 Max None 214M May  3 22:02 37010212.h5
-rw-r--r-- 1 Max None 184M May  3 22:02 37010212.pickle
-rw-r--r-- 1 Max None 215M May  4 10:39 37010212_tab.h5
-rw-r--r-- 1 Max None 5.4M May  4 10:46 37010212_tab_compress_bzip2_l5.h5
-rw-r--r-- 1 Max None 5.4M May  4 10:51 37010212_tab_compress_bzip2_l9.h5
-rw-r--r-- 1 Max None  17M May  4 10:42 37010212_tab_compress_zlib5.h5
-rw-r--r-- 1 Max None  17M May  4 10:36 37010212_tab_compress_zlib9.h5

Conclusion:

Pickle and HDF5 are much faster than CSV, but HDF5 is more convenient: you can store multiple tables/frames in one file, you can read your data conditionally (look at the where parameter of read_hdf()), and you can also store your data compressed (zlib is faster, bzip2 provides a better compression ratio), etc.
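
As a minimal sketch of the conditional read mentioned above: the tiny frame, the file name example.h5, and the condition a >= 7 are made up for illustration. Writing HDF5 from pandas requires the optional PyTables package, so the example skips itself if it is missing.

```python
import os
import tempfile

import pandas as pd

# Tiny illustrative frame (not the benchmark data).
df = pd.DataFrame({"a": range(10), "b": [x * 0.5 for x in range(10)]})

subset = None
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.h5")
    try:
        # format="table" is needed for querying; data_columns makes "a" searchable.
        # complevel/complib store the table compressed with zlib.
        df.to_hdf(path, key="df", format="table", data_columns=["a"],
                  complevel=5, complib="zlib")
        # Only the rows matching the condition are loaded from disk.
        subset = pd.read_hdf(path, key="df", where="a >= 7")
    except ImportError:
        print("PyTables is not installed; skipping the HDF5 example.")

if subset is not None:
    print(subset)
```

The default format='fixed' is slightly faster in the timings above but does not support where queries, which is why the table format is used here.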

PS: if you can build/use feather-format, it should be even faster than HDF5 and Pickle.
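
For completeness, a feather round trip in pandas might look like the sketch below. It was not part of the benchmark, so no timing is claimed here; the frame and the file name are illustrative, and the calls depend on the optional pyarrow package, so the example skips itself if it is missing.

```python
import os
import tempfile

import pandas as pd

# Tiny illustrative frame with one datetime column and one integer column.
df = pd.DataFrame({
    "a": pd.date_range("1970-01-01", periods=5, freq="min"),
    "b": range(5),
})

restored = None
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.feather")
    try:
        df.to_feather(path)               # write the frame in feather format
        restored = pd.read_feather(path)  # read it back into a new frame
    except ImportError:
        print("pyarrow is not installed; skipping the feather example.")

if restored is not None:
    print(restored)
```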

PPS: don't use Pickle for very big data frames, as you may end up with a SystemError: error return without exception set error message. It's also described here and here.