Pandas read_csv speed up
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/42749107/
Asked by MysterioProgrammer91
I am reading a large csv which has around 10 million rows and 20 different columns (with header names).
I have numeric values, two columns with dates, and some string columns.
Currently it takes me around 1.5 minutes to load the data with something like this:
df = pd.read_csv('data.csv', index_col='date', parse_dates=['date'])
I want to ask: how can I make this significantly faster while still ending up with the same dataframe once the data is read?
I tried using HDF5 database, but it was just as slow.
Here is a subset of the data I am trying to read (I chose 8 of the actual 20 columns and 3 rows out of a couple million):
Date Comp Rating Price Estprice Dividend? Date_earnings Returns
3/12/2017 Apple Buy 100 114 Yes 4/4/2017 0.005646835
3/12/2017 Blackberry Sell 120 97 No 4/25/2017 0.000775331
3/12/2017 Microsoft Hold 140 100 Yes 5/28/2017 0.003028423
Thanks for the advice.
Accepted answer by MaxU
Let's test it!
Data generation:
import numpy as np
import pandas as pd

sz = 10**3
df = pd.DataFrame(np.random.randint(0, 10**6, (sz, 2)), columns=['i1','i2'])
df['date'] = pd.date_range('2000-01-01', freq='1S', periods=len(df))
df['dt2'] = pd.date_range('1980-01-01', freq='999S', periods=len(df))
df['f1'] = np.random.rand(len(df))
df['f2'] = np.random.rand(len(df))
# generate 10 string columns
for i in range(1, 11):
    df['s{}'.format(i)] = pd.util.testing.rands_array(10, len(df))
# repeat the frame 1,000 times (-> 1M rows) and shuffle
df = pd.concat([df] * 10**3, ignore_index=True).sample(frac=1)
df = df.set_index(df.pop('date').sort_values())
We have generated the following DF:
In [59]: df
Out[59]:
i1 i2 dt2 f1 ... s7 s8 s9 s10
date ...
2000-01-01 00:00:00 216625 4179 1980-01-04 04:35:24 0.679989 ... 7G8rLnoocA E7Ot7oPsJ6 puQamLn0I2 zxHrATQn0m
2000-01-01 00:00:00 374740 967991 1980-01-09 11:07:48 0.202064 ... wLETO2g8uL MhtzNLPXCH PW1uKxY0df wTakdCe6nK
2000-01-01 00:00:00 152181 627451 1980-01-10 11:49:39 0.956117 ... mXOsfUPqOy 6IIst7UFDT nL6XZxrT3r BxPCFNdZTK
2000-01-01 00:00:00 915732 730737 1980-01-06 10:25:30 0.854145 ... Crh94m085p M1tbrorxGT XWSKk3b8Pv M9FWQtPzaa
2000-01-01 00:00:00 590262 248378 1980-01-06 11:48:45 0.307373 ... wRnMPxeopd JF24uTUwJC 2CRrs9yB2N hxYrXFnT1H
2000-01-01 00:00:00 161183 620876 1980-01-08 21:48:36 0.207536 ... cyN0AExPO2 POaldI6Y0l TDc13rPdT0 xgoDOW8Y1L
2000-01-01 00:00:00 589696 784856 1980-01-12 02:07:21 0.909340 ... GIRAAVBRpj xwcnpwFohz wqcoTMjQ4S GTcIWXElo7
... ... ... ... ... ... ... ... ... ...
2000-01-01 00:16:39 773606 205714 1980-01-12 07:40:21 0.895944 ... HEkXfD7pku 1ogy12wBom OT3KmQRFGz Dp1cK5R4Gq
2000-01-01 00:16:39 915732 730737 1980-01-06 10:25:30 0.854145 ... Crh94m085p M1tbrorxGT XWSKk3b8Pv M9FWQtPzaa
2000-01-01 00:16:39 990722 567886 1980-01-03 05:50:06 0.676511 ... gVO3g0I97R yCqOhTVeEi imCCeQa0WG 9tslOJGWDJ
2000-01-01 00:16:39 531778 438944 1980-01-04 20:07:48 0.190714 ... rbLmkbnO5G ATm3BpWLC0 moLkyY2Msc 7A2UJERrBG
2000-01-01 00:16:39 880791 245911 1980-01-02 15:57:36 0.014967 ... bZuKNBvrEF K84u9HyAmG 4yy2bsUVNn WZQ5Vvl9zD
2000-01-01 00:16:39 239866 425516 1980-01-10 05:26:42 0.667183 ... 6xukg6TVah VEUz4d92B8 zHDxty6U3d ItztnI5LmJ
2000-01-01 00:16:39 338368 804695 1980-01-12 05:27:09 0.084818 ... NM4fdjKBuW LXGUbLIuw9 SHdpnttX6q 4oXKMsaOJ5
[1000000 rows x 15 columns]
In [60]: df.shape
Out[60]: (1000000, 15)
In [61]: df.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1000000 entries, 2000-01-01 00:00:00 to 2000-01-01 00:16:39
Data columns (total 15 columns):
i1 1000000 non-null int32
i2 1000000 non-null int32
dt2 1000000 non-null datetime64[ns]
f1 1000000 non-null float64
f2 1000000 non-null float64
s1 1000000 non-null object
s2 1000000 non-null object
s3 1000000 non-null object
s4 1000000 non-null object
s5 1000000 non-null object
s6 1000000 non-null object
s7 1000000 non-null object
s8 1000000 non-null object
s9 1000000 non-null object
s10 1000000 non-null object
dtypes: datetime64[ns](1), float64(2), int32(2), object(10)
memory usage: 114.4+ MB
Let's write it to disk in different formats (CSV, HDF5 fixed, HDF5 table, Feather):
# CSV
df.to_csv('c:/tmp/test.csv')
# HDF5 table format
df.to_hdf('c:/tmp/test.h5', 'test', format='t')
# HDF5 fixed format
df.to_hdf('c:/tmp/test_fix.h5', 'test')
# Feather format
import feather
feather.write_dataframe(df, 'c:/tmp/test.feather')
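As a side note, recent pandas versions also expose the Feather format directly through DataFrame.to_feather and pd.read_feather (backed by pyarrow). The snippet below is only a sketch of that alternative, assuming pyarrow is installed, and is not part of the original benchmark; to_feather requires a default RangeIndex, so the date index is moved back into a column first:
# pandas' built-in Feather wrapper (needs pyarrow); path mirrors the ones above
df.reset_index().to_feather('c:/tmp/test_pd.feather')
df2 = pd.read_feather('c:/tmp/test_pd.feather').set_index('date')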
Timing:
Now we can measure reading from disk:
In [54]: # CSV
...: %timeit pd.read_csv('c:/tmp/test.csv', parse_dates=['date', 'dt2'], index_col=0)
1 loop, best of 3: 12.3 s per loop # 3rd place
In [55]: # HDF5 fixed format
...: %timeit pd.read_hdf('c:/tmp/test_fix.h5', 'test')
1 loop, best of 3: 1.85 s per loop # 1st place
In [56]: # HDF5 table format
...: %timeit pd.read_hdf('c:/tmp/test.h5', 'test')
1 loop, best of 3: 24.2 s per loop # 4th place
In [57]: # Feather
...: %timeit feather.read_dataframe('c:/tmp/test.feather')
1 loop, best of 3: 3.21 s per loop # 2nd place
If you don't always need to read all the data, then it would make sense to store your data in HDF5 table format (and make use of the data_columns parameter in order to index the columns that will be used for filtering).
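For illustration, here is a minimal sketch of that idea; the chosen data_columns, the file name, and the filter condition are arbitrary assumptions, not part of the original answer:
# index only the columns we expect to filter on
df.to_hdf('c:/tmp/test_dc.h5', 'test', format='table', data_columns=['i1', 'dt2'])

# later: read back only the matching rows (and only the columns we need)
subset = pd.read_hdf('c:/tmp/test_dc.h5', 'test',
                     where="i1 > 500000 & dt2 > '1980-01-06'",
                     columns=['i1', 'dt2', 'f1'])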
Answered by Ryan Oz
A common approach I usually take when handling large datasets (~4-10 million rows, 15-30 columns) with pandas operations is to save the dataframes into .pkl files for future operations. They do take up more space (sometimes as high as 2x) in terms of file size, but reduce my load times into Jupyter Notebook from 10-50 seconds with csv, to about 1-5 seconds with pkl.
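For completeness, a minimal sketch of the one-off conversion step implied here (the file names match the ones used in the timings below):
import pandas as pd

# pay the CSV parsing cost once, then reload from the pickle afterwards
df = pd.read_csv('./testFile.csv')
df.to_pickle('./testFile.pkl')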
In [1]: %%time
dfTest = pd.read_pickle('./testFile.pkl')
print(dfTest.shape)
Out[1]: (10820089, 23)
Wall time: 1.89 s
In [2]: %%time
dfTest = pd.read_csv('./testFile.csv')
print(dfTest.shape)
Out[2]: (10820089, 23)
Wall time: 18.9 s
See the size differences of the test files used in this test here.
Extra tip: After I'm done performing operations on the dataset, I usually just output the dataframe back into a csv for smaller archiving of my projects.
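A rough sketch of that round trip, with a hypothetical file name; the gzip compression shown on the second line is an optional extra not mentioned in the answer:
# write the finished dataframe back out as a CSV archive
df.to_csv('./archive.csv')
# optionally let pandas compress it on the fly (smaller file, still a CSV inside)
df.to_csv('./archive.csv.gz', compression='gzip')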
Answered by jfbeltran
Depending on what you're using the data for, you might benefit from the chunksize argument in read_csv, which you can find in the docs.
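A minimal sketch of chunked reading, using the OP's file name; the chunk size and the per-chunk work are placeholder assumptions:
# process the file in 1-million-row pieces instead of loading it all at once
row_count = 0
for chunk in pd.read_csv('data.csv', parse_dates=['date'], chunksize=10**6):
    row_count += len(chunk)   # replace with whatever per-chunk processing you need
print(row_count)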
If you only need to do analysis on the data (you only need to load the data once) and you aren't already using an IDE like jupyter, definitely try it out! You can load the data once and keep it in memory as you try different operations.
I think you can find any other advice you might need in this thread.