Pandas read_csv speed up

Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must do so under the same CC BY-SA license and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/42749107/


Pandas read_csv speed up

python, database, pandas

Asked by MysterioProgrammer91

I am reading a large csv which has around 10 million rows and 20 different columns (with header names).


I have numeric values, 2 columns with dates, and some strings.


Currently it takes me around 1.5 minutes to load the data with something like this:


df = pd.read_csv('data.csv', index_col='date', parse_dates=['date'])

I want to ask: how can I make this significantly faster, while still ending up with the same dataframe after reading the data?


I tried using an HDF5 database, but it was just as slow.


A subset of the data I am trying to read (I chose 8 of the actual 20 columns and show 3 rows out of a couple million):


Date       Comp        Rating  Price  Estprice  Dividend?  Date_earnings  Returns
3/12/2017  Apple       Buy     100    114       Yes        4/4/2017       0.005646835
3/12/2017  Blackberry  Sell    120    97        No         4/25/2017      0.000775331
3/12/2017  Microsoft   Hold    140    100       Yes        5/28/2017      0.003028423

Thanks for the advice.


Accepted answer by MaxU

Let's test it!


Data generation:


import numpy as np
import pandas as pd

sz = 10**3

df = pd.DataFrame(np.random.randint(0, 10**6, (sz, 2)), columns=['i1','i2'])
df['date'] = pd.date_range('2000-01-01', freq='1S', periods=len(df))
df['dt2'] = pd.date_range('1980-01-01', freq='999S', periods=len(df))
df['f1'] = np.random.rand(len(df))
df['f2'] = np.random.rand(len(df))
# generate 10 random string columns (pd.util.testing ships with older pandas versions)
for i in range(1, 11):
    df['s{}'.format(i)] = pd.util.testing.rands_array(10, len(df))

# replicate to one million rows, shuffle, and use the sorted 'date' column as the index
df = pd.concat([df] * 10**3, ignore_index=True).sample(frac=1)
df = df.set_index(df.pop('date').sort_values())

We have generated the following DataFrame:


In [59]: df
Out[59]:
                         i1      i2                 dt2        f1     ...              s7          s8          s9         s10
date                                                                  ...
2000-01-01 00:00:00  216625    4179 1980-01-04 04:35:24  0.679989     ...      7G8rLnoocA  E7Ot7oPsJ6  puQamLn0I2  zxHrATQn0m
2000-01-01 00:00:00  374740  967991 1980-01-09 11:07:48  0.202064     ...      wLETO2g8uL  MhtzNLPXCH  PW1uKxY0df  wTakdCe6nK
2000-01-01 00:00:00  152181  627451 1980-01-10 11:49:39  0.956117     ...      mXOsfUPqOy  6IIst7UFDT  nL6XZxrT3r  BxPCFNdZTK
2000-01-01 00:00:00  915732  730737 1980-01-06 10:25:30  0.854145     ...      Crh94m085p  M1tbrorxGT  XWSKk3b8Pv  M9FWQtPzaa
2000-01-01 00:00:00  590262  248378 1980-01-06 11:48:45  0.307373     ...      wRnMPxeopd  JF24uTUwJC  2CRrs9yB2N  hxYrXFnT1H
2000-01-01 00:00:00  161183  620876 1980-01-08 21:48:36  0.207536     ...      cyN0AExPO2  POaldI6Y0l  TDc13rPdT0  xgoDOW8Y1L
2000-01-01 00:00:00  589696  784856 1980-01-12 02:07:21  0.909340     ...      GIRAAVBRpj  xwcnpwFohz  wqcoTMjQ4S  GTcIWXElo7
...                     ...     ...                 ...       ...     ...             ...         ...         ...         ...
2000-01-01 00:16:39  773606  205714 1980-01-12 07:40:21  0.895944     ...      HEkXfD7pku  1ogy12wBom  OT3KmQRFGz  Dp1cK5R4Gq
2000-01-01 00:16:39  915732  730737 1980-01-06 10:25:30  0.854145     ...      Crh94m085p  M1tbrorxGT  XWSKk3b8Pv  M9FWQtPzaa
2000-01-01 00:16:39  990722  567886 1980-01-03 05:50:06  0.676511     ...      gVO3g0I97R  yCqOhTVeEi  imCCeQa0WG  9tslOJGWDJ
2000-01-01 00:16:39  531778  438944 1980-01-04 20:07:48  0.190714     ...      rbLmkbnO5G  ATm3BpWLC0  moLkyY2Msc  7A2UJERrBG
2000-01-01 00:16:39  880791  245911 1980-01-02 15:57:36  0.014967     ...      bZuKNBvrEF  K84u9HyAmG  4yy2bsUVNn  WZQ5Vvl9zD
2000-01-01 00:16:39  239866  425516 1980-01-10 05:26:42  0.667183     ...      6xukg6TVah  VEUz4d92B8  zHDxty6U3d  ItztnI5LmJ
2000-01-01 00:16:39  338368  804695 1980-01-12 05:27:09  0.084818     ...      NM4fdjKBuW  LXGUbLIuw9  SHdpnttX6q  4oXKMsaOJ5

[1000000 rows x 15 columns]

In [60]: df.shape
Out[60]: (1000000, 15)

In [61]: df.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1000000 entries, 2000-01-01 00:00:00 to 2000-01-01 00:16:39
Data columns (total 15 columns):
i1     1000000 non-null int32
i2     1000000 non-null int32
dt2    1000000 non-null datetime64[ns]
f1     1000000 non-null float64
f2     1000000 non-null float64
s1     1000000 non-null object
s2     1000000 non-null object
s3     1000000 non-null object
s4     1000000 non-null object
s5     1000000 non-null object
s6     1000000 non-null object
s7     1000000 non-null object
s8     1000000 non-null object
s9     1000000 non-null object
s10    1000000 non-null object
dtypes: datetime64[ns](1), float64(2), int32(2), object(10)
memory usage: 114.4+ MB


Let's write it to disk in different formats: (CSV, HDF5 fixed, HDF5 table, Feather):


# CSV
df.to_csv('c:/tmp/test.csv')
# HDF5 table format
df.to_hdf('c:/tmp/test.h5', 'test', format='t')
#  HDF5 fixed format
df.to_hdf('c:/tmp/test_fix.h5', 'test')
# Feather format
import feather
feather.write_dataframe(df, 'c:/tmp/test.feather')
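As a side note, newer pandas versions expose Feather directly through to_feather / read_feather (backed by pyarrow), so the separate feather package is no longer needed. A minimal sketch (the file path is illustrative), keeping in mind that Feather does not serialize a non-default index, so the DatetimeIndex has to be reset before writing:

# Feather via pandas itself (requires pyarrow); reset the index first,
# since Feather cannot store a non-default index
df.reset_index().to_feather('c:/tmp/test_pd.feather')
df2 = pd.read_feather('c:/tmp/test_pd.feather').set_index('date')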

Timing:


Now we can measure reading from disk:


In [54]: # CSV
    ...: %timeit pd.read_csv('c:/tmp/test.csv', parse_dates=['date', 'dt2'], index_col=0)
1 loop, best of 3: 12.3 s per loop   # 3rd place

In [55]: # HDF5 fixed format
    ...: %timeit pd.read_hdf('c:/tmp/test_fix.h5', 'test')
1 loop, best of 3: 1.85 s per loop   # 1st place

In [56]: # HDF5 table format
    ...: %timeit pd.read_hdf('c:/tmp/test.h5', 'test')
1 loop, best of 3: 24.2 s per loop   # 4th place

In [57]: # Feather
    ...: %timeit feather.read_dataframe('c:/tmp/test.feather')
1 loop, best of 3: 3.21 s per loop   # 2nd place

If you don't always need to read all the data, then it would make sense to store your data in HDF5 table format (and make use of the data_columns parameter to index the columns that will be used for filtering).

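A minimal sketch of that idea, using the generated frame above (the file name and the choice of data_columns are illustrative):

# store in table format and index 'i1' and 'dt2' so they can be used in 'where' filters
df.to_hdf('c:/tmp/test_dc.h5', key='test', format='table', data_columns=['i1', 'dt2'])

# read back only the matching rows instead of the whole file
subset = pd.read_hdf('c:/tmp/test_dc.h5', key='test',
                     where="i1 > 500000 & dt2 >= '1980-01-06'")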

Answer by Ryan Oz

An approach I usually take when handling large datasets (~4-10 million rows, 15-30 columns) with pandas operations is to save the dataframes into .pkl files for later operations. They do take up more space (sometimes as much as 2x) in terms of file size, but reduce my load times in Jupyter Notebook from 10-50 seconds with csv to about 1-5 seconds with pkl.


In [1]: %%time
        dfTest = pd.read_pickle('./testFile.pkl')
        print(dfTest.shape)
Out[1]: (10820089, 23)
        Wall time: 1.89 s

In [2]: %%time
        dfTest = pd.read_csv('./testFile.csv')
        print(dfTest.shape)
Out[2]: (10820089, 23)
        Wall time: 18.9 s

See the test file size differences used in this test here.


Extra tip: after I'm done performing operations on the dataset, I usually just write the dataframe back out to a csv, so the archived version of my project stays smaller.

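A minimal sketch of that workflow, reusing the file name from the question plus hypothetical pickle/output paths: keep a pickle around for fast reloads while working, then export a csv at the end for archiving:

# initial load from csv (slow, done once)
df = pd.read_csv('data.csv', index_col='date', parse_dates=['date'])

# save a pickle for fast reloads during the working session
df.to_pickle('data.pkl')
df = pd.read_pickle('data.pkl')

# once the work is done, write the result back to csv for archiving
df.to_csv('data_processed.csv')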

Answer by jfbeltran

Depending on what you're using the data for, you might benefit from the chunksize argument in read_csv, which you can find in the docs.

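A minimal sketch of chunked reading, reusing the file and the Rating column from the question's sample data (the chunk size and the filter are illustrative):

# iterate over the file in chunks instead of loading it all at once
chunks = pd.read_csv('data.csv', index_col='date', parse_dates=['date'],
                     chunksize=10**6)

# process each chunk and keep only what you need, then combine the results
filtered = pd.concat(chunk[chunk['Rating'] == 'Buy'] for chunk in chunks)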

If you only need to do analysis on the data (you only need to load the data once) and you aren't already using an IDE like jupyter, definitely try it out! You can load the data once and keep it in memory as you try different operations.


Any other advice you might need, I think you'll find in this thread.
