Python 提高 Pandas 合并性能

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/40860457/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-08-20 00:06:35  来源:igfitidea点击:

Improve Pandas Merge performance

pythonpandasmergecythonnumba

提问by Debasish Kanhar

I specifically dont have performace issue with Pands Merge, as other posts suggest, but I've a class in which there are lot of methods, which does a lot of merge on datasets.

正如其他帖子所建议的那样,我特别没有 Pands Merge 的性能问题,但我有一个类,其中有很多方法,可以对数据集进行大量合并。

The class has around 10 group by and around 15 merge. While groupby is pretty fast, out of total execution time of 1.5 seconds for class, around 0.7 seconds goes in those 15 merge calls.

班级有大约 10 个小组和大约 15 个合并。虽然 groupby 非常快,但在 1.5 秒的总执行时间中,这 15 次合并调用大约需要 0.7 秒。

I want to speed up performace in those merge calls. As I will have around 4000 iterations, hence saving .5 seconds overall in single iteration will lead to overall performance reduction by around 30min, which will be great.

我想加快这些合并调用的性能。由于我将有大约 4000 次迭代,因此在单次迭代中总共节省 0.5 秒将导致整体性能降低约 30 分钟,这将是很棒的。

Any suggestions I should try? I tried: Cython Numba, and Numba was slower.

我应该尝试的任何建议?我试过:Cython Numba,而 Numba 更慢。

Thanks

谢谢

Edit 1: Adding sample code snippets: My merge statements:

编辑 1:添加示例代码片段:我的合并语句:

tmpDf = pd.merge(self.data, t1, on='APPT_NBR', how='left')
tmp = tmpDf

tmpDf = pd.merge(tmp, t2, on='APPT_NBR', how='left')
tmp = tmpDf

tmpDf = pd.merge(tmp, t3, on='APPT_NBR', how='left')
tmp = tmpDf

tmpDf = pd.merge(tmp, t4, on='APPT_NBR', how='left')
tmp = tmpDf

tmpDf = pd.merge(tmp, t5, on='APPT_NBR', how='left')

And, by implementing Joins, I incorporate the following satatements:

而且,通过实施 Joins,我合并了以下声明:

dat = self.data.set_index('APPT_NBR')

t1.set_index('APPT_NBR', inplace=True)
t2.set_index('APPT_NBR', inplace=True)
t3.set_index('APPT_NBR', inplace=True)
t4.set_index('APPT_NBR', inplace=True)
t5.set_index('APPT_NBR', inplace=True)

tmpDf = dat.join(t1, how='left')
tmpDf = tmpDf.join(t2, how='left')
tmpDf = tmpDf.join(t3, how='left')
tmpDf = tmpDf.join(t4, how='left')
tmpDf = tmpDf.join(t5, how='left')

tmpDf.reset_index(inplace=True)

Note, all are part of a function named: def merge_earlier_created_values(self):

请注意,所有这些都是名为def merge_earlier_created_values(self)的函数的一部分

And, when I did timedcall from profilehooks by following:

而且,当我通过以下方式从 profilehooks 进行 timedcall 时:

@timedcall(immediate=True)
def merge_earlier_created_values(self):

I get following results:

我得到以下结果:

The result of profiling of that method gives:

该方法的分析结果给出:

@profile(immediate=True)
def merge_earlier_created_values(self):

The profiling of function, by using Merge is as follows:

使用 Merge 对函数的分析如下:

*** PROFILER RESULTS ***
merge_earlier_created_values (E:\Projects\Predictive Inbound Cartoon     Estimation-MLO\Python\CodeToSubmit\helpers\get_prev_data_by_date.py:122)
function called 1 times

     71665 function calls (70588 primitive calls) in 0.524 seconds

Ordered by: cumulative time, internal time, call count
List reduced from 563 to 40 due to restriction <40>

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    1    0.012    0.012    0.524    0.524 get_prev_data_by_date.py:122(merge_earlier_created_values)
   14    0.000    0.000    0.285    0.020 generic.py:1901(_update_inplace)
   14    0.000    0.000    0.285    0.020 generic.py:1402(_maybe_update_cacher)
   19    0.000    0.000    0.284    0.015 generic.py:1492(_check_setitem_copy)
    7    0.283    0.040    0.283    0.040 {built-in method gc.collect}
   15    0.000    0.000    0.181    0.012 generic.py:1842(drop)
   10    0.000    0.000    0.153    0.015 merge.py:26(merge)
   10    0.000    0.000    0.140    0.014 merge.py:201(get_result)
  8/4    0.000    0.000    0.126    0.031 decorators.py:65(wrapper)
    4    0.000    0.000    0.126    0.031 frame.py:3028(drop_duplicates)
    1    0.000    0.000    0.102    0.102 get_prev_data_by_date.py:264(recreate_previous_cartons)
    1    0.000    0.000    0.101    0.101 get_prev_data_by_date.py:231(recreate_previous_appt_scheduled_date)
    1    0.000    0.000    0.098    0.098 get_prev_data_by_date.py:360(recreate_previous_freight_type)
   10    0.000    0.000    0.092    0.009 internals.py:4455(concatenate_block_managers)
   10    0.001    0.000    0.088    0.009 internals.py:4471(<listcomp>)
  120    0.001    0.000    0.084    0.001 internals.py:4559(concatenate_join_units)
  266    0.004    0.000    0.067    0.000 common.py:733(take_nd)
  120    0.000    0.000    0.061    0.001 internals.py:4569(<listcomp>)
  120    0.003    0.000    0.061    0.001 internals.py:4814(get_reindexed_values)
    1    0.000    0.000    0.059    0.059 get_prev_data_by_date.py:295(recreate_previous_appt_status)
   10    0.000    0.000    0.038    0.004 merge.py:322(_get_join_info)
   10    0.001    0.000    0.036    0.004 merge.py:516(_get_join_indexers)
   25    0.001    0.000    0.024    0.001 merge.py:687(_factorize_keys)
   74    0.023    0.000    0.023    0.000 {pandas.algos.take_2d_axis1_object_object}
   50    0.022    0.000    0.022    0.000 {method 'factorize' of 'pandas.hashtable.Int64Factorizer' objects}
  120    0.003    0.000    0.022    0.000 internals.py:4479(get_empty_dtype_and_na)
   88    0.000    0.000    0.021    0.000 frame.py:1969(__getitem__)
    1    0.000    0.000    0.019    0.019 get_prev_data_by_date.py:328(recreate_previous_location_numbers)
   39    0.000    0.000    0.018    0.000 internals.py:3495(reindex_indexer)
  537    0.017    0.000    0.017    0.000 {built-in method numpy.core.multiarray.empty}
   15    0.000    0.000    0.017    0.001 ops.py:725(wrapper)
   15    0.000    0.000    0.015    0.001 frame.py:2011(_getitem_array)
   24    0.000    0.000    0.014    0.001 internals.py:3625(take)
   10    0.000    0.000    0.014    0.001 merge.py:157(__init__)
   10    0.000    0.000    0.014    0.001 merge.py:382(_get_merge_keys)
   15    0.008    0.001    0.013    0.001 ops.py:662(na_op)
  234    0.000    0.000    0.013    0.000 common.py:158(isnull)
  234    0.001    0.000    0.013    0.000 common.py:179(_isnull_new)
   15    0.000    0.000    0.012    0.001 generic.py:1609(take)
   20    0.000    0.000    0.012    0.001 generic.py:2191(reindex)

The profiling by using Joins is as follows:

使用 Joins 的分析如下:

65079 function calls (63990 primitive calls) in 0.550 seconds

Ordered by: cumulative time, internal time, call count
List reduced from 592 to 40 due to restriction <40>

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    1    0.016    0.016    0.550    0.550 get_prev_data_by_date.py:122(merge_earlier_created_values)
   14    0.000    0.000    0.295    0.021 generic.py:1901(_update_inplace)
   14    0.000    0.000    0.295    0.021 generic.py:1402(_maybe_update_cacher)
   19    0.000    0.000    0.294    0.015 generic.py:1492(_check_setitem_copy)
    7    0.293    0.042    0.293    0.042 {built-in method gc.collect}
   10    0.000    0.000    0.173    0.017 generic.py:1842(drop)
   10    0.000    0.000    0.139    0.014 merge.py:26(merge)
  8/4    0.000    0.000    0.138    0.034 decorators.py:65(wrapper)
    4    0.000    0.000    0.138    0.034 frame.py:3028(drop_duplicates)
   10    0.000    0.000    0.132    0.013 merge.py:201(get_result)
    5    0.000    0.000    0.122    0.024 frame.py:4324(join)
    5    0.000    0.000    0.122    0.024 frame.py:4371(_join_compat)
    1    0.000    0.000    0.111    0.111 get_prev_data_by_date.py:264(recreate_previous_cartons)
    1    0.000    0.000    0.103    0.103 get_prev_data_by_date.py:231(recreate_previous_appt_scheduled_date)
    1    0.000    0.000    0.099    0.099 get_prev_data_by_date.py:360(recreate_previous_freight_type)
   10    0.000    0.000    0.093    0.009 internals.py:4455(concatenate_block_managers)
   10    0.001    0.000    0.089    0.009 internals.py:4471(<listcomp>)
  100    0.001    0.000    0.085    0.001 internals.py:4559(concatenate_join_units)
  205    0.003    0.000    0.068    0.000 common.py:733(take_nd)
  100    0.000    0.000    0.060    0.001 internals.py:4569(<listcomp>)
  100    0.001    0.000    0.060    0.001 internals.py:4814(get_reindexed_values)
    1    0.000    0.000    0.056    0.056 get_prev_data_by_date.py:295(recreate_previous_appt_status)
   10    0.000    0.000    0.033    0.003 merge.py:322(_get_join_info)
   52    0.031    0.001    0.031    0.001 {pandas.algos.take_2d_axis1_object_object}
    5    0.000    0.000    0.030    0.006 base.py:2329(join)
   37    0.001    0.000    0.027    0.001 internals.py:2754(apply)
    6    0.000    0.000    0.024    0.004 frame.py:2763(set_index)
    7    0.000    0.000    0.023    0.003 merge.py:516(_get_join_indexers)
    2    0.000    0.000    0.022    0.011 base.py:2483(_join_non_unique)
    7    0.000    0.000    0.021    0.003 generic.py:2950(copy)
    7    0.000    0.000    0.021    0.003 internals.py:3046(copy)
   84    0.000    0.000    0.020    0.000 frame.py:1969(__getitem__)
   19    0.001    0.000    0.019    0.001 merge.py:687(_factorize_keys)
  100    0.002    0.000    0.019    0.000 internals.py:4479(get_empty_dtype_and_na)
    1    0.000    0.000    0.018    0.018 get_prev_data_by_date.py:328(recreate_previous_location_numbers)
   15    0.000    0.000    0.017    0.001 ops.py:725(wrapper)
   34    0.001    0.000    0.017    0.000 internals.py:3495(reindex_indexer)
   83    0.004    0.000    0.016    0.000 internals.py:3211(_consolidate_inplace)
   68    0.015    0.000    0.015    0.000 {method 'copy' of 'numpy.ndarray' objects}
   15    0.000    0.000    0.015    0.001 frame.py:2011(_getitem_array)

As you can see, the merge is faster than joins, though it is small value, but over 4000 iterations, that small value becomes a huge number, in minutes.

如您所见,merge 比 join 快,虽然它是小值,但是超过 4000 次迭代,这个小值在几分钟内变成了一个巨大的数字。

Thanks

谢谢

回答by Reisepass

set_index on merging column does indeed speed this up. Below is a slightly more realistic version of @julien-marrec Answer.

合并列上的 set_index 确实加快了速度。以下是@julien-marrec Answer 的稍微更现实的版本。

import pandas as pd
import numpy as np
myids=np.random.choice(np.arange(10000000), size=1000000, replace=False)
df1 = pd.DataFrame(myids, columns=['A'])
df1['B'] = np.random.randint(0,1000,(1000000))
df2 = pd.DataFrame(np.random.permutation(myids), columns=['A2'])
df2['B2'] = np.random.randint(0,1000,(1000000))

%%timeit
    x = df1.merge(df2, how='left', left_on='A', right_on='A2')   
#1 loop, best of 3: 664 ms per loop

%%timeit  
    x = df1.set_index('A').join(df2.set_index('A2'), how='left') 
#1 loop, best of 3: 354 ms per loop

%%time 
    df1.set_index('A', inplace=True)
    df2.set_index('A2', inplace=True)
#Wall time: 16 ms

%%timeit
    x = df1.join(df2, how='left')  
#10 loops, best of 3: 80.4 ms per loop

When the column to be joined has integers not in the same order on both tables you can still expect a great speed up of 8 times.

当要连接的列的整数在两个表上的顺序不同时,您仍然可以期望速度提高 8 倍。

回答by Julien Marrec

I suggest that you set your merge columns as index, and use df1.join(df2)instead of merge, it's much faster.

我建议您将合并列设置为索引,并使用df1.join(df2)而不是merge,它会快得多。

Here's some example including profiling:

这是一些示例,包括分析:

In [1]:
import pandas as pd
import numpy as np
df1 = pd.DataFrame(np.arange(1000000), columns=['A'])
df1['B'] = np.random.randint(0,1000,(1000000))
df2 = pd.DataFrame(np.arange(1000000), columns=['A2'])
df2['B2'] = np.random.randint(0,1000,(1000000))

Here's a regular left merge on A and A2:

这是 A 和 A2 上的常规左合并:

In [2]: %%timeit
        x = df1.merge(df2, how='left', left_on='A', right_on='A2')

1 loop, best of 3: 441 ms per loop

Here's the same, using join:

这是相同的,使用 join:

In [3]: %%timeit
        x = df1.set_index('A').join(df2.set_index('A2'), how='left')

1 loop, best of 3: 184 ms per loop

Now obviously if you can set the index before looping, the gain in terms of time will be much greater:

现在很明显,如果您可以在循环之前设置索引,则时间方面的收益会大得多:

# Do this before looping
In [4]: %%time
df1.set_index('A', inplace=True)
df2.set_index('A2', inplace=True)

CPU times: user 9.78 ms, sys: 9.31 ms, total: 19.1 ms
Wall time: 16.8 ms

Then in the loop, you'll get something that in this case is 30 times faster:

然后在循环中,你会得到在这种情况下快 30 倍的东西:

In [5]: %%timeit
        x = df1.join(df2, how='left')
100 loops, best of 3: 14.3 ms per loop