Notice: this page is adapted from a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/49936557/

pandas dataframe: loc vs query performance

Tags: python, performance, pandas, dataframe, indexing

Asked by Syntax_Error

I have 2 dataframes in python that I would like to query for data.

  • DF1: 4M records x 3 columns. The query function seems more efficient than the loc function.

  • DF2: 2K records x 6 columns. The loc function seems much more efficient than the query function.

Both queries return a single record. The simulation was done by running the same operation in a loop 10K times.
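
The comparison described above can be sketched roughly as follows. This is not the asker's actual code: the column name key, the helper names make_frame, with_loc, with_query and time_both, and the scaled-down sizes (400K rows instead of 4M, 20 repetitions instead of 10K) are all assumptions chosen so the script finishes quickly.

```python
import timeit

import numpy as np
import pandas as pd


def make_frame(n_rows, n_cols, seed=0):
    # Hypothetical data: a unique integer "key" column plus random float columns.
    rng = np.random.RandomState(seed)
    data = {"key": np.arange(n_rows)}
    for i in range(n_cols - 1):
        data["c{}".format(i)] = rng.rand(n_rows)
    return pd.DataFrame(data)


def with_loc(df, key):
    # Boolean-mask selection through .loc
    return df.loc[df["key"] == key]


def with_query(df, key):
    # The same selection expressed as a query string
    return df.query("key == {}".format(key))


def time_both(df, key, repeats):
    # Total seconds for `repeats` calls of each approach
    t_loc = timeit.timeit(lambda: with_loc(df, key), number=repeats)
    t_query = timeit.timeit(lambda: with_query(df, key), number=repeats)
    return t_loc, t_query


small = make_frame(2000, 6)    # same shape as DF2 in the question
large = make_frame(400000, 3)  # assumption: scaled down from DF1's 4M rows

for name, frame in [("small", small), ("large", large)]:
    key = len(frame) // 2  # matches exactly one row
    t_loc, t_query = time_both(frame, key, repeats=20)
    print("{}: loc {:.4f}s, query {:.4f}s".format(name, t_loc, t_query))
```

Both helpers return the same single-row frame, so any difference in the printed times comes from how the selection is evaluated rather than from the result.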

I am running Python 2.7 and pandas 0.16.0.

Any recommendations to improve the query speed?

Answered by jezrael

To improve performance, it is possible to use numexpr:

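import numexpr
import numpy as np
import pandas as pd

np.random.seed(125)
N = 40000000
df = pd.DataFrame({'A': np.random.randint(10, size=N)})

def ne(df):
    # Build the boolean mask with numexpr, then index the frame with it.
    x = df.A.values
    return df[numexpr.evaluate('(x > 5)')]

print(ne(df))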

In [138]: %timeit (ne(df))
1 loop, best of 3: 494 ms per loop

In [139]: %timeit df[df.A > 5]
1 loop, best of 3: 536 ms per loop

In [140]: %timeit df.query('A > 5')
1 loop, best of 3: 781 ms per loop

In [141]: %timeit df[df.eval('A > 5')]
1 loop, best of 3: 770 ms per loop
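
The timings above come from IPython's %timeit. Outside IPython, a rough equivalent can be sketched with the standard timeit module. The smaller N, the candidates list, and the results dict are assumptions made for illustration, and the numbers it prints will differ from those shown above. The sketch also checks that all four approaches select the same rows before timing them.

```python
import timeit

import numexpr
import numpy as np
import pandas as pd

np.random.seed(125)
N = 1000000  # assumption: much smaller than the answer's N to keep the run short
df = pd.DataFrame({'A': np.random.randint(10, size=N)})


def ne(df):
    # Mask computed by numexpr on the underlying NumPy array
    x = df.A.values
    return df[numexpr.evaluate('(x > 5)')]


def be(df):
    # Plain boolean indexing
    return df[df.A > 5]


def q(df):
    # DataFrame.query with an expression string
    return df.query('A > 5')


def ev(df):
    # Mask computed by DataFrame.eval
    return df[df.eval('A > 5')]


candidates = [('numexpr', ne), ('boolean mask', be), ('query', q), ('eval', ev)]
expected = be(df)
results = {}

for name, func in candidates:
    # All approaches should select exactly the same rows.
    assert func(df).equals(expected)
    # Best of 3 single runs, similar in spirit to %timeit's "best of 3".
    results[name] = min(timeit.repeat(lambda: func(df), number=1, repeat=3))
    print('{}: {:.1f} ms'.format(name, results[name] * 1000))
```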


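# Compare the four approaches across frame sizes with perfplot.
import numexpr
import numpy as np
import pandas as pd
import perfplot

np.random.seed(125)

def ne(x):
    # Compute the mask with numexpr, then index the frame (not the raw
    # array) so that every kernel returns a DataFrame.
    a = x.A.values
    return x[numexpr.evaluate('(a > 5)')]

def be(x):
    # Plain boolean indexing
    return x[x.A > 5]

def q(x):
    # DataFrame.query
    return x.query('A > 5')

def ev(x):
    # Mask from DataFrame.eval
    return x[x.eval('A > 5')]


def make_df(n):
    # One integer column 'A' with n random values in [0, 10)
    df = pd.DataFrame(np.random.randint(10, size=n), columns=['A'])
    return df


perfplot.show(
    setup=make_df,
    kernels=[ne, be, q, ev],
    n_range=[2**k for k in range(2, 25)],
    logx=True,
    logy=True,
    equality_check=False,  # skip perfplot's output-equality check
    xlabel='len(df)')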

[Graph: perfplot output showing the runtime of ne, be, q and ev against len(df), with logarithmic axes]