
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must follow the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/16384332/

Posted: 2020-09-13 20:48:24 | Source: igfitidea

How to speed up pandas row filtering by string matching?

Tags: python, filter, pandas

Asked by bigbug

I often need to filter a pandas DataFrame df by df[df['col_name']=='string_value'], and I want to speed up the row selection operation. Is there a quick way to do that?

For example,


In [1]: df = mul_df(3000,2000,3).reset_index()

In [2]: timeit df[df['STK_ID']=='A0003']
1 loops, best of 3: 1.52 s per loop

Can the 1.52 s be shortened?

Note:


mul_df() is a function that creates a multilevel dataframe:

>>> mul_df(4,2,3)
                 COL000  COL001  COL002
STK_ID RPT_Date                        
A0000  B000      0.6399  0.0062  1.0022
       B001     -0.2881 -2.0604  1.2481
A0001  B000      0.7070 -0.9539 -0.5268
       B001      0.8860 -0.5367 -2.4492
A0002  B000     -2.4738  0.9529 -0.9789
       B001      0.1392 -1.0931 -0.2077
A0003  B000     -1.1377  0.5455 -0.2290
       B001      1.0083  0.2746 -0.3934

Below is the code of mul_df():

import itertools
import numpy as np
import pandas as pd

def mul_df(level1_rownum, level2_rownum, col_num, data_ty='float32'):
    ''' create multilevel dataframe, for example: mul_df(4,2,6)'''

    index_name = ['STK_ID','RPT_Date']
    col_name = ['COL'+str(x).zfill(3) for x in range(col_num)]

    first_level_dt = [['A'+str(x).zfill(4)]*level2_rownum for x in range(level1_rownum)]
    first_level_dt = list(itertools.chain(*first_level_dt)) #flatten the list
    second_level_dt = ['B'+str(x).zfill(3) for x in range(level2_rownum)]*level1_rownum

    dt = pd.DataFrame(np.random.randn(level1_rownum*level2_rownum, col_num), columns=col_name, dtype = data_ty)
    dt[index_name[0]] = first_level_dt
    dt[index_name[1]] = second_level_dt

    rst = dt.set_index(index_name, drop=True, inplace=False)
    return rst

Answer by Wes McKinney

I have long wanted to add binary search indexes to DataFrame objects. You can take the DIY approach of sorting by the column and doing this yourself:

In [11]: df = df.sort_values('STK_ID') # sort() in older pandas; skip this if you're sure it's sorted

In [12]: df['STK_ID'].searchsorted('A0003', 'left')
Out[12]: 6000

In [13]: df['STK_ID'].searchsorted('A0003', 'right')
Out[13]: 8000

In [14]: timeit df[6000:8000]
10000 loops, best of 3: 134 μs per loop

This is fast because it always retrieves views and does not copy any data.
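The DIY approach above can be packaged into a small helper. This is only a sketch: the select_rows name and the toy DataFrame are illustrative, not part of the original answer.

```python
import pandas as pd

def select_rows(df, col, value):
    """Return the rows where df[col] == value, using binary search
    on a sorted column instead of a full boolean scan."""
    df = df.sort_values(col)                         # skip if already sorted
    left = df[col].searchsorted(value, side='left')
    right = df[col].searchsorted(value, side='right')
    return df.iloc[left:right]                       # contiguous slice, no mask

# Toy data standing in for mul_df(...).reset_index()
stocks = pd.DataFrame({'STK_ID': ['A0001', 'A0003', 'A0002', 'A0003'],
                       'COL000': [0.1, 0.2, 0.3, 0.4]})
print(select_rows(stocks, 'STK_ID', 'A0003'))
```

The speedup comes from replacing an O(n) comparison over every row with two O(log n) binary searches plus a positional slice.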

Answer by DSM

Somewhat surprisingly, working with the .values array instead of the Series is much faster for me:

>>> time df = mul_df(3000, 2000, 3).reset_index()
CPU times: user 5.96 s, sys: 0.81 s, total: 6.78 s
Wall time: 6.78 s
>>> timeit df[df["STK_ID"] == "A0003"]
1 loops, best of 3: 841 ms per loop
>>> timeit df[df["STK_ID"].values == "A0003"]
1 loops, best of 3: 210 ms per loop
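The pattern is simply to build the boolean mask on the underlying NumPy array and hand it back to the DataFrame. A minimal sketch with a toy frame (df2 here is illustrative data, not the benchmark frame):

```python
import pandas as pd

df2 = pd.DataFrame({'STK_ID': ['A0001', 'A0003', 'A0002', 'A0003'],
                    'COL000': [0.1, 0.2, 0.3, 0.4]})

# Comparing against the raw ndarray skips per-operation Series overhead;
# the resulting NumPy boolean array indexes the DataFrame as usual.
mask = df2['STK_ID'].values == 'A0003'
subset = df2[mask]
```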

Answer by joris

Depending on what you want to do with the selection afterwards, and if you have to make multiple selections of this kind, the groupby functionality can also make things faster (at least with this example).

Even if you only have to select the rows for one string_value, it is a little bit faster (but not by much):

In [11]: %timeit df[df['STK_ID']=='A0003']
1 loops, best of 3: 626 ms per loop

In [12]: %timeit df.groupby("STK_ID").get_group("A0003")
1 loops, best of 3: 459 ms per loop

But subsequent calls to the GroupBy object will be very fast (e.g. to select the rows of other string_values):

In [25]: grouped = df.groupby("STK_ID")

In [26]: %timeit grouped.get_group("A0003")
1 loops, best of 3: 333 μs per loop
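The one-time grouping cost amortizes quickly across repeated lookups. A sketch of the reuse pattern (df3 is illustrative toy data):

```python
import pandas as pd

df3 = pd.DataFrame({'STK_ID': ['A0001', 'A0002', 'A0003'] * 2,
                    'COL000': range(6)})

grouped = df3.groupby('STK_ID')      # pay the grouping cost once

# Each subsequent lookup reuses the precomputed group indices,
# so no full-column scan is repeated per key.
rows_a3 = grouped.get_group('A0003')
rows_a1 = grouped.get_group('A0001')
```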