How to speed up pandas row filtering by string matching?
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverflow.
Original question: http://stackoverflow.com/questions/16384332/
Asked by bigbug
I often need to filter a pandas dataframe df via df[df['col_name']=='string_value'], and I want to speed up the row selection operation. Is there a quick way to do that?
For example,
In [1]: df = mul_df(3000,2000,3).reset_index()
In [2]: timeit df[df['STK_ID']=='A0003']
1 loops, best of 3: 1.52 s per loop
Can the 1.52 s be shortened?
Note:
mul_df() is a function that creates a multilevel dataframe:
>>> mul_df(4,2,3)
                  COL000  COL001  COL002
STK_ID RPT_Date
A0000  B000       0.6399  0.0062  1.0022
       B001      -0.2881 -2.0604  1.2481
A0001  B000       0.7070 -0.9539 -0.5268
       B001       0.8860 -0.5367 -2.4492
A0002  B000      -2.4738  0.9529 -0.9789
       B001       0.1392 -1.0931 -0.2077
A0003  B000      -1.1377  0.5455 -0.2290
       B001       1.0083  0.2746 -0.3934
Below is the code of mul_df():
import itertools
import numpy as np
import pandas as pd

def mul_df(level1_rownum, level2_rownum, col_num, data_ty='float32'):
    '''create a multilevel dataframe, for example: mul_df(4,2,6)'''
    index_name = ['STK_ID', 'RPT_Date']
    col_name = ['COL' + str(x).zfill(3) for x in range(col_num)]
    # first level: repeat each 'Axxxx' label once per second-level row
    first_level_dt = [['A' + str(x).zfill(4)] * level2_rownum for x in range(level1_rownum)]
    first_level_dt = list(itertools.chain(*first_level_dt))  # flatten the nested list
    # second level: cycle 'B000'..'B00n' under every first-level label
    second_level_dt = ['B' + str(x).zfill(3) for x in range(level2_rownum)] * level1_rownum
    dt = pd.DataFrame(np.random.randn(level1_rownum * level2_rownum, col_num),
                      columns=col_name, dtype=data_ty)
    dt[index_name[0]] = first_level_dt
    dt[index_name[1]] = second_level_dt
    rst = dt.set_index(index_name, drop=True, inplace=False)
    return rst
Answered by Wes McKinney
I have long wanted to add binary search indexes to DataFrame objects. You can take the DIY approach of sorting by the column and doing this yourself:
In [11]: df = df.sort('STK_ID') # skip this if you're sure it's sorted
In [12]: df['STK_ID'].searchsorted('A0003', 'left')
Out[12]: 6000
In [13]: df['STK_ID'].searchsorted('A0003', 'right')
Out[13]: 8000
In [14]: timeit df[6000:8000]
10000 loops, best of 3: 134 μs per loop
This is fast because it always retrieves views and does not copy any data.
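On a current pandas release, a minimal self-contained sketch of the same DIY approach might look like the following; note that DataFrame.sort has since been renamed sort_values, and the frame below is only an illustrative stand-in for the question's mul_df(3000,2000,3).reset_index():

import numpy as np
import pandas as pd

# illustrative stand-in for the question's data
df = pd.DataFrame({
    'STK_ID': np.repeat(['A' + str(x).zfill(4) for x in range(10)], 2000),
    'COL000': np.random.randn(20000).astype('float32'),
})

df = df.sort_values('STK_ID')  # skip this if you're sure it's already sorted
left = df['STK_ID'].searchsorted('A0003', side='left')    # position of the first match
right = df['STK_ID'].searchsorted('A0003', side='right')  # one past the last match
rows = df.iloc[left:right]     # positional slice: no full boolean scan of the column
print(len(rows))               # 2000

Locating the boundaries is O(log n) on the sorted column, versus the O(n) scan that the boolean mask performs on every query.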
Answered by DSM
Somewhat surprisingly, working with the .values array instead of the Series is much faster for me:
>>> time df = mul_df(3000, 2000, 3).reset_index()
CPU times: user 5.96 s, sys: 0.81 s, total: 6.78 s
Wall time: 6.78 s
>>> timeit df[df["STK_ID"] == "A0003"]
1 loops, best of 3: 841 ms per loop
>>> timeit df[df["STK_ID"].values == "A0003"]
1 loops, best of 3: 210 ms per loop
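To reproduce the comparison outside IPython's %timeit magic, a rough sketch using the standard timeit module (the sizes and labels here are illustrative, not the question's exact data):

import timeit
import numpy as np
import pandas as pd

labels = ['A' + str(x).zfill(4) for x in range(3000)]
df = pd.DataFrame({'STK_ID': np.random.choice(labels, 1000000)})

# comparing on the Series allocates an intermediate boolean Series with an index;
# comparing on .values is a plain NumPy broadcast against the raw array
t_series = timeit.timeit(lambda: df[df['STK_ID'] == 'A0003'], number=10)
t_values = timeit.timeit(lambda: df[df['STK_ID'].values == 'A0003'], number=10)
print(t_series, t_values)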
Answered by joris
Depending on what you want to do with the selection afterwards, and if you have to make multiple selections of this kind, the groupby functionality can also make things faster (at least with the example).
Even if you only have to select the rows for one string_value, it is a little bit faster (but not much):
In [11]: %timeit df[df['STK_ID']=='A0003']
1 loops, best of 3: 626 ms per loop
In [12]: %timeit df.groupby("STK_ID").get_group("A0003")
1 loops, best of 3: 459 ms per loop
But subsequent calls to the GroupBy object will be very fast (e.g. to select the rows of other string_values):
In [25]: grouped = df.groupby("STK_ID")
In [26]: %timeit grouped.get_group("A0003")
1 loops, best of 3: 333 μs per loop
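A minimal sketch of that reuse pattern (the small frame is an illustrative stand-in for the question's data):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'STK_ID': np.repeat(['A' + str(x).zfill(4) for x in range(5)], 4),
    'COL000': np.random.randn(20),
})

grouped = df.groupby('STK_ID')    # the grouping cost is paid once
a3 = grouped.get_group('A0003')   # the first call materializes the label -> row map
a1 = grouped.get_group('A0001')   # later calls reuse it, hence the microsecond timings
print(a3)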

