Pandas filtering for multiple substrings in series
Original question: http://stackoverflow.com/questions/48541444/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors on Stack Overflow.
Asked by jpp
I need to filter rows in a pandas dataframe so that a specific string column contains at least one of a list of provided substrings. The substrings may contain unusual / regex characters. The comparison should not involve regex and should be case insensitive.
For example:
lst = ['kdSj;af-!?', 'aBC+dsfa?\-', 'sdKaJg|dksaf-*']
I currently apply the mask like this:
mask = np.logical_or.reduce([df[col].str.contains(i, regex=False, case=False) for i in lst])
df = df[mask]
My dataframe is large (~1 million rows) and lst has length 100. Is there a more efficient way? For example, if the first item in lst is found, we should not have to test any subsequent strings for that row.
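The short-circuiting behaviour asked for here can be sketched in plain Python. This is only an illustration of the idea, not a method from the answers: the helper name contains_any and the sample data are hypothetical, and a per-row Python loop like this is not guaranteed to be faster than the vectorized mask.

```python
def contains_any(text, substrings_lower):
    """Return True if text contains any of the (already lowercased) substrings."""
    text = text.lower()
    # any() stops at the first substring that is found
    return any(sub in text for sub in substrings_lower)


# Hypothetical substrings containing regex metacharacters
lst = ['kdSj;af-!?', 'aBC+dsfa?', 'sdKaJg|dksaf-*']
lst_lower = [s.lower() for s in lst]

rows = ['xxKDSJ;AF-!?xx', 'nothing here', 'abc+DSFA?']
mask = [contains_any(r, lst_lower) for r in rows]
print(mask)  # [True, False, True]
```

Because the test uses the in operator, the substrings are compared literally and no escaping is needed.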
Answered by Alex Riley
If you're sticking to pure pandas, for both performance and practicality I think you should use regex for this task. However, you will need to properly escape any special characters in the substrings first to ensure that they are matched literally (and not used as regex metacharacters).
This is easy to do using re.escape:
>>> import re
>>> esc_lst = [re.escape(s) for s in lst]
These escaped substrings can then be joined using a regex pipe |. Each of the substrings can be checked against a string until one matches (or they have all been tested).
>>> pattern = '|'.join(esc_lst)
The masking stage then becomes a single low-level loop through the rows:
df[col].str.contains(pattern, case=False)
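As a small end-to-end check of the steps above, the sketch below builds the escaped pattern and compares the resulting mask with the approach from the question. The DataFrame, the column name 'text' and the two substrings are made up for illustration.

```python
import re

import numpy as np
import pandas as pd

# Hypothetical sample data; 'text' is an assumed column name
df = pd.DataFrame({'text': ['xx a+b? xx', 'A+B?', 'aab', 'x|y', 'nothing']})
lst = ['a+b?', 'X|Y']

# Escape each substring so that '+', '?' and '|' are matched literally
pattern = '|'.join(re.escape(s) for s in lst)

regex_mask = df['text'].str.contains(pattern, case=False)

# The approach from the question, for comparison
orig_mask = np.logical_or.reduce(
    [df['text'].str.contains(s, regex=False, case=False) for s in lst]
)

assert (regex_mask == orig_mask).all()
print(df[regex_mask])
```

Note that 'aab' is not selected: without re.escape, the pattern a+b? would be read as a regex and would match it.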
Here's a simple setup to get a sense of performance:
import re
import pandas as pd
from random import randint, seed
seed(321)
# 100 substrings of 5 characters
lst = [''.join([chr(randint(0, 256)) for _ in range(5)]) for _ in range(100)]
# 50000 strings of 20 characters
strings = [''.join([chr(randint(0, 256)) for _ in range(20)]) for _ in range(50000)]
col = pd.Series(strings)
esc_lst = [re.escape(s) for s in lst]
pattern = '|'.join(esc_lst)
The proposed method takes about 1 second (so maybe up to 20 seconds for 1 million rows):
%timeit col.str.contains(pattern, case=False)
1 loop, best of 3: 981 ms per loop
The method in the question took approximately 5 seconds using the same input data.
It's worth noting that these times are 'worst case' in the sense that there were no matches (so all substrings were checked). If there are matches, then the timing will improve.
Answered by unutbu
You could try using the Aho-Corasick algorithm. In the average case, it is O(n+m+p), where n is the length of the search strings, m is the length of the searched text, and p is the number of output matches.
The Aho-Corasick algorithm is often used to find multiple patterns (needles) in an input text (the haystack).
pyahocorasick is a Python wrapper around a C implementation of the algorithm.
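To make the idea concrete, here is a minimal pure-Python sketch of an Aho-Corasick automaton that only answers "does the text contain any needle?". It is not how pyahocorasick is implemented; the function names build_automaton and contains_any are hypothetical, and the needles are assumed to be non-empty strings.

```python
from collections import deque


def build_automaton(words):
    """Build a trie with failure links for the given non-empty words."""
    goto = [{}]         # goto[state][char] -> next state
    fail = [0]          # fail[state] -> fallback state on a mismatch
    terminal = [False]  # terminal[state] -> True if some word ends here

    # 1. Insert every word into the trie
    for word in words:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                terminal.append(False)
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        terminal[state] = True

    # 2. Breadth-first pass to compute the failure links
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # A state also counts as a match if its fallback state is one
            terminal[nxt] = terminal[nxt] or terminal[fail[nxt]]
    return goto, fail, terminal


def contains_any(text, automaton):
    """Scan text once; return True as soon as any word is found."""
    goto, fail, terminal = automaton
    state = 0
    for ch in text:
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        if terminal[state]:
            return True
    return False


# Hypothetical needles; lowercasing both sides gives a case-insensitive test
needles = ['a+b?', 'X|Y', 'abc']
automaton = build_automaton([w.lower() for w in needles])
print(contains_any('zzX|Yzz'.lower(), automaton))  # True
print(contains_any('aab', automaton))              # False
```

The text is scanned once, character by character, regardless of how many needles there are, which is where the speed-up over testing each substring separately comes from.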
Let's compare how fast it is versus some alternatives. Below is a benchmark showing using_ahocorasick to be over 30x faster than the original method (shown in the question) on a 50K-row DataFrame test case:
| method            | speed factor compared to orig | ms per loop |
|-------------------+-------------------------------+-------------|
| using_ahocorasick | 30.7x                         |         140 |
| using_regex       | 2.7x                          |        1580 |
| orig              | 1.0x                          |        4300 |
In [89]: %timeit using_ahocorasick(col, lst)
10 loops, best of 3: 140 ms per loop
In [88]: %timeit using_regex(col, lst)
1 loop, best of 3: 1.58 s per loop
In [91]: %timeit orig(col, lst)
1 loop, best of 3: 4.3 s per loop
Here is the setup used for the benchmark. It also verifies that the output matches the result returned by orig:
import numpy as np
import random
import pandas as pd
import ahocorasick
import re

random.seed(321)

def orig(col, lst):
    mask = np.logical_or.reduce([col.str.contains(i, regex=False, case=False)
                                 for i in lst])
    return mask

def using_regex(col, lst):
    """https://stackoverflow.com/a/48590850/190597 (Alex Riley)"""
    esc_lst = [re.escape(s) for s in lst]
    pattern = '|'.join(esc_lst)
    mask = col.str.contains(pattern, case=False)
    return mask

def using_ahocorasick(col, lst):
    A = ahocorasick.Automaton(ahocorasick.STORE_INTS)
    for word in lst:
        A.add_word(word.lower())
    A.make_automaton()
    col = col.str.lower()
    mask = col.apply(lambda x: bool(list(A.iter(x))))
    return mask

N = 50000
# 100 substrings of 5 characters
lst = [''.join([chr(random.randint(0, 256)) for _ in range(5)]) for _ in range(100)]

# N strings of 20 characters
strings = [''.join([chr(random.randint(0, 256)) for _ in range(20)]) for _ in range(N)]
# append a string from lst to about 90% of the strings (the other ~10% are left
# unchanged) so that there are matches; this helps check that our method works
strings = [_ if random.randint(0, 99) < 10 else _+random.choice(lst) for _ in strings]

col = pd.Series(strings)
expected = orig(col, lst)
for name, result in [('using_regex', using_regex(col, lst)),
                     ('using_ahocorasick', using_ahocorasick(col, lst))]:
    status = 'pass' if np.allclose(expected, result) else 'fail'
    print('{}: {}'.format(name, status))
Answered by pink.slash
Using a simpler example and ignoring case (upper or lowercase)
Filtering and getting a binary vector:
I want to find all elements of a pd.Series, v, that contain "at" or "Og", and get 1 if the element contains the pattern or 0 if it doesn't.
This uses the re module:
import re
My vector:
v=pd.Series(['cAt','dog','the rat','mouse','froG'])
[Out]:
0 cAt
1 dog
2 the rat
3 mouse
4 froG
I want to find all elements of v that contain "at" or "Og". That is, I can define my pattern as:
pattern='at|Og'
Since I want a vector with 1 where the item contains the pattern and 0 where it doesn't, I create a vector of ones with the same length as v:
v_binary=[1]*len(v)
I obtain a boolean Series s that is True if an element of v contains the pattern and False if it doesn't.
s=v.str.contains(pattern, flags=re.IGNORECASE, regex=True)
To obtain the binary vector, I multiply v_binary * s:
v_binary*s
[Out]
0 1
1 1
2 1
3 0
4 1
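As an alternative sketch (not part of the answer above), the boolean Series can be converted to 0/1 directly with astype(int), which avoids building the helper list of ones:

```python
import re

import pandas as pd

v = pd.Series(['cAt', 'dog', 'the rat', 'mouse', 'froG'])
pattern = 'at|Og'

# Boolean mask: True where an element contains 'at' or 'Og', ignoring case
s = v.str.contains(pattern, flags=re.IGNORECASE, regex=True)

# Convert True/False to 1/0
v_binary = s.astype(int)
print(v_binary.tolist())  # [1, 1, 1, 0, 1]
```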