Pandas: filtering a Series for multiple substrings

Question

Asked by jpp

I need to filter rows in a pandas dataframe so that a specific string column contains at least one of a list of provided substrings. The substrings may contain unusual / regex characters. The comparison should not involve regex and should be case insensitive.

For example:

lst = ['kdSj;af-!?', 'aBC+dsfa?\-', 'sdKaJg|dksaf-*']

I currently apply the mask like this:

mask = np.logical_or.reduce([df[col].str.contains(i, regex=False, case=False) for i in lst])
df = df[mask]

My dataframe is large (~1 million rows) and lst has length 100. Is there a more efficient way? For example, if the first item in lst is found, we should not have to test any subsequent strings for that row.

Answer 1

Answered by Alex Riley

If you're sticking to pure pandas, I think you should use regex for this task, for both performance and practicality. However, you will first need to escape any special characters in the substrings so that they are matched literally (and not treated as regex metacharacters).

This is easy to do using re.escape:

>>> import re
>>> esc_lst = [re.escape(s) for s in lst]

These escaped substrings can then be joined using a regex pipe |. Each of the substrings can be checked against a string until one matches (or they have all been tested).

>>> pattern = '|'.join(esc_lst)
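
As a small illustration of why the escaping step matters, here is a minimal sketch using only the standard re module. The substrings are the ones from the question; the sample rows are made up for this example and are not from the original post. The same pattern string could then be passed to str.contains.

```python
import re

# Substrings from the question; several contain regex metacharacters (+ ? | * -).
lst = ['kdSj;af-!?', 'aBC+dsfa?\\-', 'sdKaJg|dksaf-*']

# Escape each substring so every character is matched literally, then join with '|'.
pattern = '|'.join(re.escape(s) for s in lst)
matcher = re.compile(pattern, flags=re.IGNORECASE)

# Hypothetical sample rows, invented for this illustration.
rows = [
    'xx KDSJ;AF-!? xx',   # contains the first substring in a different case
    'abc+dsfa?\\- tail',  # contains the second substring literally
    'sdKaJg',             # only half of the third substring (the part before '|')
    'aBCCdsf-',           # would match the second substring if it were read as a regex
]

mask = [matcher.search(r) is not None for r in rows]
print(mask)
```

Without re.escape, the last two rows would also be reported as matches, because | would split the third substring into two alternatives and + and ? would act as quantifiers.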

The masking stage then becomes a single low-level loop through the rows:

df[col].str.contains(pattern, case=False)

Here's a simple setup to get a sense of performance:

import re
from random import randint, seed

import pandas as pd

seed(321)

# 100 substrings of 5 characters
lst = [''.join([chr(randint(0, 256)) for _ in range(5)]) for _ in range(100)]

# 50000 strings of 20 characters
strings = [''.join([chr(randint(0, 256)) for _ in range(20)]) for _ in range(50000)]

col = pd.Series(strings)
esc_lst = [re.escape(s) for s in lst]
pattern = '|'.join(esc_lst)

The proposed method takes about 1 second (so maybe up to 20 seconds for 1 million rows):

%timeit col.str.contains(pattern, case=False)
1 loop, best of 3: 981 ms per loop

The method in the question took approximately 5 seconds using the same input data.

It's worth noting that these times are 'worst case' in the sense that there were no matches (so all substrings were checked). If there are matches, then the timing will improve.

Answer 2

Answered by unutbu

You could try using the Aho-Corasick algorithm. In the average case, it is O(n+m+p), where n is the length of the search strings, m is the length of the searched text, and p is the number of output matches.

The Aho-Corasick algorithm is often used to find multiple patterns (needles) in an input text (the haystack).

pyahocorasick is a Python wrapper around a C implementation of the algorithm.
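
To make the idea concrete, here is a minimal pure-Python sketch of an Aho-Corasick style automaton (a trie plus failure links). It is not pyahocorasick's implementation and will be far slower than the C version; the names build_automaton and contains_any are hypothetical, chosen for this illustration, and the sketch assumes all substrings are non-empty.

```python
from collections import deque


def build_automaton(words):
    """Build trie transitions, failure links and terminal flags for `words`."""
    goto = [{}]         # goto[state][char] -> next state
    fail = [0]          # fail[state] -> state for the longest proper suffix in the trie
    terminal = [False]  # terminal[state] -> a word ends here (directly or via a suffix)
    for word in words:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                terminal.append(False)
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        terminal[state] = True
    # Breadth-first pass: shallower states get their failure links first.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            terminal[nxt] = terminal[nxt] or terminal[fail[nxt]]
            queue.append(nxt)
    return goto, fail, terminal


def contains_any(text, automaton):
    """Return True as soon as any word is found in `text` (single pass over text)."""
    goto, fail, terminal = automaton
    state = 0
    for ch in text:
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        if terminal[state]:
            return True
    return False


# Substrings from the question; rows are made up for this example.
needles = ['kdSj;af-!?', 'aBC+dsfa?\\-', 'sdKaJg|dksaf-*']
automaton = build_automaton([w.lower() for w in needles])
rows = ['xx KDSJ;AF-!? xx', 'nothing here', 'prefix sdkajg|dksaf-* suffix']
mask = [contains_any(r.lower(), automaton) for r in rows]
print(mask)
```

Each row is scanned once regardless of how many substrings there are, and the scan stops at the first hit. Case is handled by lowercasing both sides, which mirrors what the benchmark code for this answer does.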

Let's compare how fast it is versus some alternatives. Below is a benchmark showing using_ahocorasick to be over 30x faster than the original method (shown in the question) on a 50K-row DataFrame test case:

|                    |     speed factor | ms per loop |
|                    | compared to orig |             |
|--------------------+------------------+-------------|
| using_ahocorasick  |            30.7x |         140 |
| using_regex        |             2.7x |        1580 |
| orig               |             1.0x |        4300 |

In [89]: %timeit using_ahocorasick(col, lst)
10 loops, best of 3: 140 ms per loop

In [88]: %timeit using_regex(col, lst)
1 loop, best of 3: 1.58 s per loop

In [91]: %timeit orig(col, lst)
1 loop, best of 3: 4.3 s per loop

Here is the setup used for the benchmark. It also verifies that the output matches the result returned by orig:

import numpy as np
import random
import pandas as pd
import ahocorasick
import re

random.seed(321)

def orig(col, lst):
    mask = np.logical_or.reduce([col.str.contains(i, regex=False, case=False) 
                                 for i in lst])
    return mask

def using_regex(col, lst):
    """https://stackoverflow.com/a/48590850/190597 (Alex Riley)"""
    esc_lst = [re.escape(s) for s in lst]
    pattern = '|'.join(esc_lst)
    mask = col.str.contains(pattern, case=False)
    return mask

def using_ahocorasick(col, lst):
    A = ahocorasick.Automaton(ahocorasick.STORE_INTS)
    for word in lst:
        A.add_word(word.lower())
    A.make_automaton() 
    col = col.str.lower()
    mask = col.apply(lambda x: bool(list(A.iter(x))))
    return mask

N = 50000
# 100 substrings of 5 characters
lst = [''.join([chr(random.randint(0, 256)) for _ in range(5)]) for _ in range(100)]

# N strings of 20 characters
strings = [''.join([chr(random.randint(0, 256)) for _ in range(20)]) for _ in range(N)]
# leave about 10% of the strings unchanged and append a string from lst to the rest; this helps check that our method works
strings = [_ if random.randint(0, 99) < 10 else _+random.choice(lst) for _ in strings]

col = pd.Series(strings)

expected = orig(col, lst)
for name, result in [('using_regex', using_regex(col, lst)),
                     ('using_ahocorasick', using_ahocorasick(col, lst))]:
    status = 'pass' if np.allclose(expected, result) else 'fail'
    print('{}: {}'.format(name, status))

Answer 3

Answered by pink.slash

Using a simpler example and ignoring case (upper or lower):

Filtering and getting a binary vector:

I want to find all elements of a pd.Series, v, that contain "at" or "Og", and get 1 if the element contains the pattern or 0 if it doesn't.

I'll use re:

import re
import pandas as pd

My vector:

v=pd.Series(['cAt','dog','the rat','mouse','froG'])

[Out]:

0        cAt
1        dog
2    the rat
3      mouse
4       froG

I want to find all elements of v that contain "at" or "Og". That is, I can define my pattern as:

pattern='at|Og'

I want a vector with 1 where the item contains the pattern and 0 where it doesn't.

I create a vector of ones with the same length as v:

v_binary=[1]*len(v)

I obtain a boolean Series s that is True where an element of v contains the pattern and False where it doesn't.

s=v.str.contains(pattern, flags=re.IGNORECASE, regex=True)

To obtain the binary vector, I multiply v_binary by s:

v_binary*s

[Out]

0    1
1    1
2    1
3    0
4    1
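
As an alternative sketch (not part of the answer above), the boolean result can be converted to 0/1 directly with astype(int), which avoids building the helper vector of ones. It uses the same data and pattern, with case=False in place of the re.IGNORECASE flag.

```python
import pandas as pd

v = pd.Series(['cAt', 'dog', 'the rat', 'mouse', 'froG'])

# Boolean mask: True where the element contains "at" or "Og", ignoring case.
s = v.str.contains('at|Og', case=False, regex=True)

# Cast True/False to 1/0.
v_binary = s.astype(int)
print(v_binary.tolist())
```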
