Python: how to test whether a string contains one of the substrings in a list, in pandas?

Question

Asked by ari

Is there any function that would be the equivalent of a combination of df.isin() and df[col].str.contains()?

For example, say I have the series s = pd.Series(['cat','hat','dog','fog','pet']), and I want to find all places where s contains any of ['og', 'at']. I would want to get everything but 'pet'.

I have a solution, but it's rather inelegant:

searchfor = ['og', 'at']
found = [s.str.contains(x) for x in searchfor]
result = pd.DataFrame(found)
result.any()

Is there a better way to do this?

Answer 1

Accepted answer by Alex Riley

One option is to use the regex | character to match any of the substrings against the words in your Series s (still using str.contains).

You can construct the regex by joining the words in searchfor with |:

>>> searchfor = ['og', 'at']
>>> s[s.str.contains('|'.join(searchfor))]
0    cat
1    hat
2    dog
3    fog
dtype: object

As @AndyHayden noted in the comments, take care if your substrings contain special characters such as $ and ^ that you want to match literally. These characters have specific meanings in regular expressions and will affect the matching.

You can make your list of substrings safer by escaping non-alphanumeric characters with re.escape:

>>> import re
>>> matches = ['$money', 'x^y']
>>> safe_matches = [re.escape(m) for m in matches]
>>> safe_matches
['\\$money', 'x\\^y']

The strings in this new list will match each character literally when used with str.contains.
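
Putting the two steps together, here is a minimal sketch of escaping the substrings and then joining them into a single pattern. The sample Series and the names pattern, mask and result are made up for illustration and are not part of the original answer:

```python
import re

import pandas as pd

# Illustrative data: two entries contain regex metacharacters.
s = pd.Series(['$money', 'x^y', 'honey', 'xy'])
matches = ['$money', 'x^y']

# Escape each substring, then join with | to build one alternation pattern.
pattern = '|'.join(re.escape(m) for m in matches)

# Boolean mask: True where any of the literal substrings occurs.
mask = s.str.contains(pattern)
result = s[mask]
print(result)
```

Without the re.escape step, $ and ^ would be interpreted as anchors rather than literal characters, so the rows containing them would not be matched as intended.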

Answer 2

Answered by l'L'l

You can use str.contains alone with a regex pattern using OR (|):

s[s.str.contains('og|at')]

Or you could add the series to a DataFrame and then use str.contains:

df = pd.DataFrame(s)
df[s.str.contains('og|at')]

Output:

0 cat
1 hat
2 dog
3 fog
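
One caveat that may be worth checking in your own data: if the Series contains missing values, the mask returned by str.contains has missing entries too, and it cannot be used directly for boolean indexing. A small sketch, assuming a made-up Series with one missing entry, that uses the na parameter of str.contains to treat missing values as non-matches:

```python
import pandas as pd

# Illustrative data with a missing entry (not from the original question).
s = pd.Series(['cat', 'hat', None, 'fog', 'pet'])

# na=False fills missing results with False, so the mask is purely boolean.
mask = s.str.contains('og|at', na=False)
result = s[mask]
print(result)
```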

Answer 3

Answered by Grant Shannon

Here is a one-line lambda that also works:

df["TrueFalse"] = df['col1'].apply(lambda x: 1 if any(i in x for i in searchfor) else 0)

Input:

searchfor = ['og', 'at']

df = pd.DataFrame([('cat', 1000.0), ('hat', 2000000.0), ('dog', 1000.0), ('fog', 330000.0),('pet', 330000.0)], columns=['col1', 'col2'])

  col1       col2
0  cat     1000.0
1  hat  2000000.0
2  dog     1000.0
3  fog   330000.0
4  pet   330000.0

Apply Lambda:

df["TrueFalse"] = df['col1'].apply(lambda x: 1 if any(i in x for i in searchfor) else 0)

Output:

    col1    col2        TrueFalse
0   cat     1000.0      1
1   hat     2000000.0   1
2   dog     1000.0      1
3   fog     330000.0    1
4   pet     330000.0    0
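
For comparison, the same 0/1 column can probably be produced without a Python-level lambda by reusing the str.contains approach from the accepted answer. This is a sketch rather than part of the answer above; the names via_apply, via_contains and pattern are introduced here only for illustration:

```python
import re

import pandas as pd

searchfor = ['og', 'at']
df = pd.DataFrame(
    [('cat', 1000.0), ('hat', 2000000.0), ('dog', 1000.0),
     ('fog', 330000.0), ('pet', 330000.0)],
    columns=['col1', 'col2'],
)

# Row-wise version from the answer: plain substring test per element.
via_apply = df['col1'].apply(lambda x: 1 if any(i in x for i in searchfor) else 0)

# Alternative: one escaped regex pattern, then cast the boolean mask to int.
pattern = '|'.join(re.escape(i) for i in searchfor)
via_contains = df['col1'].str.contains(pattern).astype(int)

df['TrueFalse'] = via_contains
print(df)
```

Both versions give the same flags on this data. The lambda does a literal substring test, so it needs no escaping, while the regex version relies on re.escape to get the same literal behaviour.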
