Python: How to test if a string contains one of the substrings in a list, in pandas?

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/26577516/

Date: 2020-08-19 00:41:56  Source: igfitidea

How to test if a string contains one of the substrings in a list, in pandas?

Tags: python, string, pandas, dataframe, match

Asked by ari

Is there a function that is equivalent to a combination of df.isin() and df[col].str.contains()?

For example, say I have the series s = pd.Series(['cat','hat','dog','fog','pet']) and I want to find every element of s that contains any of ['og', 'at']. I would want to get everything but 'pet'.

I have a solution, but it's rather inelegant:

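searchfor = ['og', 'at']
found = [s.str.contains(x) for x in searchfor]  # one boolean Series per substring
result = pd.DataFrame(found)  # one row per substring, one column per element of s
result.any()  # True for each element of s matched by any substring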

Is there a better way to do this?

Accepted answer by Alex Riley

One option is to use the regex | character to try to match each of the substrings against the words in your Series s (still using str.contains).

You can construct the regex by joining the words in searchfor with |:

>>> searchfor = ['og', 'at']
>>> s[s.str.contains('|'.join(searchfor))]
0    cat
1    hat
2    dog
3    fog
dtype: object
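
For readers who want to run this end to end, here is a minimal self-contained sketch of the same approach. The names `pattern` and `mask` are introduced here for illustration only and do not appear in the original answer.

```python
import pandas as pd

s = pd.Series(['cat', 'hat', 'dog', 'fog', 'pet'])
searchfor = ['og', 'at']

# Join the substrings into a single alternation pattern: 'og|at'
pattern = '|'.join(searchfor)

# str.contains returns a boolean Series, True where the pattern is found
mask = s.str.contains(pattern)

# Boolean indexing keeps only the elements where the mask is True
result = s[mask]
print(result.tolist())
```

Keeping the mask in its own variable also makes it easy to invert the selection with `s[~mask]`.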

As @AndyHayden noted in the comments, take care if your substrings contain special characters such as $ and ^ that you want to match literally. These characters have specific meanings in regular expressions and will affect the matching.

You can make your list of substrings safer by escaping the special characters with re.escape:

>>> import re
>>> matches = ['$money', 'x^y']
>>> safe_matches = [re.escape(m) for m in matches]
>>> safe_matches
['\\$money', 'x\\^y']

The strings in this new list will match each character literally when used with str.contains.
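
To see the difference escaping makes, the following sketch compares the unescaped and escaped patterns on a small Series. The sample strings and the names `unescaped`, `safe_pattern`, and `escaped` are hypothetical and are not taken from the original answer.

```python
import re
import pandas as pd

# Hypothetical sample data, not from the original question
s = pd.Series(['I have $money', 'x^y = z', 'no match here'])
matches = ['$money', 'x^y']

# Unescaped: '$' and '^' are treated as regex anchors, not literal characters
unescaped = s.str.contains('|'.join(matches))

# Escaped: re.escape makes every special character match literally
safe_pattern = '|'.join(re.escape(m) for m in matches)
escaped = s.str.contains(safe_pattern)

print(unescaped.tolist())
print(escaped.tolist())
```

With the unescaped pattern none of the strings match, because `$` only matches at the end of a string and `^` only at the start. The escaped pattern finds the first two strings.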

Answer by l'L'l

You can use str.contains alone with a regex pattern using OR (|):

s[s.str.contains('og|at')]

Or you could add the series to a dataframe and then use str.contains:

df = pd.DataFrame(s)
df[s.str.contains('og|at')] 

Output:

0 cat
1 hat
2 dog
3 fog 

Answer by Grant Shannon

Here is a one-line lambda that also works:

df["TrueFalse"] = df['col1'].apply(lambda x: 1 if any(i in x for i in searchfor) else 0)

Input:

searchfor = ['og', 'at']

df = pd.DataFrame([('cat', 1000.0), ('hat', 2000000.0), ('dog', 1000.0), ('fog', 330000.0),('pet', 330000.0)], columns=['col1', 'col2'])

  col1       col2
0  cat     1000.0
1  hat  2000000.0
2  dog     1000.0
3  fog   330000.0
4  pet   330000.0

Apply Lambda:

df["TrueFalse"] = df['col1'].apply(lambda x: 1 if any(i in x for i in searchfor) else 0)

Output:

    col1    col2        TrueFalse
0   cat     1000.0      1
1   hat     2000000.0   1
2   dog     1000.0      1
3   fog     330000.0    1
4   pet     330000.0    0
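
Note that the lambda assumes every value in col1 is a string. A missing value would raise a TypeError in `i in x`. If the column may contain missing values, one possible variation (a sketch, not part of the original answer) is to go back to str.contains and pass na=False. The sample data below is hypothetical.

```python
import re
import pandas as pd

searchfor = ['og', 'at']

# Hypothetical data: like the input above, but with one missing value
df = pd.DataFrame({'col1': ['cat', 'hat', 'dog', None, 'pet']})

# Escape each substring so it is matched literally, then join with '|'
pattern = '|'.join(re.escape(x) for x in searchfor)

# na=False treats missing values as "no match" instead of propagating NaN
df['TrueFalse'] = df['col1'].str.contains(pattern, na=False).astype(int)
print(df)
```

Dropping the `.astype(int)` call keeps the column as booleans, which is usually more convenient for filtering.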