原文地址: http://stackoverflow.com/questions/26577516/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me):
StackOverFlow
How to test if a string contains one of the substrings in a list, in pandas?
Asked by ari
Is there any function that would be the equivalent of a combination of df.isin() and df[col].str.contains()?
For example, say I have the series
s = pd.Series(['cat','hat','dog','fog','pet']), and I want to find all places where s contains any of ['og', 'at']; I would want to get everything but 'pet'.
I have a solution, but it's rather inelegant:
searchfor = ['og', 'at']
found = [s.str.contains(x) for x in searchfor]
result = pd.DataFrame(found)
result.any()
Is there a better way to do this?
Accepted answer by Alex Riley
One option is just to use the regex | character to try to match each of the substrings in the words in your Series s (still using str.contains).
You can construct the regex by joining the words in searchfor with |:
>>> searchfor = ['og', 'at']
>>> s[s.str.contains('|'.join(searchfor))]
0 cat
1 hat
2 dog
3 fog
dtype: object
As @AndyHayden noted in the comments below, take care if your substrings contain special characters such as $ and ^ that you want to match literally. These characters have specific meanings in regular expressions and will affect the matching.
You can make your list of substrings safer by escaping non-alphanumeric characters with re.escape:
>>> import re
>>> matches = ['$money', 'x^y']
>>> safe_matches = [re.escape(m) for m in matches]
>>> safe_matches
['\\$money', 'x\\^y']
The strings in this new list will match each character literally when used with str.contains.
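Putting the two steps together, a minimal sketch might look like the following. The sample series and the variable names `prices`, `pattern`, and `mask` are made up for illustration and are not part of the original answer:

```python
import re

import pandas as pd

# Hypothetical sample data, not taken from the original question
prices = pd.Series(['$money', 'x^y', 'money', 'xy'])
matches = ['$money', 'x^y']

# Escape each substring so that $ and ^ are matched literally,
# then join the escaped pieces with | (regex alternation)
pattern = '|'.join(re.escape(m) for m in matches)

# Boolean mask: True where the string contains any of the substrings
mask = prices.str.contains(pattern)
result = prices[mask]
print(result)
```

Here only the first two entries should be selected, since 'money' and 'xy' do not contain the literal substrings '$money' or 'x^y'.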
Answer by l'L'l
You can use str.contains alone with a regex pattern using OR (|):
s[s.str.contains('og|at')]
Or you could add the series to a DataFrame and then use str.contains:
df = pd.DataFrame(s)
df[s.str.contains('og|at')]
Output:
0 cat
1 hat
2 dog
3 fog
Answer by Grant Shannon
Here is a one-line lambda that also works:
df["TrueFalse"] = df['col1'].apply(lambda x: 1 if any(i in x for i in searchfor) else 0)
Input:
searchfor = ['og', 'at']
df = pd.DataFrame([('cat', 1000.0), ('hat', 2000000.0), ('dog', 1000.0), ('fog', 330000.0),('pet', 330000.0)], columns=['col1', 'col2'])
col1 col2
0 cat 1000.0
1 hat 2000000.0
2 dog 1000.0
3 fog 330000.0
4 pet 330000.0
Apply Lambda:
df["TrueFalse"] = df['col1'].apply(lambda x: 1 if any(i in x for i in searchfor) else 0)
Output:
col1 col2 TrueFalse
0 cat 1000.0 1
1 hat 2000000.0 1
2 dog 1000.0 1
3 fog 330000.0 1
4 pet 330000.0 0
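Because the lambda uses Python's plain `in` operator rather than a regular expression, substrings are matched literally and need no escaping. A small sketch of that point follows. It uses hypothetical data with special characters and a made-up column name `found`, stores booleans instead of 1/0, and assumes every value in the column is a string:

```python
import pandas as pd

# Hypothetical data for illustration, not from the original answer
df = pd.DataFrame({'col1': ['$money', 'x^y', 'money', 'pet']})
searchfor = ['$money', 'x^y']

# `in` performs a literal substring test, so $ and ^ carry no special meaning.
# Assumption: all values are strings; a missing value would make `in` fail.
df['found'] = df['col1'].apply(lambda x: any(sub in x for sub in searchfor))
print(df)
```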

