pandas dataframe str.contains() AND operation
Original source: http://stackoverflow.com/questions/37011734/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors on StackOverflow.
Asked by Aerin
df (Pandas Dataframe) has three rows.
col_name
"apple is delicious"
"banana is delicious"
"apple and banana both are delicious"
df.col_name.str.contains("apple|banana")
will catch all of the rows:
"apple is delicious",
"banana is delicious",
"apple and banana both are delicious".
How do I apply AND operator on str.contains method, so that it only grabs strings that contain BOTH apple & banana?
"apple and banana both are delicious"
I'd like to grab strings that contain 10-20 different words (grape, watermelon, berry, orange, ..., etc.)
Answered by flyingmeatball
You can do that as follows:
df[(df['col_name'].str.contains('apple')) & (df['col_name'].str.contains('banana'))]
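For reference, here is a minimal self-contained sketch of this approach using the three rows from the question. The column name col_name follows the question's code; the intermediate variable names are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({"col_name": [
    "apple is delicious",
    "banana is delicious",
    "apple and banana both are delicious",
]})

# Each str.contains call returns a boolean Series;
# the & operator combines them element-wise.
has_apple = df["col_name"].str.contains("apple")
has_banana = df["col_name"].str.contains("banana")

both = df[has_apple & has_banana]
print(both)
```

The parentheses around each condition matter when the masks are written inline, because & binds more tightly than comparison operators in Python.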
Answered by Alexander
import pandas as pd

df = pd.DataFrame({'col': ["apple is delicious",
                           "banana is delicious",
                           "apple and banana both are delicious"]})
targets = ['apple', 'banana']
# Any word from `targets` is present in the sentence.
>>> df.col.apply(lambda sentence: any(word in sentence for word in targets))
0 True
1 True
2 True
Name: col, dtype: bool
# All words from `targets` are present in the sentence.
>>> df.col.apply(lambda sentence: all(word in sentence for word in targets))
0 False
1 False
2 True
Name: col, dtype: bool
Answered by Anzel
You can also do it with a regular expression:
df[df['col_name'].str.contains(r'^(?=.*apple)(?=.*banana)')]
You can then build your list of words into a regex string like so:
base = r'^{}'
expr = '(?=.*{})'
words = ['apple', 'banana', 'cat'] # example
base.format(''.join(expr.format(w) for w in words))
This renders:
'^(?=.*apple)(?=.*banana)(?=.*cat)'
You can then build the pattern dynamically from any list of words.
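As a sketch of how the pattern building might be wrapped up, the helper below (all_words_pattern is a hypothetical name, not part of pandas) also passes each word through re.escape, on the assumption that the words should be matched literally even if they contain regex metacharacters such as "." or "+".

```python
import re
import pandas as pd

def all_words_pattern(words):
    """Build '^(?=.*w1)(?=.*w2)...' with every word escaped for literal matching."""
    return "^" + "".join("(?=.*{})".format(re.escape(w)) for w in words)

df = pd.DataFrame({"col_name": [
    "apple is delicious",
    "banana is delicious",
    "apple and banana both are delicious",
]})

pattern = all_words_pattern(["apple", "banana"])
# One lookahead per word: a row matches only if every lookahead succeeds.
mask = df["col_name"].str.contains(pattern)
print(df[mask])
```

If the words are meant to be regex fragments themselves, the re.escape call would need to be dropped.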
Answered by Charan Reddy
This works:
df.col.str.contains(r'(?=.*apple)(?=.*banana)',regex=True)
Answered by Sergey Zakharov
If you only want to use native methods and avoid writing regexps, here is a vectorized version with no lambdas involved:
import numpy as np

targets = ['apple', 'banana', 'strawberry']
# np.vstack expects a sequence such as a list rather than a generator.
fruit_masks = [df['col'].str.contains(string) for string in targets]
combined_mask = np.vstack(fruit_masks).all(axis=0)
df[combined_mask]
Answered by pmaniyan
Try this regex:
apple.*banana|banana.*apple
The code is:
import pandas as pd
df = pd.DataFrame([[1,"apple is delicious"],[2,"banana is delicious"],[3,"apple and banana both are delicious"]],columns=('ID','String_Col'))
print(df[df['String_Col'].str.contains(r'apple.*banana|banana.*apple')])
Output:
ID String_Col
2 3 apple and banana both are delicious
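This alternation has to list every ordering of the words, so it needs n! alternatives for n words. The sketch below generates such a pattern with itertools.permutations to make that growth visible; any_order_pattern is a hypothetical helper name, and the words are assumed to contain no regex metacharacters.

```python
from itertools import permutations

def any_order_pattern(words):
    """Join one 'w1.*w2.*...' alternative per ordering of the words with '|'."""
    return "|".join(".*".join(order) for order in permutations(words))

two = any_order_pattern(["apple", "banana"])
three = any_order_pattern(["apple", "banana", "cat"])

print(two)
# Count the alternatives needed for three words.
print(len(three.split("|")))
```

For the 10-20 words mentioned in the question, a pattern built this way would be impractically large, which is why the lookahead and mask-combining approaches scale better.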
Answered by Siraj S.
If you want to catch sentences that contain at least two of the words, maybe this will work (taking the tip from @Alexander):
target=['apple','banana','grapes','orange']
connector_list=['and']
df[df.col.apply(lambda sentence: (any(word in sentence for word in target)) & (all(connector in sentence for connector in connector_list)))]
Output:
col
2 apple and banana both are delicious
If you have more than two words to catch and they are separated by a comma ',', then add the comma to connector_list and change the second condition from all to any:
df[df.col.apply(lambda sentence: (any(word in sentence for word in target)) & (any(connector in sentence for connector in connector_list)))]
Output:
col
2 apple and banana both are delicious
3 orange,banana and apple all are delicious
Answered by pault
Enumerating all possibilities for large lists is cumbersome. A better way is to use reduce() and the bitwise AND operator (&).
For example, consider the following DataFrame:
df = pd.DataFrame({'col': ["apple is delicious",
"banana is delicious",
"apple and banana both are delicious",
"i love apple, banana, and strawberry"]})
# col
#0 apple is delicious
#1 banana is delicious
#2 apple and banana both are delicious
#3 i love apple, banana, and strawberry
Suppose we wanted to search for all of the following:
targets = ['apple', 'banana', 'strawberry']
We can do:
from functools import reduce  # needed on Python 3
print(df[reduce(lambda a, b: a&b, (df['col'].str.contains(s) for s in targets))])
# col
#3 i love apple, banana, and strawberry
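The same idea can be packaged as a small reusable function. This is only a sketch: contains_all is a hypothetical name, and it assumes literal substring matching is wanted, so it passes regex=False to str.contains.

```python
from functools import reduce
import operator

import pandas as pd

def contains_all(series, words):
    """Return a boolean mask that is True where the string contains every word."""
    # regex=False treats each word as a plain substring, not a pattern.
    masks = (series.str.contains(w, regex=False) for w in words)
    # operator.and_ is the function form of the & operator.
    return reduce(operator.and_, masks)

df = pd.DataFrame({"col": [
    "apple is delicious",
    "banana is delicious",
    "apple and banana both are delicious",
    "i love apple, banana, and strawberry",
]})

mask = contains_all(df["col"], ["apple", "banana", "strawberry"])
result = df[mask]
print(result)
```

Note that reduce raises a TypeError when words is empty, so a caller would need to decide what an empty word list should mean.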