Python pandas DataFrame str.contains() AND operation

Question

Asked by Aerin

df (a pandas DataFrame) has three rows.

col_name
"apple is delicious"
"banana is delicious"
"apple and banana both are delicious"

df.col_name.str.contains("apple|banana")

will catch all of the rows:

"apple is delicious",
"banana is delicious",
"apple and banana both are delicious".

How do I apply an AND operator to the str.contains method, so that it only grabs strings that contain BOTH apple and banana?

"apple and banana both are delicious"

I'd like to grab strings that contain 10-20 different words (grape, watermelon, berry, orange, etc.).

Answer 1

Answered by flyingmeatball

You can do that as follows:

df[(df['col_name'].str.contains('apple')) & (df['col_name'].str.contains('banana'))]

Answer 2

Answered by Alexander

import pandas as pd

df = pd.DataFrame({'col': ["apple is delicious",
                           "banana is delicious",
                           "apple and banana both are delicious"]})

targets = ['apple', 'banana']

# At least one word from `targets` is present in the sentence.
>>> df.col.apply(lambda sentence: any(word in sentence for word in targets))
0    True
1    True
2    True
Name: col, dtype: bool

# All words from `targets` are present in sentence.
>>> df.col.apply(lambda sentence: all(word in sentence for word in targets))
0    False
1    False
2     True
Name: col, dtype: bool
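
Note that `word in sentence` is a substring test, so 'apple' would also match 'pineapple'. If whole-word matching is wanted, one possible sketch is to tokenize each sentence first. The helper name `contains_all_words` and the sample rows below are made up for illustration and are not part of the original answer:

```python
import re

import pandas as pd

# Hypothetical sample data; the second row contains "pineapple", not "apple".
df = pd.DataFrame({'col': ["apple is delicious",
                           "pineapple and banana are delicious",
                           "apple and banana both are delicious"]})
targets = ['apple', 'banana']


def contains_all_words(sentence, words):
    """Return True if every word in `words` appears as a whole token."""
    tokens = set(re.findall(r"\w+", sentence.lower()))
    return set(words).issubset(tokens)


# Build a boolean mask row by row, then filter the DataFrame with it.
mask = df['col'].apply(lambda s: contains_all_words(s, targets))
print(df[mask])
```

This assumes every value in the column is a string; missing values would need to be handled before calling `.lower()`.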

Answer 3

Answered by Anzel

You can also do it with a regular expression, using one lookahead per word:

df[df['col_name'].str.contains(r'^(?=.*apple)(?=.*banana)')]

You can then build your list of words into a regex string like so:

base = r'^{}'
expr = '(?=.*{})'
words = ['apple', 'banana', 'cat']  # example
base.format(''.join(expr.format(w) for w in words))

This renders:

'^(?=.*apple)(?=.*banana)(?=.*cat)'

You can then build the pattern dynamically from any list of words.
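
If the words might contain regex metacharacters (such as '.' or '+'), escaping them before building the pattern is probably safer. The following is a minimal sketch under that assumption; wrapping each word in `re.escape` is an addition for illustration, not part of the original answer:

```python
import re

import pandas as pd

df = pd.DataFrame({'col_name': ["apple is delicious",
                                "banana is delicious",
                                "apple and banana both are delicious"]})
words = ['apple', 'banana']

# One lookahead per word; re.escape keeps characters such as '.' or '+'
# in a word from being interpreted as regex syntax.
pattern = '^' + ''.join('(?=.*{})'.format(re.escape(w)) for w in words)

mask = df['col_name'].str.contains(pattern, regex=True)
print(df[mask])
```

Because `.` does not match newlines by default, this sketch only looks for the words on the first line of a multi-line string.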

Answer 4

Answered by Charan Reddy

This works:

df.col.str.contains(r'(?=.*apple)(?=.*banana)',regex=True)

Answer 5

Answered by Sergey Zakharov

If you only want to use native methods and avoid writing regexps, here is a vectorized version with no lambdas involved:

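import numpy as np

targets = ['apple', 'banana', 'strawberry']
# Build a list (not a generator): recent NumPy versions require a sequence for vstack.
fruit_masks = [df['col'].str.contains(string) for string in targets]
# Stack the masks row-wise and keep the columns where every mask is True.
combined_mask = np.vstack(fruit_masks).all(axis=0)
df[combined_mask]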
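
Since `str.contains` treats its pattern as a regex by default, a variant that truly avoids regex interpretation might pass `regex=False`. The sketch below also swaps in `np.logical_and.reduce` for the stacking step; both choices are assumptions made for illustration rather than the answer's own method:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col': ["apple is delicious",
                           "banana is delicious",
                           "apple and banana both are delicious"]})
targets = ['apple', 'banana']

# regex=False treats each target as a literal substring.
masks = [df['col'].str.contains(t, regex=False) for t in targets]

# Element-wise AND across all masks; the result is a boolean NumPy array.
combined_mask = np.logical_and.reduce(masks)
print(df[combined_mask])
```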

Answer 6

Answered by pmaniyan

Try this regex:

apple.*banana|banana.*apple

The code is:

import pandas as pd

df = pd.DataFrame([[1,"apple is delicious"],[2,"banana is delicious"],[3,"apple and banana both are delicious"]],columns=('ID','String_Col'))

print(df[df['String_Col'].str.contains(r'apple.*banana|banana.*apple')])

Output:

   ID                           String_Col
2   3  apple and banana both are delicious

Answer 7

Answered by Siraj S.

If you want to catch sentences that contain at least two of the words, maybe this will work (taking the tip from @Alexander):

target=['apple','banana','grapes','orange']
connector_list=['and']
df[df.col.apply(lambda sentence: (any(word in sentence for word in target)) & (all(connector in sentence for connector in connector_list)))]

Output:

                                   col
2  apple and banana both are delicious

If you have more than two words to catch and they are separated by a comma ',', then add the comma to connector_list and change the second condition from all to any:

df[df.col.apply(lambda sentence: (any(word in sentence for word in target)) & (any(connector in sentence for connector in connector_list)))]

Output:

                                        col
2        apple and banana both are delicious
3  orange,banana and apple all are delicious

Answer 8

Answered by pault

Enumerating all possibilities for large lists is cumbersome. A better way is to use reduce() and the bitwise AND operator (&).

For example, consider the following DataFrame:

df = pd.DataFrame({'col': ["apple is delicious",
                       "banana is delicious",
                       "apple and banana both are delicious",
                       "i love apple, banana, and strawberry"]})

#                                    col
#0                    apple is delicious
#1                   banana is delicious
#2   apple and banana both are delicious
#3  i love apple, banana, and strawberry

Suppose we wanted to search for all of the following:

targets = ['apple', 'banana', 'strawberry']

We can do:

from functools import reduce  # required on Python 3, where reduce is no longer a builtin
print(df[reduce(lambda a, b: a&b, (df['col'].str.contains(s) for s in targets))])

#                                    col
#3  i love apple, banana, and strawberry
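
If the column may contain missing values or mixed case, the same reduce() idea can probably be adapted with the `na` and `case` parameters of `str.contains`. The following is a sketch under those assumptions; the sample data and the use of `operator.and_` in place of the lambda are illustrative choices:

```python
from functools import reduce
import operator

import pandas as pd

# Hypothetical sample data with a missing value and mixed case.
df = pd.DataFrame({'col': ["Apple is delicious",
                           None,
                           "apple and BANANA both are delicious"]})
targets = ['apple', 'banana']

# na=False makes missing values count as "no match";
# case=False ignores case; regex=False matches literal substrings.
masks = (df['col'].str.contains(t, case=False, regex=False, na=False)
         for t in targets)

# operator.and_ is the function form of the & operator.
combined = reduce(operator.and_, masks)
print(df[combined])
```

Without `na=False`, a missing value would typically produce a missing entry in the mask, which cannot be used directly for boolean indexing.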
