python pandas.Series.str.contains whole word

Question

Asked by Aerin

df (Pandas Dataframe) has three rows.

col_name
"This is Donald."
"His hands are so small"
"Why are his fingers so short?"

I'd like to extract the row that contains "is" and "small".

If I do

df.col_name.str.contains("is|small", case=False)

Then it catches "His" as well, which I don't want.

Is the query below the right way to match whole words in a Series?

df.col_name.str.contains("\bis\b|\bsmall\b", case=False)

Answer 1

Answered by Laurel

No, the regex /bis/b|/bsmall/b will fail because you are using /b, not the \b that means "word boundary".

Change that and you get a match. I would recommend using

\b(is|small)\b

That regex is a little faster and a little more legible, at least to me.
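
As a minimal sketch of this suggestion, assuming the three-row DataFrame from the question, the pattern can be passed to str.contains as shown here. Two details are additions and not part of the answer: the r prefix, which keeps \b from being read as a backspace, and the non-capturing group (?:...), which avoids the warning pandas emits when a str.contains pattern has capture groups.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({"col_name": [
    "This is Donald.",
    "His hands are so small",
    "Why are his fingers so short?",
]})

# Raw string, so \b reaches the regex engine as a word boundary.
pattern = r"\b(?:is|small)\b"
mask = df["col_name"].str.contains(pattern, case=False)

matches = df.loc[mask, "col_name"].tolist()
print(matches)
# ['This is Donald.', 'His hands are so small']
```

The second row matches because of "small", not because of "His".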

Answer 2

Answered by Alexander

First, you may want to convert everything to lowercase, remove punctuation and whitespace and then convert the result into a set of words.

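import re
import string

# Lowercase, strip punctuation, then split each sentence into a set of words.
# regex=True is required on pandas >= 2.0, where str.replace is literal by default.
punctuation_pattern = '[{0}]'.format(re.escape(string.punctuation))
df['words'] = [set(words) for words in
    df['col_name']
    .str.lower()
    .str.replace(punctuation_pattern, '', regex=True)
    .str.strip()
    .str.split()
]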

>>> df
                        col_name                                words
0                This is Donald.                   {this, is, donald}
1         His hands are so small         {small, his, so, are, hands}
2  Why are his fingers so short?  {short, fingers, his, so, are, why}

You can now use boolean indexing to see if all of your target words are in these new word sets.

target_words = ['is', 'small']
# Convert target words to lower case just to be safe.
target_words = [word.lower() for word in target_words]

df['match'] = df.words.apply(lambda words: all(target_word in words 
                                               for target_word in target_words))


print(df)
# Output: 
#                         col_name                                words  match
# 0                This is Donald.                   {this, is, donald}  False
# 1         His hands are so small         {small, his, so, are, hands}  False
# 2  Why are his fingers so short?  {short, fingers, his, so, are, why}  False    

target_words = ['so', 'small']
target_words = [word.lower() for word in target_words]

df['match'] = df.words.apply(lambda words: all(target_word in words 
                                               for target_word in target_words))

print(df)
# Output:
#                         col_name                                words  match
# 0                This is Donald.                   {this, is, donald}  False
# 1         His hands are so small         {small, his, so, are, hands}   True
# 2  Why are his fingers so short?  {short, fingers, his, so, are, why}  False

To extract the matching rows:

>>> df.loc[df.match, 'col_name']
# Output:
# 1    His hands are so small
# Name: col_name, dtype: object

To make this all into a single statement using boolean indexing:

# regex=True is required on pandas >= 2.0, where str.replace is literal by default.
df.loc[[all(target_word in word_set for target_word in target_words) 
        for word_set in (set(words) for words in
                         df['col_name']
                         .str.lower()
                         .str.replace('[{0}]*'.format(string.punctuation), '', regex=True)
                         .str.strip()
                         .str.split())], :]
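
The word-set idea can also be sketched as a small helper in plain Python. Note the assumptions: contains_all_words is a hypothetical name, and it tokenizes with \w+ instead of stripping punctuation as the answer does, so the two differ on words with apostrophes ("don't" becomes "don" and "t" here).

```python
import re

def contains_all_words(text, target_words):
    """Return True if every target word occurs in text as a whole word."""
    # \w+ splits on anything that is not a letter, digit or underscore.
    words = set(re.findall(r"\w+", text.lower()))
    return all(target.lower() in words for target in target_words)

rows = [
    "This is Donald.",
    "His hands are so small",
    "Why are his fingers so short?",
]

so_small = [contains_all_words(row, ["so", "small"]) for row in rows]
is_small = [contains_all_words(row, ["is", "small"]) for row in rows]
print(so_small)  # [False, True, False]
print(is_small)  # [False, False, False]
```

On a DataFrame, the same function could be applied row by row with df['col_name'].apply(contains_all_words, target_words=['so', 'small']).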

Answer 3

Answered by Mitali Cyrus

In "\bis\b|\bsmall\b", the sequence \b is parsed as an ASCII Backspace before the string is even passed to the regular expression method for matching/searching. For more information, check this document about escape characters. It is mentioned in this document that

When an 'r' or 'R' prefix is present, a character following a backslash is included in the string without change, and all backslashes are left in the string.

Use the r prefix:

df.col_name.str.contains(r"\bis\b|\bsmall\b", case=False)

Or escape the \ character:

df.col_name.str.contains("\\bis\\b|\\bsmall\\b", case=False)
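
The effect of the prefix can be checked in plain Python, without pandas. This is only an illustrative sketch; the sample sentence is made up for the demonstration.

```python
import re

plain = "\bis\b"      # each \b becomes the backspace character (0x08)
raw = r"\bis\b"       # backslashes are kept, so the regex engine sees \b
escaped = "\\bis\\b"  # doubling the backslash yields the same string as raw

print(len(plain), len(raw))  # 4 6
print(raw == escaped)        # True

text = "His hands are so small, this is it"  # made-up sample sentence
raw_hits = re.findall(raw, text, flags=re.IGNORECASE)
plain_hits = re.findall(plain, text, flags=re.IGNORECASE)
print(raw_hits)    # ['is']
print(plain_hits)  # []
```

The non-raw pattern looks for a literal backspace on either side of "is", which is why it never matches ordinary text.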

If you want to see an example, here is the Fiddle

Answer 4

Answered by szeitlin

Your way (with /b) didn't work for me. I'm not sure why you can't use the logical operator and (&) since I think that's what you actually want.

This is a silly way to do it, but it works:

mask = lambda x: ("is" in x) & ("small" in x)
series_name.apply(mask)
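
One caveat: "is" in x is a substring test, so it is also true for "His". If the goal is rows that contain both whole words, one option is to combine two word-boundary masks with &. The sketch below assumes the question's data plus one extra row, which is hypothetical and added only so that something matches.

```python
import pandas as pd

s = pd.Series([
    "This is Donald.",
    "His hands are so small",
    "Why are his fingers so short?",
    "That is small",  # hypothetical extra row, not in the question
])

# Substring test: also accepts the "is" inside "His".
substring_mask = s.apply(lambda x: ("is" in x) and ("small" in x))

# Whole-word test: both words must appear with word boundaries on each side.
has_is = s.str.contains(r"\bis\b", case=False)
has_small = s.str.contains(r"\bsmall\b", case=False)
whole_word_mask = has_is & has_small

print(substring_mask.tolist())   # [False, True, False, True]
print(whole_word_mask.tolist())  # [False, False, False, True]
```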
