python pandas.Series.str.contains WHOLE WORD

Note: this content comes from a popular StackOverFlow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverFlow. Original URL: http://stackoverflow.com/questions/39359601/

Time: 2020-09-14 01:58:04  Source: igfitidea

Tags: python, regex, pandas, dataframe

Asked by Aerin

df (Pandas Dataframe) has three rows.

col_name
"This is Donald."
"His hands are so small"
"Why are his fingers so short?"

I'd like to extract the row that contains "is" and "small".

If I do

df.col_name.str.contains("is|small", case=False)

Then it also matches "His", which I don't want.

Is the query below the right way to match whole words in a Series?

df.col_name.str.contains("\bis\b|\bsmall\b", case=False)

Answered by Laurel

No, the regex /bis/b|/bsmall/b will fail because you are using /b, not the \b that means "word boundary".

Change that and you get a match. I would recommend using

\b(is|small)\b

That regex is a little faster and a little more legible, at least to me.
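
A minimal sketch of the grouped pattern applied to the question's three rows. The DataFrame is rebuilt here for illustration only. Two details are assumptions on top of the answer: the pattern is written as a raw string so that \b stays a word boundary, and the group is made non-capturing with (?:...), because pandas emits a UserWarning when a pattern passed to str.contains has capturing groups.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({"col_name": [
    "This is Donald.",
    "His hands are so small",
    "Why are his fingers so short?",
]})

# Raw string: \b reaches the regex engine as a word boundary.
# (?:...) groups the alternatives without creating a capture group.
pattern = r"\b(?:is|small)\b"
mask = df["col_name"].str.contains(pattern, case=False)

print(df[mask])
```

This selects rows that contain either word as a whole word; "His" and "his" no longer count as a match for "is".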

Answered by Alexander

First, you may want to convert everything to lowercase, remove punctuation and whitespace and then convert the result into a set of words.

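import re
import string

# Lowercase, strip punctuation, then split on whitespace into a set of words.
# regex=True is passed explicitly: recent pandas versions treat the pattern
# as a literal string by default. re.escape keeps characters such as ] and \
# from being misread inside the character class.
df['words'] = [set(words) for words in
    df['col_name']
    .str.lower()
    .str.replace('[{0}]'.format(re.escape(string.punctuation)), '', regex=True)
    .str.strip()
    .str.split()
]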

>>> df
                        col_name                                words
0                This is Donald.                   {this, is, donald}
1         His hands are so small         {small, his, so, are, hands}
2  Why are his fingers so short?  {short, fingers, his, so, are, why}

You can now use boolean indexing to see if all of your target words are in these new word sets.

target_words = ['is', 'small']
# Convert target words to lower case just to be safe.
target_words = [word.lower() for word in target_words]

df['match'] = df.words.apply(lambda words: all(target_word in words 
                                               for target_word in target_words))


print(df)
# Output: 
#                         col_name                                words  match
# 0                This is Donald.                   {this, is, donald}  False
# 1         His hands are so small         {small, his, so, are, hands}  False
# 2  Why are his fingers so short?  {short, fingers, his, so, are, why}  False    

target_words = ['so', 'small']
target_words = [word.lower() for word in target_words]

df['match'] = df.words.apply(lambda words: all(target_word in words 
                                               for target_word in target_words))

print(df)
# Output: 
#                         col_name                                words  match
# 0                This is Donald.                   {this, is, donald}  False
# 1         His hands are so small         {small, his, so, are, hands}   True
# 2  Why are his fingers so short?  {short, fingers, his, so, are, why}  False    

To extract the matching rows:

>>> df.loc[df.match, 'col_name']
# Output:
# 1    His hands are so small
# Name: col_name, dtype: object

To do all of this in a single statement using boolean indexing:

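import re
import string

# regex=True is required on recent pandas versions; re.escape protects the
# special characters in string.punctuation inside the character class.
df.loc[[all(target_word in word_set for target_word in target_words)
        for word_set in (set(words) for words in
                         df['col_name']
                         .str.lower()
                         .str.replace('[{0}]'.format(re.escape(string.punctuation)), '', regex=True)
                         .str.strip()
                         .str.split())], :]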
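
The same set-based idea can also be wrapped in a small function. This is only a sketch under stated assumptions: contains_all_words is a hypothetical helper name (not a pandas API), the column is assumed to hold strings with no missing values, and the sample DataFrame is rebuilt from the question.

```python
import re
import string

import pandas as pd


def contains_all_words(series, target_words):
    """Return a boolean Series that is True where every target word
    appears as a whole word (case-insensitive, punctuation ignored)."""
    # Character class matching any single punctuation character.
    punct = "[{0}]".format(re.escape(string.punctuation))
    word_sets = (
        series.str.lower()
        .str.replace(punct, "", regex=True)
        .str.split()
        .apply(set)
    )
    targets = {word.lower() for word in target_words}
    # A row matches when the target set is a subset of the row's word set.
    return word_sets.apply(lambda words: targets.issubset(words))


# Sample data copied from the question.
df = pd.DataFrame({"col_name": [
    "This is Donald.",
    "His hands are so small",
    "Why are his fingers so short?",
]})

result = df.loc[contains_all_words(df["col_name"], ["so", "small"]), "col_name"]
print(result)
```

Building the mask in a function avoids storing the intermediate words and match columns on the DataFrame.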

Answered by Mitali Cyrus

In "\bis\b|\bsmall\b", the escape sequence \b is parsed as an ASCII backspace before the string is even passed to the regular expression method for matching/searching. For more information, see the Python documentation on escape characters, which states:

When an 'r' or 'R' prefix is present, a character following a backslash is included in the string without change, and all backslashes are left in the string.

  1. Use the r prefix:
df.col_name.str.contains(r"\bis\b|\bsmall\b", case=False)
  2. Escape the \ character:
df.col_name.str.contains("\\bis\\b|\\bsmall\\b", case=False)
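
The difference between the plain and raw spellings can be checked without pandas. The following sketch uses only the standard re module; the sample text is one of the rows from the question.

```python
import re

text = "His hands are so small"

# In a normal string literal, "\b" is a single backspace character (0x08).
assert "\b" == "\x08"
# In a raw string, r"\b" is two characters: a backslash and the letter b.
assert r"\b" == "\\b" and len(r"\b") == 2

plain = re.search("\bsmall\b", text)      # backspace + "small" + backspace
raw = re.search(r"\bsmall\b", text)       # "small" as a whole word
escaped = re.search("\\bsmall\\b", text)  # same pattern as the raw string

print(plain)            # None
print(raw.group())      # small
print(escaped.group())  # small
```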

If you want to see an example, here is the Fiddle

Answered by szeitlin

Your way (with /b) didn't work for me. I'm not sure why you can't use the logical operator and (&) since I think that's what you actually want.

This is a silly way to do it, but it works:

mask = lambda x: ("is" in x) & ("small" in x)
series_name.apply(mask)
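
Note that the in operator tests for substrings, so "is" is also found inside "His". If whole-word matching is what is wanted, one possible sketch is to build one str.contains mask per word and combine them with &. The Series below is rebuilt from the question's rows for illustration, and the variable names are placeholders.

```python
import pandas as pd

# Sample data copied from the question.
series = pd.Series([
    "This is Donald.",
    "His hands are so small",
    "Why are his fingers so short?",
])

# Substring test: "is" is found inside "His", so row 1 is reported as a match.
substring_mask = series.apply(lambda x: ("is" in x) and ("small" in x))

# Whole-word test: one mask per word, combined element-wise with &.
has_is = series.str.contains(r"\bis\b", case=False)
has_small = series.str.contains(r"\bsmall\b", case=False)
whole_word_mask = has_is & has_small

print(substring_mask.tolist())   # [False, True, False]
print(whole_word_mask.tolist())  # [False, False, False]
```

With whole-word matching, no row contains both "is" and "small", which agrees with the set-based result for the same target words.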