Pandas str.contains - Search for multiple values in a string and print the value in a new column

Question

Asked by jpp

I just started coding in Python and want to build a solution where you would search a string to see if it contains a given set of values.

I've found a similar solution in R which uses the stringr library: Search for a value in a string and if the value exists, print it all by itself in a new column

The following code seems to work, but I also want to output the three values that I'm looking for, and this solution only outputs one value:

#Inserting new column
df.insert(5, "New_Column", np.nan)

#Searching old column
df['New_Column'] = np.where(df['Column_with_text'].str.contains('value1|value2|value3', case=False, na=False), 'value', 'NaN')

------ Edit ------

So I realised I didn't give a very good explanation, sorry about that.

Below is an example where I match fruit names in a string; depending on whether it finds a match, it prints True or False in a new column. Here's my question: instead of printing True or False, I want to print the name it found in the string, e.g. apples, oranges, etc.

import pandas as pd
import numpy as np

text = [('I want to buy some apples.', 0),
         ('Oranges are good for the health.', 0),
         ('John is eating some grapes.', 0),
         ('This line does not contain any fruit names.', 0),
         ('I bought 2 blueberries yesterday.', 0)]
labels = ['Text','Random Column']

df = pd.DataFrame.from_records(text, columns=labels)

df.insert(2, "MatchedValues", np.nan)

foods =['apples', 'oranges', 'grapes', 'blueberries']

pattern = '|'.join(foods)

df['MatchedValues'] = df['Text'].str.contains(pattern, case=False)

print(df)

Result

                                          Text  Random Column  MatchedValues
0                   I want to buy some apples.              0           True
1             Oranges are good for the health.              0           True
2                  John is eating some grapes.              0           True
3  This line does not contain any fruit names.              0          False
4            I bought 2 blueberries yesterday.              0           True

Wanted result

                                          Text  Random Column  MatchedValues
0                   I want to buy some apples.              0         apples
1             Oranges are good for the health.              0        oranges
2                  John is eating some grapes.              0         grapes
3  This line does not contain any fruit names.              0            NaN
4            I bought 2 blueberries yesterday.              0    blueberries

Answer 1

Answered by adr

The search has to be interpreted as a regular expression. regex=True is already the default for str.contains, so the flag below only makes that explicit:

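whatIwant = df['Column_with_text'].str.contains('value1|value2|value3',
                                                 case=False, regex=True, na=False)

# np.where needs both a "true" and a "false" value:
# keep the original text where the pattern matches, NaN elsewhere
df['New_Column'] = np.where(whatIwant, df['Column_with_text'], np.nan)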

------ Edit ------

Based on the updated problem statement, here is an updated answer:

You need to define a capture group in the regular expression using parentheses and use the extract() function to return the values found within the capture group. The lower() function handles any uppercase letters by lowercasing the text before matching.

df['MatchedValues'] = df['Text'].str.lower().str.extract( '('+pattern+')', expand=False)
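
For reference, here is a self-contained sketch of the same capture-group idea. It is a variation rather than the exact line above: it matches case-insensitively with flags=re.IGNORECASE instead of lowercasing first, and it passes each term through re.escape on the assumption that a search term could contain regex metacharacters. The DataFrame is rebuilt from the question's sample so the snippet runs on its own.

```python
import re

import pandas as pd

df = pd.DataFrame({"Text": [
    "I want to buy some apples.",
    "Oranges are good for the health.",
    "John is eating some grapes.",
    "This line does not contain any fruit names.",
    "I bought 2 blueberries yesterday.",
]})
foods = ["apples", "oranges", "grapes", "blueberries"]

# Escape each term so that any regex metacharacters are matched literally,
# then wrap the alternation in parentheses to form a single capture group.
pattern = "(" + "|".join(re.escape(food) for food in foods) + ")"

# extract() returns the first match of the capture group, or NaN if none.
# expand=False gives a Series rather than a one-column DataFrame.
matched = df["Text"].str.extract(pattern, flags=re.IGNORECASE, expand=False)

# Lowercase the captured text so "Oranges" is reported as "oranges".
df["MatchedValues"] = matched.str.lower()

print(df)
```

Rows without a match stay NaN, because extract() yields a missing value when the pattern is not found.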

Answer 2

Answered by jpp

Here is one way:

foods =['apples', 'oranges', 'grapes', 'blueberries']

def matcher(x):
    for i in foods:
        if i.lower() in x.lower():
            return i
    else:
        return np.nan

df['Match'] = df['Text'].apply(matcher)

#                                           Text        Match
# 0                   I want to buy some apples.       apples
# 1             Oranges are good for the health.      oranges
# 2                  John is eating some grapes.       grapes
# 3  This line does not contain any fruit names.          NaN
# 4            I bought 2 blueberries yesterday.  blueberries
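
Both approaches above return only the first match per row. If a row can mention several of the values and you want all of them, one possible sketch uses str.findall. The sample sentences and the AllMatches column name are made up for illustration, and joining the hits with a comma is just one way to present them.

```python
import re

import pandas as pd

# Hypothetical sample data: the first row mentions two fruits.
df = pd.DataFrame({"Text": [
    "I want to buy some apples and grapes.",
    "Oranges are good for the health.",
    "This line does not contain any fruit names.",
]})
foods = ["apples", "oranges", "grapes", "blueberries"]

# No capture group is needed here; findall returns every full match.
pattern = "|".join(re.escape(food) for food in foods)

# Each row becomes a list of all matches (an empty list if there are none).
found = df["Text"].str.findall(pattern, flags=re.IGNORECASE)

# Join the lowercased matches into one string per row.
df["AllMatches"] = found.apply(lambda hits: ", ".join(hit.lower() for hit in hits))

print(df)
```

Rows with no match end up as an empty string here rather than NaN; replace it afterwards if you prefer a missing value.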
