Original source: http://stackoverflow.com/questions/48631769/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Pandas str.contains - Search for multiple values in a string and print the values in a new column
Asked by jpp
I just started coding in Python and want to build a solution where you would search a string to see if it contains a given set of values.
I've found a similar solution in R which uses the stringr library: Search for a value in a string and if the value exists, print it all by itself in a new column
The following code seems to work, but I also want to output the three values that I'm looking for, and this solution only outputs one value:
#Inserting new column
df.insert(5, "New_Column", np.nan)
#Searching old column
df['New_Column'] = np.where(df['Column_with_text'].str.contains('value1|value2|value3', case=False, na=False), 'value', 'NaN')
------ Edit ------
So I realised I didn't give a very good explanation, sorry about that.
Below is an example where I match fruit names in a string; depending on whether it finds any match, it prints either True or False in a new column. Here's my question: instead of printing True or False, I want to print the name it found in the string, e.g. apples, oranges, etc.
import pandas as pd
import numpy as np
text = [('I want to buy some apples.', 0),
        ('Oranges are good for the health.', 0),
        ('John is eating some grapes.', 0),
        ('This line does not contain any fruit names.', 0),
        ('I bought 2 blueberries yesterday.', 0)]
labels = ['Text','Random Column']
df = pd.DataFrame.from_records(text, columns=labels)
df.insert(2, "MatchedValues", np.nan)
foods =['apples', 'oranges', 'grapes', 'blueberries']
pattern = '|'.join(foods)
df['MatchedValues'] = df['Text'].str.contains(pattern, case=False)
print(df)
Result
                                          Text  Random Column  MatchedValues
0                   I want to buy some apples.              0           True
1             Oranges are good for the health.              0           True
2                  John is eating some grapes.              0           True
3  This line does not contain any fruit names.              0          False
4            I bought 2 blueberries yesterday.              0           True
Wanted result
                                          Text  Random Column  MatchedValues
0                   I want to buy some apples.              0         apples
1             Oranges are good for the health.              0        oranges
2                  John is eating some grapes.              0         grapes
3  This line does not contain any fruit names.              0            NaN
4            I bought 2 blueberries yesterday.              0    blueberries
Answered by adr
Set the regex flag so that your search string is interpreted as a regular expression (regex=True is already the default for str.contains, so passing it mainly makes the intent explicit):
whatIwant = df['Column_with_text'].str.contains('value1|value2|value3',
                                                 case=False, regex=True)
# np.where needs both a "true" and a "false" value; use NaN where nothing matched
df['New_Column'] = np.where(whatIwant, df['Column_with_text'], np.nan)
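As a rough, self-contained sketch of the masking approach above: the sample rows below are made up for illustration, and na=False is taken from the question's code (so that missing text counts as "no match") rather than from this answer.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; only the column names come from the post.
df = pd.DataFrame({'Column_with_text': ['has value1 here',
                                        'nothing relevant',
                                        'VALUE3 in caps',
                                        None]})

# Boolean mask: True where any of the alternatives occurs (case-insensitive).
# na=False treats missing text as "no match" instead of propagating NaN.
mask = df['Column_with_text'].str.contains('value1|value2|value3',
                                           case=False, regex=True, na=False)

# Keep the original text where the mask is True, NaN elsewhere.
df['New_Column'] = np.where(mask, df['Column_with_text'], np.nan)
print(df)
```

Note that this copies the whole matching string into the new column, not the matched value itself.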
------ Edit ------
Based on the updated problem statement, here is an updated answer:
You need to define a capture group in the regular expression using parentheses and use the extract() function to return the values found within the capture group. The lower() function deals with any upper-case letters.
df['MatchedValues'] = df['Text'].str.lower().str.extract('(' + pattern + ')', expand=False)
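For reference, here is a minimal runnable sketch of the capture-group idea using the sample data from the question. Two details are my own assumptions rather than part of the answer: re.escape is applied to each search term in case a term contains regex metacharacters, and case is handled with flags=re.IGNORECASE followed by lower-casing the extracted match instead of lower-casing the whole text first.

```python
import re
import pandas as pd

df = pd.DataFrame({'Text': ['I want to buy some apples.',
                            'Oranges are good for the health.',
                            'John is eating some grapes.',
                            'This line does not contain any fruit names.',
                            'I bought 2 blueberries yesterday.']})

foods = ['apples', 'oranges', 'grapes', 'blueberries']

# One capture group holding all alternatives; escape each term defensively.
pattern = '(' + '|'.join(re.escape(food) for food in foods) + ')'

# extract() returns the first match per row, or NaN when nothing matches.
df['MatchedValues'] = (df['Text']
                       .str.extract(pattern, flags=re.IGNORECASE, expand=False)
                       .str.lower())
print(df)
```

With expand=False and a single capture group, extract() returns a Series, so it can be assigned directly to a new column.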
Answered by jpp
Here is one way:
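foods = ['apples', 'oranges', 'grapes', 'blueberries']

def matcher(x):
    # Return the first food found in the text (case-insensitive)
    for i in foods:
        if i.lower() in x.lower():
            return i
    else:
        # for-else: reached only when the loop finishes without returning
        return np.nan

df['Match'] = df['Text'].apply(matcher)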
#                                           Text        Match
# 0                   I want to buy some apples.       apples
# 1             Oranges are good for the health.      oranges
# 2                  John is eating some grapes.       grapes
# 3  This line does not contain any fruit names.          NaN
# 4            I bought 2 blueberries yesterday.  blueberries
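The function above returns only the first food it finds. If a single string may mention several of the listed foods, one possible variation (not part of the original answer) is to collect every match with str.findall and join them. The sample sentences, the Matches column name, and the comma-separated output format are assumptions made for illustration, and the sketch assumes the text column has no missing values.

```python
import re
import numpy as np
import pandas as pd

foods = ['apples', 'oranges', 'grapes', 'blueberries']

# Hypothetical rows, one of which mentions two foods.
df = pd.DataFrame({'Text': ['I want to buy some apples and Grapes.',
                            'This line does not contain any fruit names.']})

pattern = '|'.join(re.escape(food) for food in foods)

# findall() gives a list of all matches per row (an empty list if none).
found = df['Text'].str.findall(pattern, flags=re.IGNORECASE)

# Join the lower-cased matches, or use NaN when the list is empty.
df['Matches'] = found.apply(
    lambda items: ', '.join(m.lower() for m in items) if items else np.nan)
print(df)
```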

