Pandas str.contains - Search for multiple values in a string and print the values in a new column
Source: http://stackoverflow.com/questions/48631769/
This page reproduces a popular Stack Overflow question under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep the same license, cite the source URL, and attribute it to the original authors on Stack Overflow (not to this page).
Asked by jpp
I just started coding in Python and want to build a solution where you would search a string to see if it contains a given set of values.
I found a similar solution in R that uses the stringr library: "Search for a value in a string and if the value exists, print it all by itself in a new column".
The following code seems to work, but I also want to output which of the three values was matched, and this solution only outputs a single fixed value:
#Inserting new column
df.insert(5, "New_Column", np.nan)
#Searching old column
df['New_Column'] = np.where(df['Column_with_text'].str.contains('value1|value2|value3', case=False, na=False), 'value', 'NaN')
------ Edit ------
I realised I didn't give a very good explanation, sorry about that.
Below is an example where I match fruit names in a string. Depending on whether it finds any match in the string, it prints either True or False in a new column. Here's my question: instead of printing True or False, I want to print the name it found in the string, e.g. apples, oranges, etc.
import pandas as pd
import numpy as np
text = [('I want to buy some apples.', 0),
('Oranges are good for the health.', 0),
('John is eating some grapes.', 0),
('This line does not contain any fruit names.', 0),
('I bought 2 blueberries yesterday.', 0)]
labels = ['Text','Random Column']
df = pd.DataFrame.from_records(text, columns=labels)
df.insert(2, "MatchedValues", np.nan)
foods =['apples', 'oranges', 'grapes', 'blueberries']
pattern = '|'.join(foods)
df['MatchedValues'] = df['Text'].str.contains(pattern, case=False)
print(df)
Result
   Text                                         Random Column  MatchedValues
0  I want to buy some apples.                   0              True
1  Oranges are good for the health.             0              True
2  John is eating some grapes.                  0              True
3  This line does not contain any fruit names.  0              False
4  I bought 2 blueberries yesterday.            0              True
Wanted result
   Text                                         Random Column  MatchedValues
0  I want to buy some apples.                   0              apples
1  Oranges are good for the health.             0              oranges
2  John is eating some grapes.                  0              grapes
3  This line does not contain any fruit names.  0              NaN
4  I bought 2 blueberries yesterday.            0              blueberries
Answered by adr
The search has to be interpreted as a regular expression so that `|` means "or". That is what regex=True does; it is the default for str.contains, so passing it only makes the intent explicit:
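# Boolean mask: True where the text contains any of the values
whatIwant = df['Column_with_text'].str.contains('value1|value2|value3',
                                                case=False, regex=True, na=False)
# Keep the original text where the mask is True, NaN elsewhere
df['New_Column'] = np.where(whatIwant, df['Column_with_text'], np.nan)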
------ Edit ------
Based on the updated problem statement, here is an updated answer:
You need to define a capture group in the regular expression using parentheses, then use the extract() function to return the value found within the capture group. The lower() function lowercases the text first, so uppercase letters in the text (such as "Oranges") still match:
df['MatchedValues'] = df['Text'].str.lower().str.extract( '('+pattern+')', expand=False)
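The capture-group approach can be sketched as a self-contained script that reuses the fruit example from the question. Two details here are assumptions that are not part of the answer above: re.escape is applied to each word in case a search term contains regex metacharacters, and flags=re.IGNORECASE is used as an alternative to lowercasing the whole column before matching (the match is lowercased afterwards instead).

```python
import re
import pandas as pd

df = pd.DataFrame({'Text': ['I want to buy some apples.',
                            'Oranges are good for the health.',
                            'This line does not contain any fruit names.']})
foods = ['apples', 'oranges', 'grapes', 'blueberries']

# Escape each word, then wrap the alternation in a single capture group
pattern = '(' + '|'.join(re.escape(f) for f in foods) + ')'

# expand=False returns a Series; rows without a match become NaN
matched = df['Text'].str.extract(pattern, flags=re.IGNORECASE, expand=False)

# Normalise the casing of whatever was matched (NaN stays NaN)
df['MatchedValues'] = matched.str.lower()
print(df)
```

Note that extract() returns only the first match in each row; text that mentions several fruits yields just the earliest one.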
Answered by jpp
Here is one way:
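foods = ['apples', 'oranges', 'grapes', 'blueberries']

def matcher(x):
    # Return the first food found in x (case-insensitive), else NaN
    for i in foods:
        if i.lower() in x.lower():
            return i
    else:
        # for-else: reached only when the loop finishes without returning
        return np.nan

df['Match'] = df['Text'].apply(matcher)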
#    Text                                         Match
# 0  I want to buy some apples.                   apples
# 1  Oranges are good for the health.             oranges
# 2  John is eating some grapes.                  grapes
# 3  This line does not contain any fruit names.  NaN
# 4  I bought 2 blueberries yesterday.            blueberries
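The matcher above returns only the first food it finds. If a row can mention several of the values, a possible variant is to collect every match with str.findall. This is a minimal sketch rather than part of the original answers; the sample sentences, the "Matches" column name, and the comma-joined output format are all assumptions made for illustration.

```python
import re
import pandas as pd

foods = ['apples', 'oranges', 'grapes', 'blueberries']
# Escape each word so it is matched literally, then join with "or"
pattern = '|'.join(re.escape(f) for f in foods)

df = pd.DataFrame({'Text': ['Apples and grapes are on sale.',
                            'No fruit here.']})

# findall returns a list of every match per row (empty list if none)
all_matches = df['Text'].str.findall(pattern, flags=re.IGNORECASE)

# Join the lowercased matches; rows with no match get None (a missing value)
df['Matches'] = all_matches.apply(
    lambda found: ', '.join(m.lower() for m in found) if found else None)
print(df)
```

Both this variant and matcher() do plain substring matching, so a word such as "pineapples" would also count as a match for "apples".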