Search for Multiple String Values of Entire Row of Dataframe in python pandas
Original question: http://stackoverflow.com/questions/50845987/
Note: this page reproduces a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverFlow
Asked by BrianBeing
In a pandas dataframe, I want to search row by row for multiple string values. If a row contains one of the string values, the function should add 1 for that row (and 0 otherwise) to an initially empty column at the end of the df.
There have been multiple tutorials on how to select rows of a Pandas DataFrame that match a (partial) string.
For Example:
import pandas as pd
#create sample data
data = {'model': ['Lisa', 'Lisa 2', 'Macintosh 128K', 'Macintosh 512K'],
'launched': [1983,1984,1984,1984],
'discontinued': [1986, 1985, 1984, 1986]}
df = pd.DataFrame(data, columns = ['model', 'launched', 'discontinued'])
df
I'm pulling the above example from this website: https://davidhamann.de/2017/06/26/pandas-select-elements-by-string/
How would I do a multi-value search of the entire row for: 'int', 'tos', '198'?
Then, in a new column next to discontinued (for example a column named int), write 1 or 0 depending on whether the row contained that keyword.
Accepted answer by mrGreenBrown
The simplest method, without any fancy pandas stuff, is to use nested for loops. I would be glad if someone could give a better solution, but my approach would be this:
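def check_all_for(column_name, search_terms):
    # Flag each row with 1 if any search term occurs in any of its cells, else 0.
    flags = []
    for _, row in df.iterrows():  # iterrows() yields (index, row) pairs
        flag = 0
        for element in row:
            for search_term in search_terms:
                if search_term in str(element).lower():
                    flag = 1
        flags.append(flag)
    # Writing to `row` would not modify df, so assign the whole column at once.
    df[column_name] = flags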
This assumes that your dataframe is defined as df and that you want to fill the new column with 1 and 0.
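One detail worth keeping in mind with loop-based approaches: df.iterrows() yields (index, Series) pairs, and the Series is generally a copy, so results need to be collected and assigned back to the dataframe. A minimal sketch using the question's sample data; the column name 'flag' is a hypothetical choice:

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    'model': ['Lisa', 'Lisa 2', 'Macintosh 128K', 'Macintosh 512K'],
    'launched': [1983, 1984, 1984, 1984],
    'discontinued': [1986, 1985, 1984, 1986],
})

search_terms = ['int', 'tos', '198']  # terms from the question

flags = []
for index, row in df.iterrows():  # each item is an (index, Series) pair
    # Compare the lowercased text of every cell against every term.
    hit = any(term in str(cell).lower() for cell in row for term in search_terms)
    flags.append(int(hit))

# Assigning to `row` inside the loop would not change df, so the collected
# flags are written back in one step. 'flag' is a hypothetical column name.
df['flag'] = flags
print(df)
```

With these terms every row is flagged, because '198' occurs in each launch year once the integers are converted to text.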
Answer by rafaelc
If you have
l=['int', 'tos', '198']
Then you can use str.contains, joining the terms with '|', to get every model that contains any of these words
df.model.str.contains('|'.join(l))
0 False
1 False
2 True
3 True
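Note that str.contains interprets the joined string as a regular expression by default, so a search term containing characters such as '.' or '+' may match more than intended. A small sketch of escaping the terms first with re.escape; the term '1.8' is a hypothetical example chosen only to show the difference:

```python
import re
import pandas as pd

models = pd.Series(['Lisa', 'Lisa 2', 'Macintosh 128K', 'Macintosh 512K'], name='model')

terms = ['1.8']  # hypothetical term containing a regex metacharacter

# Unescaped: '.' matches any character, so the '128' in '128K' matches.
as_regex = models.str.contains('|'.join(terms))

# Escaped: every term is matched literally, so nothing matches here.
as_literal = models.str.contains('|'.join(re.escape(t) for t in terms))

print(as_regex.tolist())
print(as_literal.tolist())
```

If matching should ignore case, str.contains also accepts case=False.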
Edit
If the intention is to check all columns as @jpp interpreted, I'd suggest:
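from functools import reduce
# m was undefined in the original; the output below is consistent with the terms 'int', 'tos', '1985' from @jpp's answer
m = '|'.join(['int', 'tos', '1985'])
res = reduce(lambda a, b: a | b, [df[col].astype(str).str.contains(m) for col in df.columns])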
0 False
1 True
2 True
3 True
If you want it as a column with integer values, just do
df['new_col'] = res.astype(int)
new_col
0 0
1 1
2 1
3 1
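As a possible alternative to reduce for the all-columns check, the per-column tests can be collapsed with any(axis=1). This is only a sketch under the assumption that every column can be converted to text; the terms 'int', 'tos', '1985' are borrowed from @jpp's answer on this page:

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    'model': ['Lisa', 'Lisa 2', 'Macintosh 128K', 'Macintosh 512K'],
    'launched': [1983, 1984, 1984, 1984],
    'discontinued': [1986, 1985, 1984, 1986],
})

pattern = '|'.join(['int', 'tos', '1985'])

# Convert every cell to text, test each column against the pattern,
# then collapse the per-column results row by row.
hits = df.astype(str).apply(lambda col: col.str.contains(pattern))
df['new_col'] = hits.any(axis=1).astype(int)
print(df)
```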
Answer by jpp
If I understand correctly, you wish to check the existence of strings across all columns in each row. This is not straightforward given you have mixed types (integers, strings). One way is to use pd.DataFrame.apply with a custom function.
The main point we need to remember is to convert your entire dataframe to type str, since you cannot test the existence of substrings within an integer.
match = ['int', 'tos', '1985']
def string_finder(row, words):
    if any(word in field for field in row for word in words):
        return True
    return False
df['isContained'] = df.astype(str).apply(string_finder, words=match, axis=1)
print(df)
            model  launched  discontinued  isContained
0            Lisa      1983          1986        False
1          Lisa 2      1984          1985         True
2  Macintosh 128K      1984          1984         True
3  Macintosh 512K      1984          1986         True
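The question also mentions a column named after a keyword. If one 0/1 column per keyword is wanted, the same string conversion can be reused in a loop. A minimal sketch, assuming the keywords themselves are acceptable as column names:

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    'model': ['Lisa', 'Lisa 2', 'Macintosh 128K', 'Macintosh 512K'],
    'launched': [1983, 1984, 1984, 1984],
    'discontinued': [1986, 1985, 1984, 1986],
})

keywords = ['int', 'tos', '1985']

# Snapshot of the original columns as text, taken before any flag column
# is added so the new columns are not searched themselves.
as_text = df.astype(str)

for word in keywords:
    # regex=False asks for a plain substring test instead of a regex match.
    found = as_text.apply(lambda col: col.str.contains(word, regex=False)).any(axis=1)
    df[word] = found.astype(int)

print(df)
```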
Answer by Feras
@Guy_Fuqua, my understanding is that you want to ensure that all of the words are contained in a row, am I right?
If so, a small modification of jpp's answer should help you achieve this; note the AssessAllString function here:
match = ['int', 'tos', '1984']
def string_finder(row, words):
    if any(word in field for field in row for word in words):
        return True
    return False
def AssessAllString(row, words):
    b = True
    for x in words:
        b = b & string_finder(row, [x])
    return b
df['isContained'] = df.astype(str).apply(AssessAllString, words=match, axis=1)
print(df)
            model  launched  discontinued  isContained
0            Lisa      1983          1986        False
1          Lisa 2      1984          1985        False
2  Macintosh 128K      1984          1984         True
3  Macintosh 512K      1984          1986         True
Another example:
match = ['isa','1984']
df['isContained'] = df.astype(str).apply(AssessAllString, words=match, axis=1)
            model  launched  discontinued  isContained
0            Lisa      1983          1986        False
1          Lisa 2      1984          1985         True
2  Macintosh 128K      1984          1984        False
3  Macintosh 512K      1984          1986        False
I believe the code still needs optimization, but so far it should fit the purpose.
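One way the all-words check might be condensed is to nest any() inside all(). This is a sketch rather than a benchmarked optimization, and contains_all is a hypothetical helper name:

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    'model': ['Lisa', 'Lisa 2', 'Macintosh 128K', 'Macintosh 512K'],
    'launched': [1983, 1984, 1984, 1984],
    'discontinued': [1986, 1985, 1984, 1986],
})

def contains_all(row, words):
    # True only if every word occurs in at least one field of the row.
    return all(any(word in field for field in row) for word in words)

as_text = df.astype(str)  # substring tests need string values

first = as_text.apply(contains_all, words=['int', 'tos', '1984'], axis=1)
second = as_text.apply(contains_all, words=['isa', '1984'], axis=1)

df['isContained'] = first
print(df)
```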
Answer by harvpan
You need to check whether any string in match is a substring of model.
match = [ 'int', 'tos', '198']
df['isContained'] = df['model'].apply(lambda x: 1 if any(s in x for s in match) else 0)
Output:
            model  launched  discontinued  isContained
0            Lisa      1983          1986            0
1          Lisa 2      1984          1985            0
2  Macintosh 128K      1984          1984            1
3  Macintosh 512K      1984          1986            1
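The lambda above assumes every value in model is a string. If the column can contain missing values, the `in` test would fail on them, so a type guard may help. A minimal sketch with hypothetical data that includes a gap:

```python
import pandas as pd

# Hypothetical data: the second model name is missing.
models = pd.Series(['Lisa', None, 'Macintosh 128K'], name='model')
match = ['int', 'tos', '198']

# isinstance() skips missing values, which do not support the `in` test.
flags = models.apply(lambda x: int(isinstance(x, str) and any(s in x for s in match)))
print(flags.tolist())
```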