Note: this page is a translated copy of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL, and attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/50845987/

Time: 2020-09-14 05:41:36  Source: igfitidea

Search for Multiple String Values of Entire Row of Dataframe in python pandas

Tags: python, string, pandas, dataframe

Asked by BrianBeing

In a pandas dataframe, I want to search row by row for multiple string values. If a row contains one of the string values, the function should write 1 or 0 for that row into a new, initially empty column at the end of the df, based upon whether the keyword was found.
There are multiple tutorials on how to select rows of a Pandas DataFrame that match a (partial) string.

For Example:

import pandas as pd

#create sample data
data = {'model': ['Lisa', 'Lisa 2', 'Macintosh 128K', 'Macintosh 512K'],
        'launched': [1983,1984,1984,1984],
        'discontinued': [1986, 1985, 1984, 1986]}

df = pd.DataFrame(data, columns = ['model', 'launched', 'discontinued'])
df

I'm pulling the above example from this website: https://davidhamann.de/2017/06/26/pandas-select-elements-by-string/

How would I do a multi-value search of the entire row for: 'int', 'tos', '198'?

Then write the result into a new column next to discontinued: for example, a column named int that holds 1 or 0 depending on whether the row contained that keyword.

Accepted answer by mrGreenBrown

The simplest method, without using fancy pandas stuff, would be to use two for loops. I would welcome a better solution from someone, but my approach would be this:

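def check_all_for(column_name, search_terms):
    # Flag each row of the global df: 1 if any cell contains any search term, else 0.
    flags = []
    for _, row in df.iterrows():  # iterrows() yields (index, row) pairs
        flag = 0
        for element in row:
            for search_term in search_terms:
                if search_term in str(element).lower():
                    flag = 1
        flags.append(flag)
    df[column_name] = flags  # assign once; writing to the iterated row does not update df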

The assumption is that your dataframe is defined as df and that you want to flag the new column with 1 and 0.
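
As a hedged sketch of the same nested-loop idea, the variant below takes the dataframe as a parameter instead of relying on a global df, and returns a flagged copy. The name flag_rows, the column name found, and the search list are assumptions made for illustration; they do not come from the answer.

```python
import pandas as pd


def flag_rows(frame, column_name, search_terms):
    """Return a copy of frame with a 0/1 column marking rows that contain any term."""
    terms = [term.lower() for term in search_terms]
    flags = []
    for row in frame.itertuples(index=False):
        # Compare against the lower-cased text form of every cell in the row.
        found = any(term in str(cell).lower() for cell in row for term in terms)
        flags.append(int(found))
    result = frame.copy()
    result[column_name] = flags
    return result


df = pd.DataFrame({
    'model': ['Lisa', 'Lisa 2', 'Macintosh 128K', 'Macintosh 512K'],
    'launched': [1983, 1984, 1984, 1984],
    'discontinued': [1986, 1985, 1984, 1986],
})

# Hypothetical column name and search terms, chosen only for this example.
flagged = flag_rows(df, 'found', ['int', 'tos', '1985'])
```

Returning a copy keeps the original df untouched, which makes the function easier to reuse with different term lists.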

Answer by rafaelc

If you have

l=['int', 'tos', '198']

Then you can use str.contains, joining the terms with '|', to get every model that contains any of these words

df.model.str.contains('|'.join(l))

0    False
1    False
2     True
3     True
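
Because str.contains treats its pattern as a regular expression by default, search terms that contain characters such as '.' or '+' would be interpreted as regex syntax. A minimal sketch of one way to guard against that, assuming literal, case-insensitive matching is wanted (that requirement is an assumption, not something the answer states):

```python
import re

import pandas as pd

df = pd.DataFrame({
    'model': ['Lisa', 'Lisa 2', 'Macintosh 128K', 'Macintosh 512K'],
    'launched': [1983, 1984, 1984, 1984],
    'discontinued': [1986, 1985, 1984, 1986],
})

l = ['int', 'tos', '198']

# Escape each term so regex metacharacters are matched literally,
# then OR the terms together into a single pattern.
pattern = '|'.join(re.escape(term) for term in l)

# case=False makes the search case-insensitive.
mask = df['model'].str.contains(pattern, case=False)
```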

Edit

If the intention is to check all columns as @jpp interpreted, I'd suggest:

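from functools import reduce
# Assumed: m is undefined in the original; this list (from jpp's answer) reproduces the output below.
m = '|'.join(['int', 'tos', '1985'])
res = reduce(lambda a, b: a | b, [df[col].astype(str).str.contains(m) for col in df.columns])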

0    False
1     True
2     True
3     True

If you want it as a column with integer values, just do

df['new_col'] = res.astype(int)

     new_col
0    0
1    1
2    1
3    1
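
The question also asks for a separate 0/1 column per keyword. The sketch below is one possible way to get that, searching every column of each row; naming each new column after its keyword is an assumption based on the question's wording.

```python
import pandas as pd

df = pd.DataFrame({
    'model': ['Lisa', 'Lisa 2', 'Macintosh 128K', 'Macintosh 512K'],
    'launched': [1983, 1984, 1984, 1984],
    'discontinued': [1986, 1985, 1984, 1986],
})

keywords = ['int', 'tos', '198']

# Convert every cell to text once, before adding columns, so that numeric
# columns can be searched and the new flag columns are not searched themselves.
text = df.astype(str)

for kw in keywords:
    # regex=False treats the keyword as a literal substring.
    hits = text.apply(lambda col: col.str.contains(kw, regex=False))
    # A row is flagged if any of its columns contains the keyword.
    df[kw] = hits.any(axis=1).astype(int)
```

With this sample data, every row is flagged for '198' because the years in launched and discontinued all contain it.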

Answer by jpp

If I understand correctly, you wish to check the existence of strings across all columns in each row. This is not straightforward given you have mixed types (integers, strings). One way is to use pd.DataFrame.apply with a custom function.

The main point we need to remember is to convert your entire dataframe to type str, since you cannot test the existence of substrings within an integer.

match = ['int', 'tos', '1985']

def string_finder(row, words):
    if any(word in field for field in row for word in words):
        return True
    return False

df['isContained'] = df.astype(str).apply(string_finder, words=match, axis=1)

print(df)

            model  launched  discontinued  isContained
0            Lisa      1983          1986        False
1          Lisa 2      1984          1985         True
2  Macintosh 128K      1984          1984         True
3  Macintosh 512K      1984          1986         True

Answer by Feras

@Guy_Fuqua, my understanding is that you want to ensure that all of the words appear in a single row, am I right?

If so, then a small modification to jpp's answer should help you achieve this. Kindly note the AssessAllString function here:

match = ['int', 'tos', '1984']

def string_finder(row, words):
    if any(word in field for field in row for word in words):
        return True
    return False

def AssessAllString(row, words):
    b = True
    for x in words:
        b = b & string_finder(row, [x])
    return b

df['isContained'] = df.astype(str).apply(AssessAllString, words=match, axis=1)

print(df)

            model  launched  discontinued  isContained
0            Lisa      1983          1986        False
1          Lisa 2      1984          1985        False
2  Macintosh 128K      1984          1984         True
3  Macintosh 512K      1984          1986         True

Another example:

match = ['isa','1984']
df['isContained'] = df.astype(str).apply(AssessAllString, words=match, axis=1)

            model  launched  discontinued  isContained
0            Lisa      1983          1986        False
1          Lisa 2      1984          1985         True
2  Macintosh 128K      1984          1984        False
3  Macintosh 512K      1984          1986        False

I believe the code still needs optimization, but so far it should fit the purpose.
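
One possible simplification, offered as a sketch rather than as the author's method, is to express "every word appears somewhere in the row" directly with all() and any(). The function name contains_all is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    'model': ['Lisa', 'Lisa 2', 'Macintosh 128K', 'Macintosh 512K'],
    'launched': [1983, 1984, 1984, 1984],
    'discontinued': [1986, 1985, 1984, 1986],
})


def contains_all(row, words):
    # True only if every word occurs in at least one field of the row.
    return all(any(word in field for field in row) for word in words)


match = ['int', 'tos', '1984']

# astype(str) is needed so that substring tests also work on the numeric columns.
df['isContained'] = df.astype(str).apply(contains_all, words=match, axis=1)
```

This avoids the second helper function and stops checking a row as soon as one word is missing.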

Answer by harvpan

You need to check whether any string in match is a substring of model.

match = [ 'int', 'tos', '198']
df['isContained'] = df['model'].apply(lambda x: 1 if any(s in x for s in match) else 0)

Output:

            model  launched  discontinued  isContained
0            Lisa      1983          1986            0
1          Lisa 2      1984          1985            0
2  Macintosh 128K      1984          1984            1
3  Macintosh 512K      1984          1986            1