Notice: this page is adapted from a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/46516275/

Date: 2020-09-14 04:33:31  Source: igfitidea

How to search for specific text within a Pandas dataframe column?

python string pandas dataframe

Asked by Dom B

I want to identify all rows in my Pandas csv file where a specific column, in this case the 'Notes' column, mentions the word 'exercise'. Once those rows are identified, I want to create a new column called 'ExerciseDay' that contains a 1 if the 'exercise' condition was met or a 0 if it was not. I am having trouble because the 'Notes' column can contain long string values (e.g. 'Exercise, Morning Workout, Alcohol Consumed, Coffee Consumed'), and I still want to identify 'exercise' even when it appears within a longer string.

I tried the function below in order to identify all text that contains the word 'exercise' in the 'Notes' column. No rows are selected when I use this function and I know it is likely because of the * operator but I want to show the logic. There is probably a much more efficient way to do this but I am still relatively new to programming and python.

def IdentifyExercise(row):
    if row['Notes'] == '*exercise*':
        return 1
    elif row['Notes'] != '*exercise*':
        return 0


JoinedTables['ExerciseDay'] = JoinedTables.apply(lambda row : IdentifyExercise(row), axis=1) 

Answered by jezrael

Convert the boolean Series created by str.contains to int with astype:

JoinedTables['ExerciseDay'] = JoinedTables['Notes'].str.contains('exercise').astype(int)
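
A minimal, self-contained sketch of the same idea. The sample rows below are made up for illustration and are not data from the question. One assumption worth flagging: if 'Notes' has missing values, str.contains returns a missing result for those rows and the conversion to int can fail, so this sketch passes na=False to count missing notes as non-matches.

```python
import pandas as pd

# Hypothetical sample data standing in for the asker's JoinedTables
JoinedTables = pd.DataFrame({
    "Notes": [
        "exercise, Morning Workout",
        "Coffee Consumed",
        None,  # a missing note
        "Alcohol Consumed, exercise",
    ]
})

# str.contains does a substring search, so 'exercise' is found inside longer strings.
# na=False treats missing notes as "no match" before converting booleans to 0/1.
JoinedTables["ExerciseDay"] = (
    JoinedTables["Notes"].str.contains("exercise", na=False).astype(int)
)

print(JoinedTables)
```

Unlike the == comparison in the question, which only matches when the whole cell equals the literal string '*exercise*', this flags any row whose note contains the keyword.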

For a case-insensitive match:

JoinedTables['ExerciseDay'] = (JoinedTables['Notes'].str.contains('exercise', case=False)
                                                    .astype(int))
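
To see what case=False changes, here is a small sketch comparing the default (case-sensitive) search with the case-insensitive one. The `notes` Series and its values are hypothetical, chosen to resemble the capitalised 'Exercise' from the question's example string.

```python
import pandas as pd

# Hypothetical notes with different capitalisations of the keyword
notes = pd.Series([
    "Exercise, Morning Workout",
    "EXERCISE bike",
    "Coffee Consumed",
])

# Default search is case-sensitive: lowercase 'exercise' matches neither variant
sensitive = notes.str.contains("exercise").astype(int)

# case=False ignores capitalisation
insensitive = notes.str.contains("exercise", case=False).astype(int)

print(sensitive.tolist())
print(insensitive.tolist())
```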

Answered by cs95

You can also use np.where:

import numpy as np

JoinedTables['ExerciseDay'] = \
    np.where(JoinedTables['Notes'].str.contains('exercise'), 1, 0)
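
A runnable sketch of the np.where variant, again on made-up data. The 'ExerciseLabel' column is a hypothetical addition to show that np.where can return any pair of values, not just 1 and 0. The na=False argument is an assumption added here so that missing notes yield a plain False in the condition.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; None represents a missing note
df = pd.DataFrame({
    "Notes": ["exercise, Coffee Consumed", "Alcohol Consumed", None]
})

# Boolean mask: True where the note contains the keyword
has_exercise = df["Notes"].str.contains("exercise", na=False)

# np.where(condition, value_if_true, value_if_false)
df["ExerciseDay"] = np.where(has_exercise, 1, 0)
df["ExerciseLabel"] = np.where(has_exercise, "yes", "no")  # illustrative extra column

print(df)
```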

Answered by JoseleMG

Another way would be:

JoinedTables['ExerciseDay'] = [1 if "exercise" in x else 0 for x in JoinedTables['Notes']]

(Probably not the fastest solution)
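
A hedged sketch of the list-comprehension approach on hypothetical data. Two things here are additions rather than part of the answer: the isinstance check, because the `in` operator raises a TypeError when a cell holds a missing value instead of a string, and the call to lower(), which makes the test case-insensitive.

```python
import pandas as pd

# Hypothetical notes, including a missing value
notes = pd.Series(["Exercise, Morning Workout", "Coffee Consumed", None, "exercise"])

# Guard against non-string cells, then do a plain substring test on the lowercased text
flags = [
    1 if isinstance(x, str) and "exercise" in x.lower() else 0
    for x in notes
]

print(flags)
```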
