Python/Pandas: Drop rows from data frame on string match from list

Note: this page is a translation of a popular Stack Overflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/31663426/

Date: 2020-08-19 10:23:04  Source: igfitidea

Tags: python, pandas

Asked by Sidney VanNess

I have a .csv file of contact information that I import as a pandas data frame.

>>> import pandas as pd
>>> 
>>> df = pd.read_csv('data.csv')
>>> df.head()

  fName   lName                    email   title
0  John   Smith         [email protected]     CEO
1   Joe   Schmo      [email protected]  Bagger
2  Some  Person  [email protected]   Clerk

After importing the data, I'd like to drop rows where one field contains one of several substrings in a list. For example:

to_drop = ['Clerk', 'Bagger']

for i in range(len(df)):
    for k in range(len(to_drop)):
        if to_drop[k] in df.title[i]:
            # some code to drop the rows from the data frame

df.to_csv("results.csv")

What is the preferred way to do this in Pandas? Should this even be a post-processing step, or is it preferred to filter this prior to writing to the data frame in the first place? My thought was that this would be easier to manipulate once in a data frame object.

Accepted answer by EdChum

Use isin and pass it your list of terms to search for. You can then negate the boolean mask using ~, which filters out those rows (note that isin matches whole cell values, not substrings):

In [6]:

to_drop = ['Clerk', 'Bagger']
df[~df['title'].isin(to_drop)]
Out[6]:
  fName  lName             email title
0  John  Smith  [email protected]   CEO

Another method is to join the terms with | so they form a regex alternation, then use the vectorised str.contains, which also matches substrings:

In [8]:

df[~df['title'].str.contains('|'.join(to_drop))]
Out[8]:
  fName  lName             email title
0  John  Smith  [email protected]   CEO
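
One caveat with the regex approach: terms that contain regex metacharacters (such as + or parentheses) and missing values in the column can both cause surprises. Here is a minimal sketch of one way to guard against that, assuming a small made-up frame (the rows below are placeholders, not the asker's data). It escapes each term with re.escape and passes na=False so that missing titles count as non-matches:

```python
import re
import pandas as pd

# Hypothetical sample data standing in for data.csv
df = pd.DataFrame({
    "fName": ["John", "Joe", "Some", "Ann"],
    "title": ["CEO", "Head Bagger", "Clerk (temp)", None],
})

to_drop = ["Clerk", "Bagger"]

# Escape each term so it is matched literally, then join into one alternation
pattern = "|".join(re.escape(term) for term in to_drop)

# na=False makes rows with a missing title count as "no match", so they are kept
mask = df["title"].str.contains(pattern, na=False)
kept = df[~mask]
print(kept)
```

Whether rows with a missing title should be kept or dropped depends on your data; pass na=True instead if they should be dropped.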

IMO it will be easier, and probably faster, to perform the filtering as a post-processing step: if you filter while reading, you end up growing the data frame iteratively, which is inefficient.

Alternatively, you can read the CSV in chunks, filter out the rows you don't want, and append each filtered chunk to your output CSV.
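
The chunked approach could look roughly like the sketch below. It is an illustration under assumptions, not the answerer's code: the in-memory buffers stand in for data.csv and results.csv, the sample rows are made up, and the chunk size of 2 is chosen only to force more than one chunk.

```python
import io
import pandas as pd

# Hypothetical in-memory CSV standing in for data.csv
source = io.StringIO(
    "fName,lName,title\n"
    "John,Smith,CEO\n"
    "Joe,Schmo,Bagger\n"
    "Some,Person,Clerk\n"
    "Ann,Other,CFO\n"
)
out = io.StringIO()  # stands in for results.csv

to_drop = ["Clerk", "Bagger"]

# chunksize makes read_csv yield DataFrames of at most 2 rows each
for i, chunk in enumerate(pd.read_csv(source, chunksize=2)):
    filtered = chunk[~chunk["title"].isin(to_drop)]
    # Write the header only for the first chunk; later chunks are appended
    filtered.to_csv(out, header=(i == 0), index=False)

result = out.getvalue()
print(result)
```

With real files you would pass paths instead of buffers and open the output in append mode (mode="a") for every chunk after the first.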

Answer by Zero

Another way is to use query:

In [961]: to_drop = ['Clerk', 'Bagger']

In [962]: df.query('title not in @to_drop')
Out[962]:
  fName  lName             email title
0  John  Smith  [email protected]   CEO
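
Like isin, the not in test inside query compares whole cell values rather than substrings. As a quick check, assuming a made-up three-row frame in place of the asker's data, the two forms select the same rows:

```python
import pandas as pd

# Hypothetical sample data standing in for data.csv
df = pd.DataFrame({
    "fName": ["John", "Joe", "Some"],
    "title": ["CEO", "Bagger", "Clerk"],
})
to_drop = ["Clerk", "Bagger"]

# @to_drop refers to the Python variable of that name in the calling scope
via_query = df.query("title not in @to_drop")
via_isin = df[~df["title"].isin(to_drop)]

print(via_query.equals(via_isin))
```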