Python/Pandas: Drop rows from a data frame where a string matches one in a list

Question

Asked by Sidney VanNess

I have a .csv file of contact information that I import as a pandas data frame.

>>> import pandas as pd
>>> 
>>> df = pd.read_csv('data.csv')
>>> df.head()

  fName   lName                    email   title
0  John   Smith         [email protected]     CEO
1   Joe   Schmo      [email protected]  Bagger
2  Some  Person  [email protected]   Clerk

After importing the data, I'd like to drop rows where one field contains one of several substrings in a list. For example:

to_drop = ['Clerk', 'Bagger']

for i in range(len(df)):
    for k in range(len(to_drop)):
        if to_drop[k] in df.title[i]:
            # some code to drop the rows from the data frame

df.to_csv("results.csv")

What is the preferred way to do this in Pandas? Should this even be a post-processing step, or is it preferred to filter this prior to writing to the data frame in the first place? My thought was that this would be easier to manipulate once in a data frame object.

Answer 1

Accepted answer by EdChum

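Use isin and pass it your list of terms to search for; you can then negate the boolean mask using ~, and this will filter out those rows. Note that isin matches whole values, so this works when each title is exactly one of the terms: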

In [6]:

to_drop = ['Clerk', 'Bagger']
df[~df['title'].isin(to_drop)]
Out[6]:
  fName  lName             email title
0  John  Smith  [email protected]   CEO

Another method is to join the terms so it becomes a regex and use the vectorised str.contains:

In [8]:

df[~df['title'].str.contains('|'.join(to_drop))]
Out[8]:
  fName  lName             email title
0  John  Smith  [email protected]   CEO
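The joined pattern above is interpreted as a regular expression, so a term containing characters such as "(" or "+" would change its meaning, and rows with a missing title would produce NaN in the mask. A minimal sketch of one way to guard against both, assuming a made-up sample frame (the extra row and the "Senior Clerk (temp)" title are hypothetical, not from the question's data):

```python
import re

import pandas as pd

# Hypothetical sample data; the None title stands in for a missing value.
df = pd.DataFrame({
    "fName": ["John", "Joe", "Some", "Ann"],
    "title": ["CEO", "Bagger", "Senior Clerk (temp)", None],
})

to_drop = ["Clerk", "Bagger"]

# Escape each term so regex metacharacters are matched literally,
# then join the terms with "|" to build a single alternation pattern.
pattern = "|".join(re.escape(term) for term in to_drop)

# case=False ignores letter case; na=False treats missing titles as
# "no match", so those rows are kept rather than raising an error.
mask = df["title"].str.contains(pattern, case=False, na=False)
filtered = df[~mask]
```

Whether missing titles should be kept or dropped depends on the data; passing na=True instead would drop them.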

IMO it will be easier, and probably faster, to perform the filtering as a post-processing step, because if you decide to filter while reading then you are iteratively growing the dataframe, which is not efficient.

Alternatively, you can read the csv in chunks, filter out the rows you don't want, and append the chunks to your output csv.
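A rough sketch of the chunked approach, assuming in-memory buffers in place of data.csv and results.csv so that it runs on its own (the sample rows, the extra "Ann" record, and chunksize=2 are illustrative choices, not part of the original answer):

```python
import io

import pandas as pd

# Hypothetical in-memory CSV standing in for data.csv.
source = io.StringIO(
    "fName,lName,title\n"
    "John,Smith,CEO\n"
    "Joe,Schmo,Bagger\n"
    "Some,Person,Clerk\n"
    "Ann,Lee,CFO\n"
)
output = io.StringIO()  # stands in for results.csv

to_drop = ["Clerk", "Bagger"]

# chunksize=2 is arbitrary, chosen small here to force several chunks.
for i, chunk in enumerate(pd.read_csv(source, chunksize=2)):
    kept = chunk[~chunk["title"].isin(to_drop)]
    # Write the header only once, with the first chunk.
    kept.to_csv(output, header=(i == 0), index=False)

result = output.getvalue()
```

With real files, the same loop would pass a path to to_csv and use mode="a" for every chunk after the first, so that each filtered chunk is appended rather than overwriting the file.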

Answer 2

Answer by Zero

Another way, using query:

In [961]: to_drop = ['Clerk', 'Bagger']

In [962]: df.query('title not in @to_drop')
Out[962]:
  fName  lName             email title
0  John  Smith  [email protected]   CEO
