Python/Pandas: Drop rows from data frame on string match from list
Original question: http://stackoverflow.com/questions/31663426/
Note: this page is a side-by-side translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL, and attribute it to the original authors (not me): Stack Overflow
Asked by Sidney VanNess
I have a .csv file of contact information that I import as a pandas data frame.
>>> import pandas as pd
>>>
>>> df = pd.read_csv('data.csv')
>>> df.head()
fName lName email title
0 John Smith [email protected] CEO
1 Joe Schmo [email protected] Bagger
2 Some Person [email protected] Clerk
After importing the data, I'd like to drop rows where one field contains one of several substrings in a list. For example:
to_drop = ['Clerk', 'Bagger']
for i in range(len(df)):
    for k in range(len(to_drop)):
        if to_drop[k] in df.title[i]:
            # some code to drop the rows from the data frame
            pass
df.to_csv("results.csv")
What is the preferred way to do this in Pandas? Should this even be a post-processing step, or is it preferred to filter this prior to writing to the data frame in the first place? My thought was that this would be easier to manipulate once in a data frame object.
Accepted answer by EdChum
Use isin and pass it your list of terms to search for. You can then negate the resulting boolean mask with ~, which filters out those rows:
In [6]:
to_drop = ['Clerk', 'Bagger']
df[~df['title'].isin(to_drop)]
Out[6]:
fName lName email title
0 John Smith [email protected] CEO
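One point worth checking with isin is that it compares whole cell values rather than substrings. The sketch below illustrates this on a small hypothetical data frame (the email addresses and the "Senior Clerk" title are made up for illustration and are not from the question):

```python
import pandas as pd

# Hypothetical sample data mirroring the question's columns (emails are placeholders).
df = pd.DataFrame({
    "fName": ["John", "Joe", "Some"],
    "lName": ["Smith", "Schmo", "Person"],
    "email": ["john@example.com", "joe@example.com", "some@example.com"],
    "title": ["CEO", "Bagger", "Senior Clerk"],
})

to_drop = ["Clerk", "Bagger"]

# isin compares whole cell values, so "Senior Clerk" is not matched by "Clerk".
mask = df["title"].isin(to_drop)
kept = df[~mask]
print(kept["title"].tolist())
```

If the titles must be matched as substrings, as the question describes, a substring-based test such as str.contains is probably the better fit.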
Another method is to join the terms with | so that they form a single regex, and then use the vectorised str.contains:
In [8]:
df[~df['title'].str.contains('|'.join(to_drop))]
Out[8]:
fName lName email title
0 John Smith [email protected] CEO
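Because the joined string is interpreted as a regular expression, terms containing special characters can misbehave or raise an error. A minimal sketch of a more defensive variant follows, assuming the terms should be matched literally; the sample titles, including "C++ Developer" and the missing value, are hypothetical:

```python
import re
import pandas as pd

# Hypothetical titles, including a regex-special term and a missing value.
titles = pd.Series(["CEO", "Bagger", "Senior Clerk", "C++ Developer", None], name="title")
to_drop = ["Clerk", "Bagger", "C++"]

# Escape each term so characters such as "+" are matched literally.
pattern = "|".join(re.escape(term) for term in to_drop)

# na=False treats missing titles as non-matching, so those rows are kept.
mask = titles.str.contains(pattern, na=False)
kept = titles[~mask]
print(kept)
```

Whether rows with a missing title should be kept or dropped depends on the data; passing na=True instead would drop them.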
IMO it will be easier and probably faster to perform the filtering as a post-processing step, because if you filter while reading you end up growing the dataframe iteratively, which is not efficient.
Alternatively, you can read the CSV in chunks, filter out the rows you don't want, and append each chunk to your output CSV.
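The chunked approach could look roughly like the sketch below. The helper name filter_csv_in_chunks, its parameters, and the demonstration data are hypothetical; it uses the chunksize parameter of pd.read_csv and an exact-match isin filter, which could be swapped for str.contains if substring matching is needed:

```python
import os
import tempfile
import pandas as pd

def filter_csv_in_chunks(src, dst, column, to_drop, chunksize=10_000):
    """Copy src to dst, skipping rows whose `column` value is in `to_drop`."""
    first = True
    for chunk in pd.read_csv(src, chunksize=chunksize):
        kept = chunk[~chunk[column].isin(to_drop)]
        # Write the header once, then append later chunks without it.
        kept.to_csv(dst, mode="w" if first else "a", header=first, index=False)
        first = False

# Small demonstration with a temporary file (hypothetical data).
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "data.csv")
    dst = os.path.join(tmp, "results.csv")
    pd.DataFrame({"fName": ["John", "Joe", "Some"],
                  "title": ["CEO", "Bagger", "Clerk"]}).to_csv(src, index=False)
    filter_csv_in_chunks(src, dst, "title", ["Clerk", "Bagger"], chunksize=2)
    result = pd.read_csv(dst)

print(result)
```

This keeps only one chunk in memory at a time, which mainly matters when the input file is too large to load in one go.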
Answer by Zero
Another way, using query:
In [961]: to_drop = ['Clerk', 'Bagger']
In [962]: df.query('title not in @to_drop')
Out[962]:
fName lName email title
0 John Smith [email protected] CEO
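As a self-contained sketch, the query form can be compared with the isin mask on a small hypothetical data frame (the sample rows are assumptions for illustration). In the query string, the @ prefix refers to a Python variable in the calling scope:

```python
import pandas as pd

# Hypothetical sample data with the question's "title" column.
df = pd.DataFrame({
    "fName": ["John", "Joe", "Some"],
    "title": ["CEO", "Bagger", "Clerk"],
})
to_drop = ["Clerk", "Bagger"]

# "@to_drop" refers to the local Python variable of that name.
via_query = df.query("title not in @to_drop")

# The same filter written as a negated isin mask.
via_isin = df[~df["title"].isin(to_drop)]

print(via_query)
```

Like isin, this tests whole values for membership in the list, so it does not match substrings.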