Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/28914078/

Date: 2020-09-13 23:01:32  Source: igfitidea

Filter out rows based on list of strings in Pandas

Tags: python, pandas, filter

Asked by geokrowding

I have a large time series data frame (called df), and the first 5 records look like this:

df

                          stn  years_of_data  total_minutes  avg_daily  TOA_daily  K_daily
date
1900-01-14  AlberniElementary              4           5745     34.100    114.600    0.298
1900-01-14     AlberniWeather              6           7129     29.500    114.600    0.257
1900-01-14            Arbutus              8          11174     30.500    114.600    0.266
1900-01-14          Arrowview              7          10080     27.600    114.600    0.241
1900-01-14            Bayside              7           9745     33.800    114.600    0.295

Goal:

I am trying to remove rows where any of the strings in a list are present in the 'stn' column. So, I am basically trying to filter this dataset so that it does not include rows containing any of the strings in the following list.

Attempt:

remove_list = ['Arbutus','Bayside']

cleaned = df[df['stn'].str.contains('remove_list')]

Returns:

Out[78]:

stn years_of_data   total_minutes   avg_daily   TOA_daily   K_daily
date    

Nothing!

I have tried a few combinations of quotes, brackets, and even a lambda function, though I am fairly new, so I am probably not using the syntax properly.

Answered by EdChum

Use isin:

cleaned = df[~df['stn'].isin(remove_list)]

In [7]:

remove_list = ['Arbutus','Bayside']
df[~df['stn'].isin(remove_list)]
Out[7]:
                          stn  years_of_data  total_minutes  avg_daily  \
date                                                                     
1900-01-14  AlberniElementary              4           5745       34.1   
1900-01-14     AlberniWeather              6           7129       29.5   
1900-01-14          Arrowview              7          10080       27.6   

            TOA_daily  K_daily  
date                            
1900-01-14      114.6    0.298  
1900-01-14      114.6    0.257  
1900-01-14      114.6    0.241  
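
For anyone who wants to reproduce this, here is a minimal, self-contained sketch. The DataFrame is rebuilt by hand from the five rows shown in the question (an assumption; the real df is much larger), and the names attempt and cleaned are only illustrative. It also suggests why the original attempt came back empty: the quoted 'remove_list' is searched for as literal text rather than as the list variable.

```python
import pandas as pd

# Sample rows copied from the question; the real df is assumed to be much larger.
df = pd.DataFrame(
    {
        "stn": ["AlberniElementary", "AlberniWeather", "Arbutus", "Arrowview", "Bayside"],
        "years_of_data": [4, 6, 8, 7, 7],
        "total_minutes": [5745, 7129, 11174, 10080, 9745],
        "avg_daily": [34.1, 29.5, 30.5, 27.6, 33.8],
        "TOA_daily": [114.6] * 5,
        "K_daily": [0.298, 0.257, 0.266, 0.241, 0.295],
    },
    index=pd.DatetimeIndex(["1900-01-14"] * 5, name="date"),
)

remove_list = ["Arbutus", "Bayside"]

# The question's attempt looks for the literal text 'remove_list',
# which no station name contains, so every row is filtered out.
attempt = df[df["stn"].str.contains("remove_list")]

# isin() tests exact membership in the list; ~ inverts the boolean mask
# so that only the rows NOT in remove_list are kept.
cleaned = df[~df["stn"].isin(remove_list)]

print(len(attempt))
print(cleaned["stn"].tolist())
```

Note that isin only matches whole values, so a station called 'Bayside2' would not be removed by this list.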

Answered by rajan

I had a similar question and found this old thread; I think there are other ways to get the same result. My issue with @EdChum's solution for my particular application is that I don't have a list that will be matched exactly. If you have the same issue, .isin isn't meant for that application.

Instead, you can also try a few options, including a numpy.where:

  import numpy
  remove_list = ['ayside','rrowview']
  # Flag rows whose stn contains any of the substrings (joined into one regex with |)
  df['flagCol'] = numpy.where(df.stn.str.contains('|'.join(remove_list)),1,0)

Note that this solution doesn't actually remove the matching rows, just flags them. You can copy/slice/drop as you like.
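
If you do want the rows gone rather than flagged, one possible follow-up is to reuse the same boolean mask for slicing. This is only a sketch: the one-column DataFrame below is a stand-in built from the station names in the question, and mask, kept_by_flag and kept_by_mask are illustrative names.

```python
import numpy
import pandas as pd

# Stand-in for the question's df, with only the column that matters here.
df = pd.DataFrame(
    {"stn": ["AlberniElementary", "AlberniWeather", "Arbutus", "Arrowview", "Bayside"]}
)
remove_list = ["ayside", "rrowview"]

# Join the substrings into one regular expression: 'ayside|rrowview'.
mask = df["stn"].str.contains("|".join(remove_list))

# Same flag column as above: 1 where a substring matched, 0 otherwise.
df["flagCol"] = numpy.where(mask, 1, 0)

# Two equivalent ways to drop the flagged rows.
kept_by_flag = df[df["flagCol"] == 0]
kept_by_mask = df[~mask]

print(kept_by_mask["stn"].tolist())
```

Keeping the flag column can be handy if you want to inspect what would be removed before committing to it.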

This solution is useful when you can't rely on exact matches, for example when you don't know whether the station names are capitalized and don't want to standardize the text beforehand: matching on a substring such as 'ayside' sidesteps the case of the first letter. numpy.where is usually pretty fast as well, probably not much different from .isin.
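
A variation on the same idea, in case trimming the first letter feels fragile: str.contains accepts case=False, which makes the whole match case-insensitive. The sketch below uses made-up station spellings (not from the question) to show the effect. It also escapes each term with re.escape so that characters like '.' or '(' in a name are not read as regex syntax, and passes na=False so that missing names count as non-matches.

```python
import re

import pandas as pd

# Hypothetical data with inconsistent capitalization and a missing name.
df = pd.DataFrame({"stn": ["AlberniElementary", "ARBUTUS", "arrowview", "Bayside", None]})
remove_list = ["arbutus", "Arrowview"]

# Escape each term, then join into a single alternation pattern.
pattern = "|".join(re.escape(term) for term in remove_list)

# case=False ignores capitalization; na=False treats missing values as "no match".
mask = df["stn"].str.contains(pattern, case=False, na=False)
cleaned = df[~mask]

print(cleaned["stn"].dropna().tolist())
```

This still matches substrings, so a short term like 'view' would also remove any other station whose name contains it.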