Filtering a pandas DataFrame based on a regex

Note: this page is a Chinese-English translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me). Original Stack Overflow question: http://stackoverflow.com/questions/22290000/

Time: 2020-09-13 21:47:52  Source: igfitidea

Tags: python, regex, pandas

Asked by Amelio Vazquez-Reina

Say I have a dataframe my_df with a column 'brand'. I would like to drop any rows where brand is either toyota or bmw.

I thought the following would do it:

import re

my_regex = re.compile('^(bmw$|toyota$).*$')
my_function = lambda x: my_regex.match(x.lower())
my_df[~my_df['brand'].apply(my_function)]

but I get the error:

ValueError: cannot index with vector containing NA / NaN values

Why? How can I filter my DataFrame using a regex?

Answered by behzad.nouri

I think re.match returns None when there is no match, and that breaks the indexing. Below is an alternative solution using pandas vectorized string methods; note that pandas string methods can handle null values:

>>> import re
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame({'brand': ['BMW', 'FORD', np.nan, None, 'TOYOTA', 'AUDI']})
>>> df
    brand
0     BMW
1    FORD
2     NaN
3    None
4  TOYOTA
5    AUDI

[6 rows x 1 columns]

>>> idx = df.brand.str.contains('^bmw$|^toyota$',
...                             flags=re.IGNORECASE, regex=True, na=False)
>>> idx
0     True
1    False
2    False
3    False
4     True
5    False
Name: brand, dtype: bool

>>> df[~idx]
  brand
1  FORD
2   NaN
3  None
5  AUDI

[4 rows x 1 columns]
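
If you would rather keep the apply-based approach from the question, one possible fix is to make the function always return a plain bool, so that non-matches and missing values become False instead of None. The following is only a sketch under that assumption: the helper name is_excluded and the simplified pattern are illustrative and do not come from the original question or answer. It rebuilds the sample df from above and compares the result with the str.contains mask.

```python
import re

import numpy as np
import pandas as pd

df = pd.DataFrame({'brand': ['BMW', 'FORD', np.nan, None, 'TOYOTA', 'AUDI']})

# Simplified, case-insensitive pattern (an assumption, not the asker's regex).
pattern = re.compile(r'^(bmw|toyota)$', flags=re.IGNORECASE)


def is_excluded(value):
    """Hypothetical helper: True only for strings that match the pattern.

    Missing values (NaN/None) are not strings, so they yield False
    rather than raising or returning None.
    """
    return isinstance(value, str) and pattern.match(value) is not None


# Boolean mask built with apply; every element is a real bool.
mask = df['brand'].apply(is_excluded)

# Mask from the vectorized approach shown in the answer, for comparison.
idx = df['brand'].str.contains('^bmw$|^toyota$',
                               flags=re.IGNORECASE, regex=True, na=False)

# Keep the rows that are NOT bmw/toyota.
filtered = df[~mask]
print(filtered)
```

Both masks select the same rows here; the vectorized str.contains version is usually the more idiomatic choice, while the apply version may be handy when the matching logic does not fit in a single regex.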