Python: select rows containing certain values from a pandas DataFrame

Note: this page reproduces a popular StackOverflow question under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and authors, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/38185688/

Time: 2020-08-19 20:28:18  Source: igfitidea

Select rows containing certain values from pandas dataframe

Tags: python, pandas

Asked by rferdinand

I have a pandas dataframe whose entries are all strings:

        A       B      C
1   apple  banana   pear
2    pear    pear  apple
3  banana    pear   pear
4   apple   apple   pear

etc. I want to select all the rows that contain a certain string, say, 'banana'. I don't know which column it will appear in each time. Of course, I can write a for loop and iterate over all rows. But is there an easier or faster way to do this?

Accepted answer by Divakar

With NumPy, it could be vectorized to search for as many strings as you wish, like so -

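import numpy as np

def select_rows(df, search_strings):
    # Returns the rows that contain every search string (in any column).
    # Caveat: np.searchsorted assumes each search string actually occurs in df.
    unq, IDs = np.unique(df, return_inverse=True)
    unqIDs = np.searchsorted(unq, search_strings)
    return df[((IDs.reshape(df.shape) == unqIDs[:, None, None]).any(-1)).all(0)]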

Sample run -

In [393]: df
Out[393]: 
        A       B      C
0   apple  banana   pear
1    pear    pear  apple
2  banana    pear   pear
3   apple   apple   pear

In [394]: select_rows(df,['apple','banana'])
Out[394]: 
       A       B     C
0  apple  banana  pear

In [395]: select_rows(df,['apple','pear'])
Out[395]: 
       A       B      C
0  apple  banana   pear
1   pear    pear  apple
3  apple   apple   pear

In [396]: select_rows(df,['apple','banana','pear'])
Out[396]: 
       A       B     C
0  apple  banana  pear
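
One caveat with the NumPy version: np.searchsorted returns an insertion position even for a string that does not occur in the frame, so an absent search term can end up matching a neighbouring value. Below is a minimal pandas-only sketch that avoids this. The helper name select_rows_all is hypothetical (it is not part of the answer above), and the frame is rebuilt by hand from the sample run.

```python
import numpy as np
import pandas as pd


def select_rows_all(df, search_strings):
    """Return the rows of df that contain every string in search_strings."""
    mask = np.ones(len(df), dtype=bool)
    for s in search_strings:
        # df.eq(s) compares cell by cell; any(axis=1) collapses it to one flag per row.
        mask &= df.eq(s).any(axis=1).to_numpy()
    return df[mask]


df = pd.DataFrame(
    {
        "A": ["apple", "pear", "banana", "apple"],
        "B": ["banana", "pear", "pear", "apple"],
        "C": ["pear", "apple", "pear", "pear"],
    }
)

print(select_rows_all(df, ["apple", "banana"]))
print(select_rows_all(df, ["cherry"]))  # a term that is absent gives an empty frame
```

The loop costs one pass over the frame per search string, so it is likely slower than the vectorized version for long lists of terms, but it does not depend on every term being present.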

Answer by Merlin

For a single search value:

df[df.values == "banana"]

or

df[df.isin(['banana'])]

For multiple search terms:

df[(df.values == "banana") | (df.values == "apple")]

or

df[df.isin(['banana', "apple"])]

  #         A       B      C
  #  1   apple  banana    NaN
  #  2     NaN     NaN  apple
  #  3  banana     NaN    NaN
  #  4   apple   apple    NaN

With Divakar's function, only the rows that contain both values are returned:

select_rows(df,['apple','banana'])

 #         A       B     C
 #   0  apple  banana  pear
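
To make the difference between cell-level and row-level selection concrete, here is a small sketch on the question's frame (the variable names are illustrative, not from the answer). Indexing with the boolean frame from isin keeps every row and turns non-matching cells into NaN, whereas collapsing that mask per row with any(axis=1) keeps only the matching rows, with all their columns intact.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "A": ["apple", "pear", "banana", "apple"],
        "B": ["banana", "pear", "pear", "apple"],
        "C": ["pear", "apple", "pear", "pear"],
    },
    index=[1, 2, 3, 4],
)

cell_mask = df.isin(["banana"])             # boolean DataFrame, same shape as df
masked = df[cell_mask]                      # every row kept, non-matches become NaN
matching_rows = df[cell_mask.any(axis=1)]   # only rows with at least one match

print(masked)
print(matching_rows)
```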

Answer by EdChum

You can create a boolean mask by comparing the entire df against your string, then call dropna with how='all' to drop the rows in which your string does not appear in any column:

In [59]:
df[df == 'banana'].dropna(how='all')

Out[59]:
        A       B    C
1     NaN  banana  NaN
3  banana     NaN  NaN

To test for multiple values you can use multiple masks:

In [90]:
banana = df[(df=='banana')].dropna(how='all')
banana

Out[90]:
        A       B    C
1     NaN  banana  NaN
3  banana     NaN  NaN

In [91]:    
apple = df[(df=='apple')].dropna(how='all')
apple

Out[91]:
       A      B      C
1  apple    NaN    NaN
2    NaN    NaN  apple
4  apple  apple    NaN

You can use index.intersection to select just the index values the two results have in common:

In [93]:
df.loc[apple.index.intersection(banana.index)]

Out[93]:
       A       B     C
1  apple  banana  pear
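
As an alternative sketch (not part of the answer above), each comparison can be reduced to one boolean Series per value and the Series combined with & or |. This skips the intermediate NaN-filled frames and the index intersection. The names has_apple and has_banana are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "A": ["apple", "pear", "banana", "apple"],
        "B": ["banana", "pear", "pear", "apple"],
        "C": ["pear", "apple", "pear", "pear"],
    },
    index=[1, 2, 3, 4],
)

# One boolean flag per row: does the value appear in any column?
has_apple = (df == "apple").any(axis=1)
has_banana = (df == "banana").any(axis=1)

both = df[has_apple & has_banana]     # rows containing both values
either = df[has_apple | has_banana]   # rows containing at least one of them

print(both)
print(either)
```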

Answer by avim

If you want all rows of df that contain any of the values in values, use:

df[df.isin(values).any(axis=1)]


Example:

In [2]: df                                                                                                                       
Out[2]: 
   0  1  2
0  7  4  9
1  8  2  7
2  1  9  7
3  3  8  5
4  5  1  1

In [3]: df[df.isin({1, 9, 123}).any(axis=1)]
Out[3]: 
   0  1  2
0  7  4  9
2  1  9  7
4  5  1  1
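
For reference, here is a self-contained sketch of the same example, with the frame rebuilt by hand from the printed output. The axis is passed by keyword, since recent pandas releases no longer accept it positionally in any.

```python
import pandas as pd

df = pd.DataFrame(
    [
        [7, 4, 9],
        [8, 2, 7],
        [1, 9, 7],
        [3, 8, 5],
        [5, 1, 1],
    ]
)

values = [1, 9, 123]  # 123 never occurs in df; it simply matches nothing

# isin marks matching cells; any(axis=1) keeps a row if at least one cell matched.
selected = df[df.isin(values).any(axis=1)]
print(selected)
```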