Python: Selecting rows of a pandas DataFrame that contain certain values

Question

Asked by rferdinand

I have a pandas dataframe whose entries are all strings:

   A     B      C
1 apple  banana pear
2 pear   pear   apple
3 banana pear   pear
4 apple  apple  pear

etc. I want to select all the rows that contain a certain string, say, 'banana'. I don't know which column it will appear in each time. Of course, I can write a for loop and iterate over all rows. But is there an easier or faster way to do this?

Answer 1

Accepted answer by Divakar

With NumPy, it could be vectorized to search for as many strings as you wish, like so -

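import numpy as np

def select_rows(df, search_strings):
    # Encode each cell as an index into the sorted unique values
    unq, IDs = np.unique(df, return_inverse=True)
    unqIDs = np.searchsorted(unq, search_strings)  # assumes every string occurs in df
    # Keep rows in which every search string appears in some column
    return df[((IDs.reshape(df.shape) == unqIDs[:, None, None]).any(-1)).all(0)]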

Sample run -

In [393]: df
Out[393]: 
        A       B      C
0   apple  banana   pear
1    pear    pear  apple
2  banana    pear   pear
3   apple   apple   pear

In [394]: select_rows(df,['apple','banana'])
Out[394]: 
       A       B     C
0  apple  banana  pear

In [395]: select_rows(df,['apple','pear'])
Out[395]: 
       A       B      C
0  apple  banana   pear
1   pear    pear  apple
3  apple   apple   pear

In [396]: select_rows(df,['apple','banana','pear'])
Out[396]: 
       A       B     C
0  apple  banana  pear
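
One caveat with the NumPy version: np.searchsorted returns an insertion position even for a string that does not occur in df, so a missing search string can be mistaken for a neighbouring value. Below is a minimal pandas-only sketch of the same "row contains every search string" selection that avoids this. The name select_rows_all and the sample frame are assumptions for illustration, not part of the answer above.

```python
import numpy as np
import pandas as pd

def select_rows_all(df, search_strings):
    """Return the rows of df that contain every string in search_strings.

    Hypothetical helper; assumes search_strings is non-empty.
    """
    # One boolean row mask per search string: True where any column matches.
    masks = [(df == s).any(axis=1) for s in search_strings]
    # A row is kept only if all of the per-string masks are True for it.
    return df[np.logical_and.reduce(masks)]

# Sample frame assumed from the sample run above.
df = pd.DataFrame({
    "A": ["apple", "pear", "banana", "apple"],
    "B": ["banana", "pear", "pear", "apple"],
    "C": ["pear", "apple", "pear", "pear"],
})

both = select_rows_all(df, ["apple", "banana"])  # rows holding both strings
missing = select_rows_all(df, ["cherry"])        # string absent from df
```

Because each mask is built by direct comparison, a string that is absent simply produces an all-False mask and an empty result.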

Answer 2

Answer by Merlin

For a single search value:

df[df.values == "banana"]

or

df[df.isin(['banana'])]

For multiple search terms:

df[(df.values == "banana") | (df.values == "apple")]

or

df[df.isin(['banana', "apple"])]

  #         A       B      C
  #  1   apple  banana    NaN
  #  2     NaN     NaN  apple
  #  3  banana     NaN    NaN
  #  4   apple   apple    NaN

With Divakar's select_rows, only the rows that contain both values are returned:

select_rows(df,['apple','banana'])

 #         A       B     C
 #   0  apple  banana  pear
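
Note that df.isin(...) on its own yields a boolean frame of the same shape as df, so indexing with it keeps every row and replaces the non-matching cells with NaN, as in the commented output above. Here is a minimal sketch of reducing that mask across columns so that whole rows are selected instead. The sample frame is assumed from the question, and the variable names are illustrative.

```python
import pandas as pd

# Sample frame assumed from the question (1-based index).
df = pd.DataFrame(
    {
        "A": ["apple", "pear", "banana", "apple"],
        "B": ["banana", "pear", "pear", "apple"],
        "C": ["pear", "apple", "pear", "pear"],
    },
    index=[1, 2, 3, 4],
)

cell_mask = df.isin(["banana"])   # same shape as df, True where a cell matches
masked = df[cell_mask]            # all rows kept, non-matching cells become NaN
row_mask = cell_mask.any(axis=1)  # one boolean per row
rows = df[row_mask]               # only the rows that contain "banana"
```

Selecting with the one-dimensional row_mask returns the matching rows intact, including the columns that did not match.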

Answer 3

Answer by EdChum

You can create a boolean mask by comparing the entire df against your string, then call dropna with how='all' to drop the rows in which your string does not appear in any column:

In [59]:
df[df == 'banana'].dropna(how='all')

Out[59]:
        A       B    C
1     NaN  banana  NaN
3  banana     NaN  NaN

To test for multiple values you can use multiple masks:

In [90]:
banana = df[(df=='banana')].dropna(how='all')
banana

Out[90]:
        A       B    C
1     NaN  banana  NaN
3  banana     NaN  NaN

In [91]:    
apple = df[(df=='apple')].dropna(how='all')
apple

Out[91]:
       A      B      C
1  apple    NaN    NaN
2    NaN    NaN  apple
4  apple  apple    NaN

You can use index.intersection to select just the index labels common to both results:

In [93]:
df.loc[apple.index.intersection(banana.index)]

Out[93]:
       A       B     C
1  apple  banana  pear

Answer 4

Answer by avim

If you want all rows of df that contain any of the values in values, use:

df[df.isin(values).any(axis=1)]

Example:

In [2]: df                                                                                                                       
Out[2]: 
   0  1  2
0  7  4  9
1  8  2  7
2  1  9  7
3  3  8  5
4  5  1  1

In [3]: df[df.isin({1, 9, 123}).any(axis=1)]
Out[3]: 
   0  1  2
0  7  4  9
2  1  9  7
4  5  1  1
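
The reduction along axis=1 decides what "contains" means. Here is a short sketch that reuses the numbers from the example above; the values list and the variable names are assumptions for illustration. any(axis=1) keeps rows with at least one matching cell, while all(axis=1) keeps only rows in which every cell is one of the values.

```python
import pandas as pd

# Same numbers as the example frame above.
df = pd.DataFrame([[7, 4, 9], [8, 2, 7], [1, 9, 7], [3, 8, 5], [5, 1, 1]])
values = [1, 5, 9]  # illustrative search values

hits = df.isin(values)            # boolean frame, True where a cell is in values
any_rows = df[hits.any(axis=1)]   # rows with at least one matching cell
all_rows = df[hits.all(axis=1)]   # rows whose cells are all in values
```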
