Select rows containing certain values from pandas dataframe
Source: http://stackoverflow.com/questions/38185688/
Note: this page reproduces a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me): StackOverFlow
Asked by rferdinand
I have a pandas dataframe whose entries are all strings:
        A       B      C
1   apple  banana   pear
2    pear    pear  apple
3  banana    pear   pear
4   apple   apple   pear
etc. I want to select all the rows that contain a certain string, say, 'banana'. I don't know which column it will appear in each time. Of course, I can write a for loop and iterate over all rows. But is there an easier or faster way to do this?
Accepted answer by Divakar
With NumPy, it could be vectorized to search for as many strings as you wish, like so -
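import numpy as np

def select_rows(df, search_strings):
    # Map every cell to an integer ID into the sorted unique values
    unq, IDs = np.unique(df, return_inverse=True)
    # Locate each search string among the unique values (assumes each one occurs in df)
    unqIDs = np.searchsorted(unq, search_strings)
    # Keep the rows in which every search string matches at least one cell
    return df[((IDs.reshape(df.shape) == unqIDs[:, None, None]).any(-1)).all(0)]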
Sample run -
In [393]: df
Out[393]:
        A       B      C
0   apple  banana   pear
1    pear    pear  apple
2  banana    pear   pear
3   apple   apple   pear
In [394]: select_rows(df,['apple','banana'])
Out[394]:
       A       B     C
0  apple  banana  pear
In [395]: select_rows(df,['apple','pear'])
Out[395]:
       A       B      C
0  apple  banana   pear
1   pear    pear  apple
3  apple   apple   pear
In [396]: select_rows(df,['apple','banana','pear'])
Out[396]:
       A       B     C
0  apple  banana  pear
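One caveat with the NumPy version: np.searchsorted returns an insertion position even for a string that never occurs in df, so an absent search string can end up matched against a neighbouring value. Below is a minimal pandas-only sketch of the same "rows containing every search string" selection that avoids this. The helper name select_rows_all and the reconstructed sample frame are illustrative assumptions, not part of the original answer.

```python
import pandas as pd

# Sample frame rebuilt to mirror the one in the sample run (assumption)
df = pd.DataFrame({
    "A": ["apple", "pear", "banana", "apple"],
    "B": ["banana", "pear", "pear", "apple"],
    "C": ["pear", "apple", "pear", "pear"],
})

def select_rows_all(df, search_strings):
    """Hypothetical helper: rows that contain every string in search_strings."""
    keep = pd.Series(True, index=df.index)
    for s in search_strings:
        # True for rows where at least one cell equals s
        keep &= df.eq(s).any(axis=1)
    return df[keep]

both = select_rows_all(df, ["apple", "banana"])
missing = select_rows_all(df, ["cherry"])  # string absent from df
```

Here a string that is absent from df simply yields an empty result, at the cost of one pass over the frame per search string.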
Answer by Merlin
For a single search value:
df[(df.values == "banana").any(axis=1)]
or
df[df.isin(['banana'])]
For multiple search terms:
df[((df.values == "banana") | (df.values == "apple")).any(axis=1)]
or
df[df.isin(['banana', "apple"])]
#         A       B      C
# 1   apple  banana    NaN
# 2     NaN     NaN  apple
# 3  banana     NaN    NaN
# 4   apple   apple    NaN
Using select_rows from Divakar's answer, only rows that contain both values are returned:
select_rows(df,['apple','banana'])
#        A       B     C
# 0  apple  banana  pear
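Note that indexing with df.isin([...]) keeps the shape of df and replaces the non-matching cells with NaN instead of dropping rows. The sketch below contrasts that masking with selecting whole rows by reducing the mask with any(axis=1). The frame is rebuilt from the question's data for illustration, and the variable names are assumptions.

```python
import pandas as pd

# Question's sample data, rebuilt so the snippet runs on its own
df = pd.DataFrame(
    {"A": ["apple", "pear", "banana", "apple"],
     "B": ["banana", "pear", "pear", "apple"],
     "C": ["pear", "apple", "pear", "pear"]},
    index=[1, 2, 3, 4],
)

mask = df.isin(["banana"])     # boolean frame with the same shape as df
masked = df[mask]              # same shape; non-matching cells become NaN
rows = df[mask.any(axis=1)]    # only rows with at least one match, cells intact
```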
Answer by EdChum
You can create a boolean mask by comparing the entire df against your string, then call dropna with how='all' to drop the rows in which your string doesn't appear in any column:
In [59]:
df[df == 'banana'].dropna(how='all')
Out[59]:
        A       B    C
1     NaN  banana  NaN
3  banana     NaN  NaN
To test for multiple values you can use multiple masks:
In [90]:
banana = df[(df=='banana')].dropna(how='all')
banana
Out[90]:
        A       B    C
1     NaN  banana  NaN
3  banana     NaN  NaN
In [91]:
apple = df[(df=='apple')].dropna(how='all')
apple
Out[91]:
       A      B      C
1  apple    NaN    NaN
2    NaN    NaN  apple
4  apple  apple    NaN
You can use index.intersection to select just the index values common to both:
In [93]:
df.loc[apple.index.intersection(banana.index)]
Out[93]:
       A       B     C
1  apple  banana  pear
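The mask-and-intersect steps extend naturally to any number of values. The following is a rough sketch of that generalisation; the helper name rows_with_all and the rebuilt sample frame are assumptions made for illustration.

```python
import pandas as pd

# Question's sample data, rebuilt so the snippet runs on its own
df = pd.DataFrame(
    {"A": ["apple", "pear", "banana", "apple"],
     "B": ["banana", "pear", "pear", "apple"],
     "C": ["pear", "apple", "pear", "pear"]},
    index=[1, 2, 3, 4],
)

def rows_with_all(df, values):
    """Hypothetical helper: rows whose cells include every entry of values."""
    common = df.index
    for value in values:
        # Index labels of the rows where value appears in at least one column
        hits = df[df == value].dropna(how="all").index
        common = common.intersection(hits)
    return df.loc[common]

result = rows_with_all(df, ["apple", "banana"])
```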
Answer by avim
If you want all rows of df that contain any of the values in values, use:
df[df.isin(values).any(axis=1)]
Example:
In [2]: df
Out[2]:
   0  1  2
0  7  4  9
1  8  2  7
2  1  9  7
3  3  8  5
4  5  1  1
In [3]: df[df.isin({1, 9, 123}).any(axis=1)]
Out[3]:
   0  1  2
0  7  4  9
2  1  9  7
4  5  1  1
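The reduction applied to the isin mask decides what "contains" means: any(axis=1) keeps rows with at least one listed value, while all(axis=1) keeps only rows made up entirely of listed values. A small sketch on the same numbers, with the frame rebuilt here; the second value set {1, 5} is an assumption added for contrast.

```python
import pandas as pd

# Frame from the example above, rebuilt so the snippet runs on its own
df = pd.DataFrame([[7, 4, 9], [8, 2, 7], [1, 9, 7], [3, 8, 5], [5, 1, 1]])

# Rows with at least one cell in the set
any_match = df[df.isin({1, 9, 123}).any(axis=1)]
# Rows in which every cell is in the set
only_listed = df[df.isin({1, 5}).all(axis=1)]
```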