Disclaimer: this page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/53342715/
Pandas dataframe select rows where a list-column contains any of a list of strings
Asked by NicoH
I've got a pandas DataFrame that looks like this:
  molecule            species
0        a              [dog]
1        b       [horse, pig]
2        c         [cat, dog]
3        d  [cat, horse, pig]
4        e     [chicken, pig]
and I'd like to extract a DataFrame containing only those rows whose species list contains any of selection = ['cat', 'dog']. So the result should look like this:
  molecule            species
0        a              [dog]
1        c         [cat, dog]
2        d  [cat, horse, pig]
What would be the simplest way to do this?
For testing:
import pandas as pd

selection = ['cat', 'dog']
df = pd.DataFrame({'molecule': ['a', 'b', 'c', 'd', 'e'],
                   'species': [['dog'], ['horse', 'pig'], ['cat', 'dog'],
                               ['cat', 'horse', 'pig'], ['chicken', 'pig']]})
Accepted answer by YOBEN_S
IIUC, re-create your df from the lists, then use isin with any. This should be faster than apply.
df[pd.DataFrame(df.species.tolist()).isin(selection).any(axis=1)]
Out[64]:
  molecule            species
0        a              [dog]
2        c         [cat, dog]
3        d  [cat, horse, pig]
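One caveat with the approach above: the helper DataFrame built from df.species.tolist() gets a fresh 0..n-1 index, so its mask only lines up with df when df also has a default index. The following is a minimal sketch, not part of the original answer, assuming a non-default index (the index labels are made up for illustration). It shows two variants: passing index=df.index to the helper frame, and using Series.explode (available since pandas 0.25) followed by a groupby on the index.

```python
import pandas as pd

selection = ['cat', 'dog']
df = pd.DataFrame(
    {
        'molecule': ['a', 'b', 'c', 'd', 'e'],
        'species': [['dog'], ['horse', 'pig'], ['cat', 'dog'],
                    ['cat', 'horse', 'pig'], ['chicken', 'pig']],
    },
    index=[10, 11, 12, 13, 14],  # hypothetical non-default index
)

# Variant 1: helper frame that reuses df's index, so the mask aligns by label.
expanded = pd.DataFrame(df['species'].tolist(), index=df.index)
mask_wide = expanded.isin(selection).any(axis=1)

# Variant 2: one row per list element, then collapse back per original label.
mask_long = df['species'].explode().isin(selection).groupby(level=0).any()

result = df[mask_wide]
print(result)
```

Both masks carry the labels of df, so either can be used for boolean indexing regardless of how df is indexed.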
Answer by Wes Doyle
You can use a boolean mask with apply here.
selection = ['cat', 'dog']
# True for rows whose species list contains at least one selected item
mask = df.species.apply(lambda x: any(item in x for item in selection))
df1 = df[mask]
For the DataFrame you've provided as an example above, df1 will be:
  molecule            species
0        a              [dog]
2        c         [cat, dog]
3        d  [cat, horse, pig]
Answer by Vaishali
Using NumPy would be much faster than using pandas in this case.
Option 1: Using numpy intersection,
mask = df.species.apply(lambda x: np.intersect1d(x, selection).size > 0)
df[mask]
450 μs ± 21.5 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
  molecule            species
0        a              [dog]
2        c         [cat, dog]
3        d  [cat, horse, pig]
Option 2: A similar solution to the one above, using numpy in1d (note that np.in1d is deprecated as of NumPy 2.0 in favor of np.isin),
df[df.species.apply(lambda x: np.any(np.in1d(x, selection)))]
420 μs ± 17.5 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Option 3: Interestingly, using pure Python sets is quite fast here,
df[df.species.apply(lambda x: bool(set(x) & set(selection)))]
305 μs ± 5.22 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
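Option 3 rebuilds set(selection) for every row. A small variation, sketched here under the assumption that the selection does not change between rows, builds the set once and uses set.isdisjoint, which accepts any iterable and so avoids converting each row to a set. This variant was not timed as part of the original answer, so no speed claim is made for it.

```python
import pandas as pd

selection = ['cat', 'dog']
df = pd.DataFrame({
    'molecule': ['a', 'b', 'c', 'd', 'e'],
    'species': [['dog'], ['horse', 'pig'], ['cat', 'dog'],
                ['cat', 'horse', 'pig'], ['chicken', 'pig']],
})

# Build the lookup set once instead of once per row.
wanted = set(selection)

# isdisjoint is True when the row shares no element with `wanted`,
# so its negation marks the rows to keep.
mask = df['species'].apply(lambda x: not wanted.isdisjoint(x))
result = df[mask]
print(result)
```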
Answer by Command
This is an easy and basic approach.
You can create a function that checks whether any element of the selection list is present in the list stored in the pandas column.
def check(speciesList):
    flag = False
    for animal in selection:
        if animal in speciesList:
            flag = True
    return flag
You could then use this function to create a column that contains True or False, based on whether the record contains at least one element of the selection list, and create a new data frame based on it.
df['containsCatDog'] = df.species.apply(check)
newDf = df[df.containsCatDog]
I hope it helps.
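The same idea can be written a little more compactly. The sketch below is one possible variation, not the answer's own code: the helper name contains_any and its wanted parameter are hypothetical, the selection is passed in explicitly instead of being read from a global, any() stops at the first match, and the mask is kept as a separate Series so no extra column is added to df.

```python
import pandas as pd

selection = ['cat', 'dog']
df = pd.DataFrame({
    'molecule': ['a', 'b', 'c', 'd', 'e'],
    'species': [['dog'], ['horse', 'pig'], ['cat', 'dog'],
                ['cat', 'horse', 'pig'], ['chicken', 'pig']],
})


def contains_any(species_list, wanted):
    """Return True if at least one item of `wanted` is in `species_list`."""
    # any() short-circuits: it stops scanning as soon as one match is found.
    return any(animal in species_list for animal in wanted)


# Extra keyword arguments given to Series.apply are forwarded to the function.
mask = df['species'].apply(contains_any, wanted=selection)
new_df = df[mask]
print(new_df)
```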
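Answer by Ken Dekalb
Using pandas str.contains (uses a regular expression). The column is cast to str first, because str.contains does not match against list elements. Note the ~, which negates the mask and so selects the rows that do not contain cat or dog; drop it to select the matching rows:
df[~df["species"].astype(str).str.contains('cat|dog', regex=True)]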
Output:
  molecule         species
1        b    [horse, pig]
4        e  [chicken, pig]
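A plain substring pattern such as cat|dog would also match longer names like caterpillar. The sketch below is one way to guard against that, under these assumptions: every list element is a string, Series.str.join is used to turn each list into one space-separated string, and the pattern is built with re.escape and word boundaries. The extra row f with caterpillar is made up for illustration, and the mask is not negated, so it selects the matching rows.

```python
import re

import pandas as pd

selection = ['cat', 'dog']
df = pd.DataFrame({
    'molecule': ['a', 'b', 'c', 'd', 'e', 'f'],
    'species': [['dog'], ['horse', 'pig'], ['cat', 'dog'],
                ['cat', 'horse', 'pig'], ['chicken', 'pig'],
                ['caterpillar']],  # hypothetical row to show the boundary check
})

# Join each list into a single string, e.g. ['cat', 'dog'] -> 'cat dog'.
joined = df['species'].str.join(' ')

# Escape each name and require word boundaries on both sides.
pattern = r'\b(?:' + '|'.join(re.escape(s) for s in selection) + r')\b'

mask = joined.str.contains(pattern, regex=True)
result = df[mask]
print(result)
```

Word boundaries treat characters such as hyphens as separators, so names containing punctuation may need a different delimiter scheme.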
Answer by ALEN M A
import pandas as pd

selection = ['cat', 'dog']
df = pd.DataFrame({'molecule': ['a','b','c','d','e'], 'species' : [['dog'], ['horse','pig'],['cat', 'dog'], ['cat','horse','pig'], ['chicken','pig']]})
# Rows whose species list contains 'dog', and rows whose list contains 'cat'.
df1 = df[df['species'].apply(lambda x: 'dog' in x)]
df2 = df[df['species'].apply(lambda x: 'cat' in x)]
# Stack both selections; rows matching both appear twice at this point.
frames = [df1, df2]
result = pd.concat(frames, join='inner', ignore_index=False)
print("result", result)
# Drop repeated index labels so each original row appears once.
result = result[~result.index.duplicated(keep='first')]
print(result)
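The code above hard-codes 'dog' and 'cat'. A possible generalization to any selection list, sketched here as a swapped-in technique and not as this answer's method, builds one boolean mask per item and combines them with a logical OR. This keeps the original row order and needs no de-duplication step.

```python
from functools import reduce
import operator

import pandas as pd

selection = ['cat', 'dog']
df = pd.DataFrame({
    'molecule': ['a', 'b', 'c', 'd', 'e'],
    'species': [['dog'], ['horse', 'pig'], ['cat', 'dog'],
                ['cat', 'horse', 'pig'], ['chicken', 'pig']],
})

# One boolean Series per selected item; `s=s` binds the current item
# to each lambda so later loop iterations do not overwrite it.
masks = [df['species'].apply(lambda x, s=s: s in x) for s in selection]

# Element-wise OR across all per-item masks.
mask = reduce(operator.or_, masks)
result = df[mask]
print(result)
```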