Pandas DataFrame: select rows where a list column contains any of a list of strings

Question

Asked by NicoH

I've got a pandas DataFrame that looks like this:

  molecule            species
0        a              [dog]
1        b       [horse, pig]
2        c         [cat, dog]
3        d  [cat, horse, pig]
4        e     [chicken, pig]

and I'd like to extract a DataFrame containing only those rows whose species list contains any of selection = ['cat', 'dog']. So the result should look like this:

  molecule            species
0        a              [dog]
1        c         [cat, dog]
2        d  [cat, horse, pig]

What would be the simplest way to do this?

For testing:

import pandas as pd

selection = ['cat', 'dog']
df = pd.DataFrame({'molecule': ['a','b','c','d','e'], 'species' : [['dog'], ['horse','pig'],['cat', 'dog'], ['cat','horse','pig'], ['chicken','pig']]})

Answer 1

Accepted answer by YOBEN_S

IIUC, re-creating your df from the list column and then using isin with any should be faster than apply:

df[pd.DataFrame(df.species.tolist()).isin(selection).any(axis=1)]
Out[64]: 
  molecule            species
0        a              [dog]
2        c         [cat, dog]
3        d  [cat, horse, pig]
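
To see why this works, the intermediate steps can be inspected one at a time. This is a minimal sketch on the question's test data; the names wide, hits, and mask are illustrative and not part of the original answer.

```python
import pandas as pd

selection = ['cat', 'dog']
df = pd.DataFrame({
    'molecule': ['a', 'b', 'c', 'd', 'e'],
    'species': [['dog'], ['horse', 'pig'], ['cat', 'dog'],
                ['cat', 'horse', 'pig'], ['chicken', 'pig']],
})

# Spread each list across the columns of a temporary frame;
# shorter lists are padded with missing values.
wide = pd.DataFrame(df['species'].tolist(), index=df.index)

# Element-wise membership test, then collapse each row to one boolean.
hits = wide.isin(selection)
mask = hits.any(axis=1)

result = df[mask]
print(result)
```

Passing index=df.index is an addition made here so that the mask stays aligned with df even when df does not use the default integer index.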

Answer 2

Answer by Wes Doyle

You can build a boolean mask with apply here.

selection = ['cat', 'dog']

mask = df.species.apply(lambda x: any(item in x for item in selection))
df1 = df[mask]

For the DataFrame you've provided as an example above, df1 will be:

  molecule            species
0        a              [dog]
2        c         [cat, dog]
3        d  [cat, horse, pig]

Answer 3

Answer by Vaishali

Using NumPy can be much faster than using pandas in this case. The timings below were measured on the question's five-row example.

Option 1: using NumPy's intersect1d:

mask =  df.species.apply(lambda x: np.intersect1d(x, selection).size > 0)
df[mask]
450 μs ± 21.5 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

  molecule            species
0        a              [dog]
2        c         [cat, dog]
3        d  [cat, horse, pig]

Option 2: a similar solution using NumPy's in1d (np.in1d is deprecated in recent NumPy releases; np.isin is its replacement):

df[df.species.apply(lambda x: np.any(np.in1d(x, selection)))]
420 μs ± 17.5 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Option 3: interestingly, a pure Python set intersection is quite fast here:

df[df.species.apply(lambda x: bool(set(x) & set(selection)))]
305 μs ± 5.22 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
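
The three options are meant to produce the same boolean mask, which can be checked directly. This is a small sketch on the question's test data, under two assumptions made here: np.isin stands in for np.in1d, and set(selection) is built once outside the lambda. The names m1, m2, m3, and wanted are illustrative. No timings are claimed, since they depend on the machine and the data size.

```python
import numpy as np
import pandas as pd

selection = ['cat', 'dog']
df = pd.DataFrame({
    'molecule': ['a', 'b', 'c', 'd', 'e'],
    'species': [['dog'], ['horse', 'pig'], ['cat', 'dog'],
                ['cat', 'horse', 'pig'], ['chicken', 'pig']],
})

# Option 1: a non-empty intersection means at least one match.
m1 = df['species'].apply(lambda x: np.intersect1d(x, selection).size > 0)

# Option 2: element-wise membership, reduced with any().
m2 = df['species'].apply(lambda x: bool(np.isin(x, selection).any()))

# Option 3: plain Python sets; the selection set is built only once.
wanted = set(selection)
m3 = df['species'].apply(lambda x: bool(set(x) & wanted))

# All three masks should agree row by row.
print(m1.tolist() == m2.tolist() == m3.tolist())
print(df[m3])
```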

Answer 4

Answer by Command

This is an easy and basic approach. You can create a function that checks whether any element of the selection list is present in the list stored in the pandas column.

def check(speciesList):
    flag = False
    for animal in selection:
        if animal in speciesList:
            flag = True
    return flag

You could then use this function to create a column that contains True or False, depending on whether the record contains at least one element of the selection list, and create a new data frame based on it.

df['containsCatDog'] = df.species.apply(check)
newDf = df[df.containsCatDog]
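
The same idea can be written so that the selection is passed in as a parameter and no helper column is added to df. This is only a sketch of one possible variant; the names contains_any, wanted, and new_df are hypothetical and do not come from the original answer.

```python
import pandas as pd


def contains_any(species_list, wanted):
    """Return True if at least one entry of wanted occurs in species_list."""
    for animal in wanted:
        if animal in species_list:
            return True  # stop at the first match
    return False


selection = ['cat', 'dog']
df = pd.DataFrame({
    'molecule': ['a', 'b', 'c', 'd', 'e'],
    'species': [['dog'], ['horse', 'pig'], ['cat', 'dog'],
                ['cat', 'horse', 'pig'], ['chicken', 'pig']],
})

# Extra keyword arguments given to Series.apply are forwarded to the function.
mask = df['species'].apply(contains_any, wanted=selection)
new_df = df[mask]
print(new_df)
```

Keeping the mask in a variable leaves df unchanged, which may be preferable when the frame is reused later.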

I hope it helps.

Answer 5

Answer by Ken Dekalb

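Using pandas str.contains (which uses regular expressions). The list column has to be cast to strings first, because the .str accessor does not match against list elements. Note that this is a substring match, so 'cat' would also match 'caterpillar':

df[df["species"].astype(str).str.contains('cat|dog', regex=True)]

Output:

  molecule            species
0        a              [dog]
2        c         [cat, dog]
3        d  [cat, horse, pig]

Prefixing the mask with ~ inverts it and returns the rows that contain neither string (rows 1 and 4).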
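
When the regular-expression route is used, the pattern can be built from selection instead of being typed by hand. This is a hedged sketch that assumes the list column is converted to text with astype(str); the names pattern, as_text, and mask are illustrative. re.escape protects any regex metacharacters in the search terms, and the \b word boundaries are an addition made here to avoid partial-word matches.

```python
import re

import pandas as pd

selection = ['cat', 'dog']
df = pd.DataFrame({
    'molecule': ['a', 'b', 'c', 'd', 'e'],
    'species': [['dog'], ['horse', 'pig'], ['cat', 'dog'],
                ['cat', 'horse', 'pig'], ['chicken', 'pig']],
})

# Escape each term and join with "|"; the non-capturing group keeps
# the alternation inside the \b word boundaries.
pattern = r'\b(?:' + '|'.join(re.escape(s) for s in selection) + r')\b'

# Each list becomes its text representation, e.g. "['cat', 'dog']".
as_text = df['species'].astype(str)
mask = as_text.str.contains(pattern, regex=True)

print(df[mask])
```

Because the match runs on the text form of each list, this approach is best kept for small frames or for columns that are already stored as strings.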

Answer 6

Answer by ALEN M A

import pandas as pd

selection = ['cat', 'dog']
df = pd.DataFrame({'molecule': ['a','b','c','d','e'], 'species' : [['dog'], ['horse','pig'],['cat', 'dog'], ['cat','horse','pig'], ['chicken','pig']]})

# Filter once per search term, then stack the partial results.
df1 = df[df['species'].apply(lambda x: 'dog' in x)]
df2 = df[df['species'].apply(lambda x: 'cat' in x)]
frames = [df1, df2]
result = pd.concat(frames, join='inner', ignore_index=False)
print("result", result)

# Rows that match both terms appear twice; keep the first occurrence of each index label.
result = result[~result.index.duplicated(keep='first')]
print(result)
