Notice: this page is adapted from a popular Stack Overflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me). Original Stack Overflow URL: http://stackoverflow.com/questions/52393659/

Time: 2020-09-14 06:02:15  Source: igfitidea

Pandas DataFrame check if column value exists in a group of columns

Tags: python, pandas, numpy, dataframe

Asked by EGM8686

I have a DataFrame like this (simplified example):

id  v0  v1  v2  v3  v4
1   10  5   10  22  50
2   22  23  55  60  50
3   8   2   40  80  110
4   15  15  25  100 101

I would like to create an additional column that is either 1 or 0: 1 if the v0 value appears among the values of v1 to v4 in the same row, and 0 if it does not. In this example, the value for id 1 should be 1 (since v2 = 10), and the value for id 2 should be 0, since 22 is not in v1 through v4.

In reality the table is much bigger (around 100,000 rows, with variables running from v1 to v99).
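
To make the snippets in the answers easier to try, here is a minimal sketch that rebuilds the question's sample table as a DataFrame. The variable name `expected` is an assumption added for illustration; its values are the flags shown in the answers' outputs.

```python
import pandas as pd

# Sample data copied from the question's table.
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "v0": [10, 22, 8, 15],
    "v1": [5, 23, 2, 15],
    "v2": [10, 55, 40, 25],
    "v3": [22, 60, 80, 100],
    "v4": [50, 50, 110, 101],
})

# Flags the new column should hold: 1 when v0 appears in v1..v4 of the same row.
# (Hypothetical helper name; values match the outputs shown in the answers.)
expected = [1, 0, 0, 1]

print(df)
```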

Accepted answer by user3483203

You can use the underlying NumPy arrays for performance:

Setup

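a = df.v0.values           # v0 as a 1-D array, shape (n,)
b = df.iloc[:, 2:].values  # v1 onward as a 2-D array; assumes the columns are ordered id, v0, v1, ...

# assign() returns a new DataFrame with the flag column; df itself is unchanged
df.assign(out=(a[:, None] == b).any(axis=1).astype(int))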

   id  v0  v1  v2   v3   v4  out
0   1  10   5  10   22   50    1
1   2  22  23  55   60   50    0
2   3   8   2  40   80  110    0
3   4  15  15  25  100  101    1
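
The positional slice `df.iloc[:, 2:]` depends on the columns being ordered `id, v0, v1, ...`. As a hedged alternative, the sketch below selects the comparison columns by name with `DataFrame.filter`. The regular expression and the variable name `others` are assumptions for illustration, chosen to match labels `v1`, `v2`, ... while excluding `v0` and `id`.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "v0": [10, 22, 8, 15],
    "v1": [5, 23, 2, 15],
    "v2": [10, 55, 40, 25],
    "v3": [22, 60, 80, 100],
    "v4": [50, 50, 110, 101],
})

# Select v1, v2, ... by label; the pattern rejects "v0" and "id".
others = df.filter(regex=r"^v[1-9]\d*$")

a = df["v0"].to_numpy()   # shape (n,)
b = others.to_numpy()     # shape (n, number of selected columns)

# Same broadcasting comparison as in the answer above.
out = (a[:, None] == b).any(axis=1).astype(int)
print(out.tolist())
```

Selecting by label keeps the result the same if the columns are later reordered or extra non-`v` columns are added.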


This solution leverages broadcasting to allow for pairwise comparison:

First, we add a new axis to a, turning it into a column of shape (n, 1):

>>> a[:, None]
array([[10],
       [22],
       [ 8],
       [15]], dtype=int64)

This column is then broadcast against b, so each v0 value is compared with every value in the same row of b:

>>> a[:, None] == b
array([[False,  True, False, False],
       [False, False, False, False],
       [False, False, False, False],
       [ True, False, False, False]])

We then simply check for any True results along axis 1 (that is, across each row), and convert to integer.
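
The steps above can be sketched with plain NumPy arrays, which makes the shapes explicit. The names `col`, `mask`, `flags`, and `without_axis` are assumptions for illustration. The last lines show why the extra axis matters: without it, `a` is broadcast as a row rather than a column, so the comparison is no longer row by row.

```python
import numpy as np

a = np.array([10, 22, 8, 15])           # v0, shape (4,)
b = np.array([[5, 10, 22, 50],
              [23, 55, 60, 50],
              [2, 40, 80, 110],
              [15, 25, 100, 101]])      # v1..v4, shape (4, 4)

col = a[:, None]                        # shape (4, 1): one v0 value per row
mask = col == b                         # broadcast to shape (4, 4)
flags = mask.any(axis=1).astype(int)    # reduce across each row
print(col.shape, mask.shape, flags)

# Without the new axis, a has shape (4,) and is lined up with the columns
# of b instead of its rows, which is a different comparison.
without_axis = (a == b).any(axis=1).astype(int)
print(without_axis)
```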



Performance

Functions

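def user_chris(df):
    a = df.v0.values
    b = df.iloc[:, 2:].values
    return (a[:, None] == b).any(axis=1).astype(int)

def rahlf23(df):
    df = df.set_index('id')
    # Keyword arguments: recent pandas versions no longer accept a positional axis here
    return df.drop(columns='v0').isin(df['v0']).any(axis=1).astype(int)

def chris_a(df):
    return df.loc[:, "v1":].eq(df['v0'], axis=0).any(axis=1).astype(int)

def chris(df):
    return df.apply(lambda x: int(x['v0'] in x.values[2:]), axis=1)

def anton_vbr(df):
    # set_index returns a new frame, so pop() leaves the caller's df intact;
    # mutating df in place would break repeated timeit calls on the same object
    df = df.set_index('id')
    return df.isin(df.pop('v0')).any(axis=1).astype(int)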

Setup

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from timeit import timeit

res = pd.DataFrame(
       index=['user_chris', 'rahlf23', 'chris_a', 'chris', 'anton_vbr'],
       columns=[10, 50, 100, 500, 1000, 5000],
       dtype=float
)

for f in res.index:
    for c in res.columns:
        vals = np.random.randint(1, 100, (c, c))
        vals = np.column_stack((np.arange(vals.shape[0]), vals))
        df = pd.DataFrame(vals, columns=['id'] + [f'v{i}' for i in range(0, vals.shape[0])])
        stmt = '{}(df)'.format(f)
        setp = 'from __main__ import df, {}'.format(f)
        res.at[f, c] = timeit(stmt, setp, number=50)

ax = res.div(res.min()).T.plot(loglog=True)
ax.set_xlabel("N")
ax.set_ylabel("time (relative)")

plt.show()

Output

[Image: log-log plot of relative time against N for the five functions; not reproduced here]
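
Before comparing timings, it can help to confirm that the approaches return the same flags. The following is a minimal sketch under stated assumptions: the function names `via_broadcast`, `via_isin`, `via_eq`, and `via_apply` are hypothetical stand-ins for the answers' approaches, the data is random in the same style as the benchmark above, and keyword `axis` arguments are used throughout.

```python
import numpy as np
import pandas as pd

def via_broadcast(df):
    a = df["v0"].to_numpy()
    b = df.iloc[:, 2:].to_numpy()
    return (a[:, None] == b).any(axis=1).astype(int)

def via_isin(df):
    d = df.set_index("id")
    return d.drop(columns="v0").isin(d["v0"]).any(axis=1).astype(int)

def via_eq(df):
    return df.loc[:, "v1":].eq(df["v0"], axis=0).any(axis=1).astype(int)

def via_apply(df):
    return df.apply(lambda x: int(x["v0"] in x.values[2:]), axis=1)

# Random test data: an id column followed by v0..v19, values in 1..99.
rng = np.random.default_rng(0)
n_rows, n_vars = 200, 20
vals = rng.integers(1, 100, size=(n_rows, n_vars))
vals = np.column_stack((np.arange(n_rows), vals))
df = pd.DataFrame(vals, columns=["id"] + [f"v{i}" for i in range(n_vars)])

# Every approach should produce the same 0/1 flags as the broadcasting version.
reference = via_broadcast(df)
for func in (via_isin, via_eq, via_apply):
    assert np.array_equal(reference, np.asarray(func(df))), func.__name__

print("all approaches agree")
```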

Answer by Anton vBR

Another take, most likely the shortest syntax:

# Assumes 'id' is already the index; pop() also removes v0 from df
df['new'] = df.isin(df.pop('v0')).any(axis=1).astype(int)

Full example:

from io import StringIO

import pandas as pd

data = '''\
id  v0  v1  v2  v3  v4
1   10  5   10  22  50
2   22  23  55  60  50
3   8   2   40  80  110
4   15  15  25  100 101'''

# pd.compat.StringIO is gone from current pandas; io.StringIO is the standard-library equivalent
df = pd.read_csv(StringIO(data), sep=r'\s+')
df.set_index('id', inplace=True)
df['new'] = df.isin(df.pop('v0')).any(axis=1).astype(int)
print(df)

Returns:

    v1  v2   v3   v4  new
id                       
1    5  10   22   50    1
2   23  55   60   50    0
3    2  40   80  110    0
4   15  25  100  101    1

Answer by Chris A

How about:

df['new_col'] = df.loc[:, "v1":].eq(df['v0'], axis=0).any(axis=1).astype(int)

[out]

   id  v0  v1  v2   v3   v4  new_col
0   1  10   5  10   22   50        1
1   2  22  23  55   60   50        0
2   3   8   2  40   80  110        0
3   4  15  15  25  100  101        1

Answer by rahlf23

I'm assuming here that id is set as your DataFrame index:

df = df.set_index('id')

Then the following should work (the original post linked to a similar answer):

df['New'] = df.drop(columns='v0').isin(df['v0']).any(axis=1).astype(int)

Gives:

    v0  v1  v2   v3   v4  New
id                           
1   10   5  10   22   50    1
2   22  23  55   60   50    0
3    8   2  40   80  110    0
4   15  15  25  100  101    1

Answer by Yo_Chris

You can also use a lambda function:

df['newCol'] = df.apply(lambda x: int(x['v0'] in x.values[2:]), axis=1)

   id  v0  v1  v2   v3   v4  newCol
0   1  10   5  10   22   50       1
1   2  22  23  55   60   50       0
2   3   8   2  40   80  110       0
3   4  15  15  25  100  101       1