Original question: http://stackoverflow.com/questions/44218378/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Comparison of a Dataframe column values with a list
Asked by Mehrdad Salimi
Consider this DataFrame:
import pandas as pd
df = pd.DataFrame({'A': [1, 1, 2, 2, 3, 3],
                   'B': [10, 15, 20, 25, 30, 35],
                   'C': [100, 150, 200, 250, 300, 350]})
A B C
1 10 100
1 15 150
2 20 200
2 25 250
3 30 300
3 35 350
I used this to get the values of column C for the first row of each group:
firsts = df.groupby('A').first()['C']
So firsts will be: (100, 200, 300).
Now I want to add a new column that is 1 if the row's value in column C is in firsts, and 0 otherwise.
A B C D
1 10 100 1
1 15 150 0
2 20 200 1
2 25 250 0
3 30 300 1
3 35 350 0
I used this:
df['D'] = df['C'].apply(lambda x: 1 if x in firsts else 0)
But the output is:
A B C D
1 10 100 0
1 15 150 0
2 20 200 0
2 25 250 0
3 30 300 0
3 35 350 0
I would appreciate it if anyone could explain why my solution is wrong and what the actual solution to this problem is.
Answered by Psidom
You can use the isin method:
df['D'] = df.C.isin(firsts).astype(int)
df
# A B C D
#0 1 10 100 1
#1 1 15 150 0
#2 2 20 200 1
#3 2 25 250 0
#4 3 30 300 1
#5 3 35 350 0
Your approach fails because Python's in operator checks the index of a Series rather than its values, the same way it checks the keys of a dictionary:
firsts
#A
#1 100
#2 200
#3 300
#Name: C, dtype: int64
1 in firsts
# True
100 in firsts
# False
2 in firsts
# True
200 in firsts
# False
Modifying your method as follows works:
firstSet = set(firsts)
df['C'].apply(lambda x: 1 if x in firstSet else 0)
#0 1
#1 0
#2 1
#3 0
#4 1
#5 0
#Name: C, dtype: int64
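To make the difference between index membership and value membership concrete, here is a minimal self-contained sketch built on the question's data. The names label_hit, value_hit, value_hit_fixed, by_isin and by_apply are illustrative only and do not come from the original answer.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 2, 3, 3],
                   'B': [10, 15, 20, 25, 30, 35],
                   'C': [100, 150, 200, 250, 300, 350]})
firsts = df.groupby('A').first()['C']

# `in` on a Series tests the index labels (here the group keys 1, 2, 3)
label_hit = 1 in firsts        # True: 1 is an index label
value_hit = 100 in firsts      # False: 100 is a value, not a label

# Testing against the underlying array checks the values instead
value_hit_fixed = 100 in firsts.values

# Two ways to flag rows whose C value appears among the first values
by_isin = df['C'].isin(firsts).astype(int)
first_values = set(firsts)     # build the set once, outside the lambda
by_apply = df['C'].apply(lambda x: 1 if x in first_values else 0)
```

Both by_isin and by_apply hold the same flags; isin is usually the more idiomatic choice because it is vectorized.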
Answered by MaxU
You can also do it in one step using .transform('first'):
In [280]: df['D'] = df.groupby('A')['C'].transform('first').eq(df['C']).astype(np.int8)
In [281]: df
Out[281]:
A B C D
0 1 10 100 1
1 1 15 150 0
2 2 20 200 1
3 2 25 250 0
4 3 30 300 1
5 3 35 350 0
Explanation: GroupBy.transform('func') returns a vector with the same length as the original DataFrame, with func applied to each group:
In [14]: df.groupby('A')['C'].transform('first')
Out[14]:
0 100
1 100
2 200
3 200
4 300
5 300
Name: C, dtype: int64
In [15]: df.groupby('A')['C'].transform('max')
Out[15]:
0 150
1 150
2 250
3 250
4 350
5 350
Name: C, dtype: int64
In [16]: df.groupby('A')['C'].transform('min')
Out[16]:
0 100
1 100
2 200
3 200
4 300
5 300
Name: C, dtype: int64
In [17]: df.groupby('A')['C'].transform('mean')
Out[17]:
0 125
1 125
2 225
3 225
4 325
5 325
Name: C, dtype: int64
In [18]: df.groupby('A')['C'].transform('sum')
Out[18]:
0 250
1 250
2 450
3 450
4 650
5 650
Name: C, dtype: int64
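The transform approach and the isin approach give the same result on the question's data, but they are not equivalent in general. The sketch below uses hypothetical data (df2, not from the original post) in which one group contains a value that happens to be another group's first value.

```python
import pandas as pd

# Hypothetical data: group 2 contains 100, which is the first value of group 1
df2 = pd.DataFrame({'A': [1, 1, 2, 2],
                    'C': [100, 150, 200, 100]})

firsts = df2.groupby('A')['C'].first()

# isin: 1 wherever C matches the first value of *any* group
via_isin = df2['C'].isin(firsts).astype(int)

# transform: 1 only where C equals the first value of the row's *own* group
via_transform = (df2.groupby('A')['C'].transform('first')
                 .eq(df2['C'])
                 .astype(int))
```

Here the last row is flagged by isin but not by the transform comparison, so which one is appropriate depends on whether the check should be per group or across all groups.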
Answered by Friedrich
TL;DR:
df['newColumn'] = np.where((df.compareColumn.isin(yourlist)), TrueValue, FalseValue)
Another one-step method is to use np.where() together with isin.
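import pandas as pd
import numpy as np
df = pd.DataFrame({'A': [1, 1, 2, 2, 3, 3],
                   'B': [10, 15, 20, 25, 30, 35],
                   'C': [100, 150, 200, 250, 300, 350]})
# first value of C in each group of A, as defined in the question
firsts = df.groupby('A').first()['C']
# compare column C (not B) against firsts
df['D'] = np.where(df.C.isin(firsts), 1, 0)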
We use the return value of isin as the condition in np.where() to return either
1 when True
0 when False
and assign the result to a new column in the same dataframe, df['D'].
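As a small illustration of the TrueValue and FalseValue placeholders in the TL;DR line, the following sketch applies the same boolean mask twice, once with integers and once with string labels. The column name label and the strings 'first' and 'other' are made up for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 2, 3, 3],
                   'C': [100, 150, 200, 250, 300, 350]})
firsts = df.groupby('A')['C'].first()

mask = df['C'].isin(firsts)                      # boolean Series
df['D'] = np.where(mask, 1, 0)                   # integer flags
df['label'] = np.where(mask, 'first', 'other')   # arbitrary replacement values
```

When only 0 and 1 are needed, mask.astype(int) gives the same flags; np.where is mainly useful when the two outcomes are something other than a plain boolean cast.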
Note: np.where allows more complex conditions combined with bitwise operators, as well as replacement cases such as leaving the existing value unchanged ('bypass') when the condition is False:
df['col1'] = np.where((df['col1'] == df['col2']) &
                      (~df['col1'].str.startswith('r')),
                      'replace', df['col1'])
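A runnable version of the compound condition above might look like the following sketch. The DataFrame df3 and its string values are hypothetical and chosen only so that each branch of the condition is exercised.

```python
import numpy as np
import pandas as pd

# Hypothetical string columns, purely for illustration
df3 = pd.DataFrame({'col1': ['red', 'blue', 'green', 'rose'],
                    'col2': ['red', 'blue', 'grey', 'rose']})

# True only where the columns match AND col1 does not start with 'r'
cond = (df3['col1'] == df3['col2']) & (~df3['col1'].str.startswith('r'))

# Replace where cond is True; otherwise keep the existing value ("bypass")
df3['col1'] = np.where(cond, 'replace', df3['col1'])
```

Only the 'blue' row satisfies both parts of the condition, so it is the only one replaced. Each comparison is wrapped in parentheses because & binds more tightly than == in Python.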