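COUNTIF in pandas (Python) over multiple columns with multiple conditions
Notice: this page is a Chinese-English parallel translation of a popular StackOverFlow question, provided under the CC BY-SA 4.0 license. If you use it, you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverFlow
Original URL: http://stackoverflow.com/questions/24810526/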
Asked by VictorHenry
I have a dataset wherein I am trying to determine the number of risk factors per person. So I have the following data:
Person_ID Age Smoker Diabetes
001 30 Y N
002 45 N N
003 27 N Y
004 18 Y Y
005 55 Y Y
Each attribute (Age, Smoker, Diabetes) has its own condition to determine whether it is a risk factor. So if Age >= 45, it's a risk factor. Smoker and Diabetes are risk factors if they are "Y". What I would like is to add a column that adds up the number of risk factors for each person based on those conditions. So the data would look like this:
Person_ID Age Smoker Diabetes Risk_Factors
001 30 Y N 1
002 25 N N 0
003 27 N Y 1
004 18 Y Y 2
005 55 Y Y 3
I have a sample dataset that I was fooling around with in Excel, and the way I did it there was to use the COUNTIF formula like so:
=COUNTIF(B2,">45") + COUNTIF(C2,"=Y") + COUNTIF(D2,"=Y")
However, the actual dataset that I will be using is way too large for Excel, so I'm learning pandas for python. I wish I could provide examples of what I've already tried, but frankly I don't even know where to start. I looked at this question, but it doesn't really address what to do about applying it to an entire new column using different conditions from multiple columns. Any suggestions?
Accepted answer by ZJS
If you want to stick with pandas, you can use the following:
Solution
# 1 if the cell is 'Y', otherwise 0
isY = lambda x: int(x == 'Y')
# Sum the three per-row risk flags
countRiskFactors = lambda row: isY(row['Smoker']) + isY(row['Diabetes']) + int(row["Age"] > 45)
df['Risk_Factors'] = df.apply(countRiskFactors, axis=1)
How it works
isY - a stored lambda function that checks whether the value of a cell is Y; it returns 1 if it is, otherwise 0.
countRiskFactors - adds up the risk factors for one row.
The final line uses the apply method with the axis parameter set to 1, which applies the function (the first argument) row-wise along the DataFrame and returns a Series, which is assigned to the DataFrame as a new column.
Output of print(df):
Person_ID Age Smoker Diabetes Risk_Factors
0 1 30 Y N 1
1 2 45 N N 0
2 3 27 N Y 1
3 4 18 Y Y 2
4 5 55 Y Y 3
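For readers who want to run the accepted answer end to end, here is a minimal self-contained sketch. The sample data is copied from the question; the function name count_risk_factors is a hypothetical stand-in for the lambdas above. Note that it keeps the strict "> 45" comparison used in the answer and in the Excel formula, whereas the question's prose says "Age >= 45"; with ">=", person 002 (age 45) would count one risk factor instead of zero.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Person_ID": ["001", "002", "003", "004", "005"],
    "Age": [30, 45, 27, 18, 55],
    "Smoker": ["Y", "N", "N", "Y", "Y"],
    "Diabetes": ["N", "N", "Y", "Y", "Y"],
})


def count_risk_factors(row):
    """Return the number of risk factors for one row (one person)."""
    return (
        int(row["Age"] > 45)          # strict >, matching the answer above
        + int(row["Smoker"] == "Y")
        + int(row["Diabetes"] == "Y")
    )


# axis=1 passes each row to the function as a Series.
df["Risk_Factors"] = df.apply(count_risk_factors, axis=1)
print(df)
```

Row-wise apply calls the Python function once per row, so on a very large dataset it may be noticeably slower than column-wise comparisons.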
Answer by user3846155
If you are starting from Excel and want to move to the next step up, then I would recommend MS Access. It will be a lot easier than learning pandas for Python. You should just replace the CountIf() with:
Risk Factor: IIF(Age>45, 1, 0) + IIF(Smoker="Y", 1, 0) + IIF(Diabetes="Y", 1, 0)
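For anyone who stays with pandas, the IIF expression above has a fairly close analogue in numpy.where, which picks one of two values per row based on a condition. The following is only a sketch of that correspondence, not part of this answer; the sample data is taken from the question.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Age": [30, 45, 27, 18, 55],
    "Smoker": ["Y", "N", "N", "Y", "Y"],
    "Diabetes": ["N", "N", "Y", "Y", "Y"],
})

# np.where(condition, 1, 0) mirrors IIF(condition, 1, 0),
# evaluated for every row at once.
df["Risk_Factors"] = (
    np.where(df["Age"] > 45, 1, 0)
    + np.where(df["Smoker"] == "Y", 1, 0)
    + np.where(df["Diabetes"] == "Y", 1, 0)
)
print(df)
```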
Answer by exp1orer
I would do this the following way.
- For each column, create a new boolean series using the column's condition
- Add those series row-wise
(Note that this is simpler if your Smoker and Diabetes columns are already boolean (True/False) instead of strings.)
It might look like this:
import pandas as pd
df = pd.DataFrame({'Age': [30,45,27,18,55],
                   'Smoker':['Y','N','N','Y','Y'],
                   'Diabetes': ['N','N','Y','Y','Y']})
Age Diabetes Smoker
0 30 N Y
1 45 N N
2 27 Y N
3 18 Y Y
4 55 Y Y
# Step 1: build one boolean Series per condition
risk1 = df.Age > 45
risk2 = df.Smoker == "Y"
risk3 = df.Diabetes == "Y"
# Combine the three Series column-wise into one DataFrame
risk_df = pd.concat([risk1,risk2,risk3],axis=1)
Age Smoker Diabetes
0 False True False
1 False False False
2 False False True
3 False True True
4 True True True
# Step 2: add the booleans row-wise (True counts as 1, False as 0)
df['Risk_Factors'] = risk_df.sum(axis=1)
Age Diabetes Smoker Risk_Factors
0 30 N Y 1
1 45 N N 0
2 27 Y N 1
3 18 Y Y 2
4 55 Y Y 3
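The note above about boolean columns can be made concrete. The sketch below assumes the same sample data, converts the 'Y'/'N' strings to True/False once, and then sums the flags row-wise; the variable name flags is hypothetical.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Age": [30, 45, 27, 18, 55],
    "Smoker": ["Y", "N", "N", "Y", "Y"],
    "Diabetes": ["N", "N", "Y", "Y", "Y"],
})

# Convert the 'Y'/'N' strings to real booleans once.
df["Smoker"] = df["Smoker"] == "Y"
df["Diabetes"] = df["Diabetes"] == "Y"

# With boolean columns, only Age still needs a comparison.
flags = pd.concat([df["Age"] > 45, df["Smoker"], df["Diabetes"]], axis=1)

# Summing booleans row-wise counts the True values per person.
df["Risk_Factors"] = flags.sum(axis=1)
print(df)
```

Because every step operates on whole columns, this avoids calling a Python function once per row.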