Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/24810526/

Time: 2020-08-19 05:15:45  Source: igfitidea

COUNTIF in pandas python over multiple columns with multiple conditions

python pandas dataset

Asked by VictorHenry

I have a dataset wherein I am trying to determine the number of risk factors per person. So I have the following data:

Person_ID  Age  Smoker  Diabetes
      001   30       Y         N
      002   45       N         N
      003   27       N         Y
      004   18       Y         Y
      005   55       Y         Y

Each attribute (Age, Smoker, Diabetes) has its own condition to determine whether it is a risk factor. Age is a risk factor if Age > 45. Smoker and Diabetes are risk factors if they are "Y". What I would like is to add a column that counts the risk factors for each person based on those conditions. So the data would look like this:

Person_ID  Age  Smoker  Diabetes  Risk_Factors
      001   30       Y         N             1
      002   45       N         N             0
      003   27       N         Y             1
      004   18       Y         Y             2
      005   55       Y         Y             3

I have a sample dataset that I was fooling around with in Excel, and the way I did it there was to use the COUNTIF formula like so:

=COUNTIF(B2,">45") + COUNTIF(C2,"=Y") + COUNTIF(D2,"=Y")

However, the actual dataset that I will be using is far too large for Excel, so I'm learning pandas for Python. I wish I could provide examples of what I've already tried, but frankly I don't even know where to start. I looked at this question, but it doesn't really address how to build an entire new column using different conditions from multiple columns. Any suggestions?

Accepted answer by ZJS

If you want to stick with pandas, you can use the following:

Solution

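# 1 if the cell is 'Y', otherwise 0
isY = lambda x: int(x == 'Y')
# Add up the three per-row risk checks
countRiskFactors = lambda row: isY(row['Smoker']) + isY(row['Diabetes']) + int(row['Age'] > 45)

# axis=1 passes each row of df to countRiskFactors
df['Risk_Factors'] = df.apply(countRiskFactors, axis=1)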

How it works

isY - a stored lambda function that checks whether the value of a cell is 'Y'; it returns 1 if it is, otherwise 0.
countRiskFactors - adds up the risk factors for one row.

The final line uses the apply method with the axis parameter set to 1, which applies the function (the first argument) row-wise along the DataFrame and returns a Series, which is then assigned to the DataFrame as a new column.
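
To make the role of the axis argument concrete, here is a minimal sketch. The frame `demo` and its columns `a` and `b` are hypothetical and exist only for illustration; they are not part of the original answer.

```python
import pandas as pd

# Hypothetical two-column frame, used only to illustrate the axis argument.
demo = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

# axis=1: the function receives one row at a time (as a Series),
# and the result has one value per row.
row_sums = demo.apply(lambda row: row['a'] + row['b'], axis=1)

# axis=0 (the default): the function receives one column at a time,
# and the result has one value per column.
col_sums = demo.apply(lambda col: col.sum(), axis=0)

print(row_sums.tolist())  # [11, 22, 33]
print(col_sums.tolist())  # [6, 60]
```

Because the risk-factor count is computed per person, the row-wise form (axis=1) is the one that fits here.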

Output of print(df):

   Person_ID  Age Smoker Diabetes  Risk_Factors
0          1   30      Y        N             1
1          2   45      N        N             0
2          3   27      N        Y             1
3          4   18      Y        Y             2
4          5   55      Y        Y             3
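
For readers who want to reproduce the output above, the following is a self-contained sketch of the same row-wise approach. It rebuilds the sample data from the question and uses a named function, `count_risk_factors`, in place of the lambdas; that name is an illustrative choice, not part of the original answer.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'Person_ID': [1, 2, 3, 4, 5],
    'Age': [30, 45, 27, 18, 55],
    'Smoker': ['Y', 'N', 'N', 'Y', 'Y'],
    'Diabetes': ['N', 'N', 'Y', 'Y', 'Y'],
})


def count_risk_factors(row):
    """Count how many of the three conditions hold for a single row."""
    return (int(row['Age'] > 45)
            + int(row['Smoker'] == 'Y')
            + int(row['Diabetes'] == 'Y'))


# axis=1 calls count_risk_factors once per row.
df['Risk_Factors'] = df.apply(count_risk_factors, axis=1)
print(df['Risk_Factors'].tolist())  # [1, 0, 1, 2, 3]
```

Note that apply with axis=1 calls the Python function once per row, so it can be slow on very large datasets.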

Answer by user3846155

If you are starting from Excel and want to move to the next step up, then I would recommend MS Access. It will be a lot easier than learning pandas for Python. You should just replace the CountIf() with:

Risk Factor: IIF(Age>45, 1, 0) + IIF(Smoker="Y", 1, 0) + IIF(Diabetes="Y", 1, 0)

Answer by exp1orer

I would do this the following way.

  1. For each column, create a new boolean series using the column's condition
  2. Add those series row-wise

(Note that this is simpler if your Smoker and Diabetes columns are already boolean (True/False) instead of strings.)

It might look like this:

import pandas as pd

df = pd.DataFrame({'Age': [30,45,27,18,55],
                   'Smoker':['Y','N','N','Y','Y'],
                   'Diabetes': ['N','N','Y','Y','Y']})

   Age Diabetes Smoker
0   30        N      Y
1   45        N      N
2   27        Y      N
3   18        Y      Y
4   55        Y      Y

#Step 1
risk1 = df.Age > 45
risk2 = df.Smoker == "Y"
risk3 = df.Diabetes == "Y"
risk_df = pd.concat([risk1,risk2,risk3],axis=1)

     Age Smoker Diabetes
0  False   True    False
1  False  False    False
2  False  False     True
3  False   True     True
4   True   True     True

#Step 2
df['Risk_Factors'] = risk_df.sum(axis=1)

   Age Diabetes Smoker  Risk_Factors
0   30        N      Y             1
1   45        N      N             0
2   27        Y      N             1
3   18        Y      Y             2
4   55        Y      Y             3
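
The earlier note about boolean columns can be sketched as follows. This is one possible variant, not part of the original answer: it assumes you are willing to overwrite the "Y"/"N" strings with True/False, after which the count is a single vectorized expression with no intermediate DataFrame.

```python
import pandas as pd

df = pd.DataFrame({'Age': [30, 45, 27, 18, 55],
                   'Smoker': ['Y', 'N', 'N', 'Y', 'Y'],
                   'Diabetes': ['N', 'N', 'Y', 'Y', 'Y']})

# Convert the "Y"/"N" strings to booleans once, up front.
df['Smoker'] = df['Smoker'].eq('Y')
df['Diabetes'] = df['Diabetes'].eq('Y')

# Each term is a 0/1 integer Series, so adding them counts the risk factors per row.
df['Risk_Factors'] = (df['Age'].gt(45).astype(int)
                      + df['Smoker'].astype(int)
                      + df['Diabetes'].astype(int))

print(df['Risk_Factors'].tolist())  # [1, 0, 1, 2, 3]
```

Because every operation here works on whole columns, this form avoids calling a Python function once per row.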