Notice: this page reproduces a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/50375985/

Time: 2020-08-19 19:28:35 (source: igfitidea)

Pandas add column with value based on condition based on other columns

Tags: python, pandas

Asked by Rutger Hofste

I have the following pandas dataframe:

import pandas as pd
import numpy as np

d = {'age' : [21, 45, 45, 5],
     'salary' : [20, 40, 10, 100]}

df = pd.DataFrame(d)

and would like to add an extra column called "is_rich" which captures if a person is rich depending on his/her salary. I found multiple ways to accomplish this:

# method 1
df['is_rich_method1'] = np.where(df['salary']>=50, 'yes', 'no')

# method 2
df['is_rich_method2'] = ['yes' if x >= 50 else 'no' for x in df['salary']]

# method 3
df['is_rich_method3'] = 'no'
df.loc[df['salary'] >= 50, 'is_rich_method3'] = 'yes'  # >= to match methods 1 and 2

resulting in:

   age  salary is_rich_method1 is_rich_method2 is_rich_method3
0   21      20              no              no              no
1   45      40              no              no              no
2   45      10              no              no              no
3    5     100             yes             yes             yes

However, I don't understand which approach is preferred. Are all of these methods equally good, or does the best choice depend on the application?

Answered by cs95

Use the timeits, Luke!

Conclusion
List comprehensions perform the best on smaller amounts of data because they incur very little overhead, even though they are not vectorized. OTOH, on larger data, loc and numpy.where perform better: vectorisation wins the day.

Keep in mind that the applicability of a method depends on your data, the number of conditions, and the data type of your columns. My suggestion is to test various methods on your data before settling on an option.
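
As an illustration of how the number of conditions and the column dtype can change the picture, here is a minimal sketch using numpy.select for more than two outcomes, plus a plain boolean column as an alternative to 'yes'/'no' strings. The salary_band column and the thresholds 30 and 50 are made up for this example and are not part of the original question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'age': [21, 45, 45, 5],
                   'salary': [20, 40, 10, 100]})

# Hypothetical three-way banding; the thresholds are illustrative only.
conditions = [df['salary'] >= 50, df['salary'] >= 30]
choices = ['high', 'medium']

# For each row, np.select takes the choice belonging to the first condition
# that is True, and falls back to `default` when no condition matches.
df['salary_band'] = np.select(conditions, choices, default='low')

# If only a yes/no answer is needed, a boolean column avoids strings entirely.
df['is_rich'] = df['salary'] >= 50

print(df)
```

With a single condition, numpy.where stays the simpler choice; numpy.select mainly helps once nested where calls would become hard to read.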

One sure takeaway from here, however, is that list comprehensions are pretty competitive: they're implemented in C and are highly optimised for performance.

Benchmarking code, for reference. Here are the functions being timed:

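def numpy_where(df):
    # Vectorised: one NumPy call builds the whole column.
    return df.assign(is_rich=np.where(df['salary'] >= 50, 'yes', 'no'))

def list_comp(df):
    # Pure-Python loop over the values; not vectorised.
    return df.assign(is_rich=['yes' if x >= 50 else 'no' for x in df['salary']])

def loc(df):
    # Fill with the default, then overwrite the rows that match the mask.
    df = df.assign(is_rich='no')
    df.loc[df['salary'] >= 50, 'is_rich'] = 'yes'  # >= to match the other two
    return df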
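
The timing setup itself is not shown above, so the following is only a rough sketch of how these three functions could be compared with the standard library's timeit module. The helper time_methods, the random salary data, and the chosen row counts are assumptions for illustration, not the code that produced the original plot.

```python
import timeit

import numpy as np
import pandas as pd


def numpy_where(df):
    return df.assign(is_rich=np.where(df['salary'] >= 50, 'yes', 'no'))


def list_comp(df):
    return df.assign(is_rich=['yes' if x >= 50 else 'no' for x in df['salary']])


def loc(df):
    df = df.assign(is_rich='no')
    df.loc[df['salary'] >= 50, 'is_rich'] = 'yes'
    return df


def time_methods(n_rows, number=5):
    """Return the best-of-3 seconds per call for each function (hypothetical helper)."""
    rng = np.random.default_rng(0)  # fixed seed so every run sees the same data
    df = pd.DataFrame({'salary': rng.integers(0, 100, size=n_rows)})
    timings = {}
    for func in (numpy_where, list_comp, loc):
        # timeit.repeat returns one total per repeat; keep the fastest.
        best = min(timeit.repeat(lambda: func(df), number=number, repeat=3))
        timings[func.__name__] = best / number
    return timings


for n in (10, 1_000, 100_000):
    print(n, time_methods(n))
```

Absolute numbers will vary with the machine and the pandas and NumPy versions, so only the relative ordering at each size is worth reading from the output.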