Python pandas: add a column with values based on a condition on another column

Question

Asked by Rutger Hofste

I have the following pandas dataframe:

import pandas as pd
import numpy as np

d = {'age' : [21, 45, 45, 5],
     'salary' : [20, 40, 10, 100]}

df = pd.DataFrame(d)

and would like to add an extra column called "is_rich" which captures if a person is rich depending on his/her salary. I found multiple ways to accomplish this:

# method 1
df['is_rich_method1'] = np.where(df['salary']>=50, 'yes', 'no')

# method 2
df['is_rich_method2'] = ['yes' if x >= 50 else 'no' for x in df['salary']]

# method 3
df['is_rich_method3'] = 'no'
df.loc[df['salary'] >= 50, 'is_rich_method3'] = 'yes'

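resulting in the same values in all three columns for this data: 'no' for the first three rows (salaries 20, 40 and 10) and 'yes' for the last row (salary 100).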

However, I don't understand which way is preferred. Are all of these methods equally good, or does the choice depend on the application?

Answer 1

Answered by cs95

Use the timeits, Luke!

Conclusion
List comprehensions perform the best on smaller amounts of data because they incur very little overhead, even though they are not vectorized. OTOH, on larger data, loc and numpy.where perform better: vectorisation wins the day.

Keep in mind that the applicability of a method depends on your data, the number of conditions, and the data type of your columns. My suggestion is to test various methods on your data before settling on an option.
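
To illustrate how the number of conditions can change which method is convenient, here is a minimal sketch with two thresholds instead of one. It uses numpy.select, which is not one of the methods timed in this answer. The column name `wealth`, the labels, and the thresholds 30 and 50 are made up for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'age': [21, 45, 45, 5],
                   'salary': [20, 40, 10, 100]})

# Hypothetical three-way split; the thresholds and labels are assumptions
# chosen only for illustration.
conditions = [df['salary'] >= 50, df['salary'] >= 30]
choices = ['rich', 'middle']

# For each row, np.select takes the choice belonging to the first condition
# that is True, and falls back to `default` when no condition matches.
df['wealth'] = np.select(conditions, choices, default='poor')
print(df)
```

With a single condition, np.where is usually enough. With several, nesting np.where calls gets hard to read, so something like the above may be easier to maintain. It is still worth timing on your own data.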

One sure takeaway from here, however, is that list comprehensions are pretty competitive: they're implemented in C and are highly optimised for performance.

Benchmarking code, for reference. Here are the functions being timed:

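def numpy_where(df):
    # Vectorised: pick 'yes' or 'no' element-wise from a boolean mask.
    return df.assign(is_rich=np.where(df['salary'] >= 50, 'yes', 'no'))

def list_comp(df):
    # Plain Python loop over the column values.
    return df.assign(is_rich=['yes' if x >= 50 else 'no' for x in df['salary']])

def loc(df):
    # Fill with the default, then overwrite the rows matching the mask.
    # Uses >= 50 so all three functions compute the same column.
    df = df.assign(is_rich='no')
    df.loc[df['salary'] >= 50, 'is_rich'] = 'yes'
    return df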
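
The timing harness itself is not shown above, so the following is only a rough sketch of how these three functions could be timed with the standard-library timeit module. It is not necessarily the setup used to reach the conclusion. The helpers `make_frame` and `time_methods`, the random test data, and the chosen sizes are all assumptions for illustration. The three functions are repeated (with `>= 50` in each) so that the block runs on its own.

```python
import timeit

import numpy as np
import pandas as pd


def numpy_where(df):
    return df.assign(is_rich=np.where(df['salary'] >= 50, 'yes', 'no'))


def list_comp(df):
    return df.assign(is_rich=['yes' if x >= 50 else 'no' for x in df['salary']])


def loc(df):
    df = df.assign(is_rich='no')
    df.loc[df['salary'] >= 50, 'is_rich'] = 'yes'
    return df


def make_frame(n, seed=0):
    """Hypothetical helper: build a random frame with n rows."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame({
        'age': rng.integers(1, 90, size=n),
        'salary': rng.integers(0, 200, size=n),
    })


def time_methods(sizes=(10, 1_000, 100_000), repeats=5):
    """Hypothetical helper: best-of-`repeats` wall time per (method, size)."""
    results = {}
    for n in sizes:
        frame = make_frame(n)
        for func in (numpy_where, list_comp, loc):
            # Each measurement is a single call; keep the fastest one.
            runs = timeit.repeat(lambda: func(frame), number=1, repeat=repeats)
            results[(func.__name__, n)] = min(runs)
    return results


if __name__ == '__main__':
    for (name, n), seconds in time_methods().items():
        print(f'{name:<12} n={n:<8} {seconds * 1e6:12.1f} us')
```

Absolute numbers will vary with hardware and with pandas and NumPy versions, so treat the output as a relative comparison only.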
