Python: Using a conditional to generate a new column in a pandas DataFrame

Notice: this page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/27041724/

Date: 2020-08-19 01:21:53  Source: igfitidea

Using conditional to generate new column in pandas dataframe

Tags: python, pandas, conditional, calculated-columns

Asked by user3786999

I have a pandas dataframe that looks like this:

   portion  used
0        1   1.0
1        2   0.3
2        3   0.0
3        4   0.8

I'd like to create a new column based on the used column, so that the df looks like this:

   portion  used    alert
0        1   1.0     Full
1        2   0.3  Partial
2        3   0.0    Empty
3        4   0.8  Partial
  • Create a new alert column based on used:
  • If used is 1.0, alert should be Full.
  • If used is 0.0, alert should be Empty.
  • Otherwise, alert should be Partial.

What's the best way to do that?

Answered by Ffisegydd

You can define a function which returns your different states "Full", "Partial", "Empty", etc. and then use df.apply to apply the function to each row. Note that you have to pass the keyword argument axis=1 to ensure that it applies the function to rows.

import pandas as pd

def alert(c):
  if c['used'] == 1.0:
    return 'Full'
  elif c['used'] == 0.0:
    return 'Empty'
  elif 0.0 < c['used'] < 1.0:
    return 'Partial'
  else:
    return 'Undefined'

df = pd.DataFrame(data={'portion':[1, 2, 3, 4], 'used':[1.0, 0.3, 0.0, 0.8]})

df['alert'] = df.apply(alert, axis=1)

#    portion  used    alert
# 0        1   1.0     Full
# 1        2   0.3  Partial
# 2        3   0.0    Empty
# 3        4   0.8  Partial
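
Since alert() only reads the used column, one possible variation is to apply a scalar function to that column alone, which avoids the axis=1 argument. This is a minimal sketch, not part of the original answer; the helper name alert_from_value is hypothetical.

```python
import pandas as pd

def alert_from_value(used):
    # Hypothetical scalar variant of alert(): receives one value, not a row.
    if used == 1.0:
        return 'Full'
    elif used == 0.0:
        return 'Empty'
    elif 0.0 < used < 1.0:
        return 'Partial'
    else:
        return 'Undefined'

df = pd.DataFrame(data={'portion': [1, 2, 3, 4], 'used': [1.0, 0.3, 0.0, 0.8]})

# Series.apply calls the function once per value in the column.
df['alert'] = df['used'].apply(alert_from_value)
```

This is still a Python-level loop over the values, so it should be read as a readability tweak rather than a vectorized solution.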

Answered by Primer

Alternatively you could do:

import pandas as pd
import numpy as np
df = pd.DataFrame(data={'portion':np.arange(10000), 'used':np.random.rand(10000)})

%%timeit
df.loc[df['used'] == 1.0, 'alert'] = 'Full'
df.loc[df['used'] == 0.0, 'alert'] = 'Empty'
df.loc[(df['used'] >0.0) & (df['used'] < 1.0), 'alert'] = 'Partial'

This gives the same output but runs about 100 times faster on 10,000 rows:

100 loops, best of 3: 2.91 ms per loop

For comparison, using apply:

%timeit df['alert'] = df.apply(alert, axis=1)

1 loops, best of 3: 287 ms per loop

I guess the choice depends on how big your dataframe is.
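
The %timeit and %%timeit commands above only work inside IPython or Jupyter. A rough sketch of the same comparison as a plain script, using the standard timeit module, might look like the following. The helper names with_loc and with_apply, the fixed random seed, and pre-creating the alert column are assumptions added for this sketch; absolute timings will vary by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

def alert(c):
    # Row-wise function from the apply-based answer.
    if c['used'] == 1.0:
        return 'Full'
    elif c['used'] == 0.0:
        return 'Empty'
    elif 0.0 < c['used'] < 1.0:
        return 'Partial'
    else:
        return 'Undefined'

def with_loc(df):
    # Boolean-mask assignment: one df.loc call per category.
    out = df.copy()
    # Assumption: pre-create the column so string assignment does not
    # rely on dtype upcasting (not in the original answer).
    out['alert'] = None
    out.loc[out['used'] == 1.0, 'alert'] = 'Full'
    out.loc[out['used'] == 0.0, 'alert'] = 'Empty'
    out.loc[(out['used'] > 0.0) & (out['used'] < 1.0), 'alert'] = 'Partial'
    return out

def with_apply(df):
    # Row-wise apply: calls alert() once per row.
    out = df.copy()
    out['alert'] = out.apply(alert, axis=1)
    return out

rng = np.random.default_rng(0)
df = pd.DataFrame({'portion': np.arange(10000), 'used': rng.random(10000)})

runs = 3
loc_seconds = timeit.timeit(lambda: with_loc(df), number=runs) / runs
apply_seconds = timeit.timeit(lambda: with_apply(df), number=runs) / runs
print(f'df.loc:   {loc_seconds * 1000:.2f} ms per call')
print(f'df.apply: {apply_seconds * 1000:.2f} ms per call')
```

Both helpers copy the frame first, so the copy cost is included in each measurement.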

Answered by Zero

Use np.where, which is usually fast:

In [845]: df['alert'] = np.where(df.used == 1, 'Full', 
                                 np.where(df.used == 0, 'Empty', 'Partial'))

In [846]: df
Out[846]:
   portion  used    alert
0        1   1.0     Full
1        2   0.3  Partial
2        3   0.0    Empty
3        4   0.8  Partial


Timings

In [848]: df.shape
Out[848]: (100000, 3)

In [849]: %timeit df['alert'] = np.where(df.used == 1, 'Full', np.where(df.used == 0, 'Empty', 'Partial'))
100 loops, best of 3: 6.17 ms per loop

In [850]: %%timeit
     ...: df.loc[df['used'] == 1.0, 'alert'] = 'Full'
     ...: df.loc[df['used'] == 0.0, 'alert'] = 'Empty'
     ...: df.loc[(df['used'] >0.0) & (df['used'] < 1.0), 'alert'] = 'Partial'
     ...:
10 loops, best of 3: 21.9 ms per loop

In [851]: %timeit df['alert'] = df.apply(alert, axis=1)
1 loop, best of 3: 2.79 s per loop
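
Nested np.where calls can become hard to read once there are more than two conditions. One alternative, not used in this answer, is NumPy's np.select, which takes a list of conditions, a list of matching choices, and a default. A minimal sketch on the four-row example data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'portion': [1, 2, 3, 4], 'used': [1.0, 0.3, 0.0, 0.8]})

# Conditions are checked in order; the first one that is True wins.
conditions = [df['used'] == 1, df['used'] == 0]
choices = ['Full', 'Empty']

# Rows that match no condition fall back to the default.
df['alert'] = np.select(conditions, choices, default='Partial')
```

Whether this is faster or slower than nested np.where is not measured here.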

Answered by Spcogg the second

I can't comment, so I'm posting a new answer. Improving on Ffisegydd's approach, you can use a dictionary and the dict.get() method to make the function passed to .apply() easier to manage:

import pandas as pd

def alert(c):
    mapping = {1.0: 'Full', 0.0: 'Empty'}
    return mapping.get(c['used'], 'Partial')

df = pd.DataFrame(data={'portion':[1, 2, 3, 4], 'used':[1.0, 0.3, 0.0, 0.8]})

df['alert'] = df.apply(alert, axis=1)

Depending on the use case, you might like to define the dict outside of the function definition as well.
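
A sketch of that variation, with the dict defined once at module level. The name ALERT_LABELS is hypothetical. The last line shows a related option that is not in the original answer: Series.map with the same dict, followed by fillna for values that have no entry.

```python
import pandas as pd

# Hypothetical module-level mapping, built once instead of on every call.
ALERT_LABELS = {1.0: 'Full', 0.0: 'Empty'}

def alert(c):
    # Any value without an entry in the dict falls back to 'Partial'.
    return ALERT_LABELS.get(c['used'], 'Partial')

df = pd.DataFrame(data={'portion': [1, 2, 3, 4], 'used': [1.0, 0.3, 0.0, 0.8]})

df['alert'] = df.apply(alert, axis=1)

# Alternative without apply: map known values, then fill the rest.
df['alert_mapped'] = df['used'].map(ALERT_LABELS).fillna('Partial')
```

Keeping the dict outside the function also lets other code reuse or extend the labels.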

Answered by user1857373

df['TaxStatus'] = np.where(df.Public == 1, True, np.where(df.Public == 2, False))

This looks like it should work, but it raises ValueError: either both or neither of x and y should be given. The inner np.where call is passed x (False) but no y.
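
One way to avoid that error is to give the inner np.where a third argument, so every call receives a condition, x, and y. The sketch below is an assumption-laden illustration: the answer does not show its data, so the three-row frame is made up, and using None for values other than 1 and 2 is a guess at the intended fallback.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the original answer does not show its DataFrame.
df = pd.DataFrame({'Public': [1, 2, 3]})

# Both np.where calls now receive a condition, x, and y.
# None as the fallback for other values is an assumption.
df['TaxStatus'] = np.where(df['Public'] == 1, True,
                           np.where(df['Public'] == 2, False, None))
```

Because the fallback is None, the resulting column has object dtype rather than bool.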
