Python: Create a new column in a pandas DataFrame using conditions

Question

Asked by user3786999

I have a pandas dataframe that looks like this:

   portion  used
0        1   1.0
1        2   0.3
2        3   0.0
3        4   0.8

I'd like to create a new column based on the used column, so that the df looks like this:

   portion  used    alert
0        1   1.0     Full
1        2   0.3  Partial
2        3   0.0    Empty
3        4   0.8  Partial

Create a new alert column based on the following rules:
If used is 1.0, alert should be Full.
If used is 0.0, alert should be Empty.
Otherwise, alert should be Partial.

What's the best way to do that?

Answer 1

Answered by Ffisegydd

You can define a function that returns your different states ("Full", "Partial", "Empty", etc.) and then use df.apply to apply the function to each row. Note that you have to pass the keyword argument axis=1 to ensure that the function is applied to rows rather than columns.

import pandas as pd

def alert(c):
  if c['used'] == 1.0:
    return 'Full'
  elif c['used'] == 0.0:
    return 'Empty'
  elif 0.0 < c['used'] < 1.0:
    return 'Partial'
  else:
    return 'Undefined'

df = pd.DataFrame(data={'portion':[1, 2, 3, 4], 'used':[1.0, 0.3, 0.0, 0.8]})

df['alert'] = df.apply(alert, axis=1)

#    portion  used    alert
# 0        1   1.0     Full
# 1        2   0.3  Partial
# 2        3   0.0    Empty
# 3        4   0.8  Partial

Answer 2

Answered by Primer

Alternatively, you could assign each state with a boolean mask and df.loc:

import pandas as pd
import numpy as np
df = pd.DataFrame(data={'portion':np.arange(10000), 'used':np.random.rand(10000)})

%%timeit
df.loc[df['used'] == 1.0, 'alert'] = 'Full'
df.loc[df['used'] == 0.0, 'alert'] = 'Empty'
df.loc[(df['used'] >0.0) & (df['used'] < 1.0), 'alert'] = 'Partial'

This gives the same output but runs about 100 times faster on 10,000 rows:

100 loops, best of 3: 2.91 ms per loop

For comparison, using apply:

%timeit df['alert'] = df.apply(alert, axis=1)

1 loops, best of 3: 287 ms per loop

I guess the choice depends on how big your dataframe is.
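
As a quick sanity check on the "same output" claim, here is a minimal sketch that runs both approaches on the four-row frame from the question and compares the results. The names by_apply, by_loc and same are illustrative and do not come from the answers. One difference worth knowing: for values outside the range 0 to 1, the apply version returns 'Undefined', while the .loc version leaves the cell as NaN.

```python
import pandas as pd


def alert(row):
    """Row-wise classifier, as in the apply-based answer."""
    if row['used'] == 1.0:
        return 'Full'
    elif row['used'] == 0.0:
        return 'Empty'
    elif 0.0 < row['used'] < 1.0:
        return 'Partial'
    return 'Undefined'


df = pd.DataFrame({'portion': [1, 2, 3, 4], 'used': [1.0, 0.3, 0.0, 0.8]})

# Approach 1: call the Python function once per row.
by_apply = df.apply(alert, axis=1)

# Approach 2: three vectorized boolean-mask assignments on a copy.
by_loc = df.copy()
by_loc.loc[by_loc['used'] == 1.0, 'alert'] = 'Full'
by_loc.loc[by_loc['used'] == 0.0, 'alert'] = 'Empty'
by_loc.loc[(by_loc['used'] > 0.0) & (by_loc['used'] < 1.0), 'alert'] = 'Partial'

# Compare as plain lists so dtype differences do not matter.
same = by_apply.tolist() == by_loc['alert'].tolist()
print(same)
```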

Answer 3

Answered by Zero

Use np.where, which is usually fast:

In [845]: df['alert'] = np.where(df.used == 1, 'Full', 
                                 np.where(df.used == 0, 'Empty', 'Partial'))

In [846]: df
Out[846]:
   portion  used    alert
0        1   1.0     Full
1        2   0.3  Partial
2        3   0.0    Empty
3        4   0.8  Partial
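
Nested np.where calls get harder to read as the number of states grows. A possible alternative, not part of the original answer, is np.select, which takes a list of conditions and a matching list of choices. The sketch below assumes the same four-row frame as the question; the names conditions and choices are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'portion': [1, 2, 3, 4], 'used': [1.0, 0.3, 0.0, 0.8]})

# Each condition is paired with the choice at the same position;
# rows matching no condition receive the default.
conditions = [df['used'] == 1.0, df['used'] == 0.0]
choices = ['Full', 'Empty']

df['alert'] = np.select(conditions, choices, default='Partial')
print(df)
```

The default is given as a string here so that it has the same type as the choices.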

Timings

In [848]: df.shape
Out[848]: (100000, 3)

In [849]: %timeit df['alert'] = np.where(df.used == 1, 'Full', np.where(df.used == 0, 'Empty', 'Partial'))
100 loops, best of 3: 6.17 ms per loop

In [850]: %%timeit
     ...: df.loc[df['used'] == 1.0, 'alert'] = 'Full'
     ...: df.loc[df['used'] == 0.0, 'alert'] = 'Empty'
     ...: df.loc[(df['used'] >0.0) & (df['used'] < 1.0), 'alert'] = 'Partial'
     ...:
10 loops, best of 3: 21.9 ms per loop

In [851]: %timeit df['alert'] = df.apply(alert, axis=1)
1 loop, best of 3: 2.79 s per loop

Answer 4

Answered by Spcogg the second

I can't comment, so I'm posting a new answer. Improving on Ffisegydd's approach, you can use a dictionary and the dict.get() method to make the function passed to .apply() easier to manage:

import pandas as pd

def alert(c):
    mapping = {1.0: 'Full', 0.0: 'Empty'}
    return mapping.get(c['used'], 'Partial')

df = pd.DataFrame(data={'portion':[1, 2, 3, 4], 'used':[1.0, 0.3, 0.0, 0.8]})

df['alert'] = df.apply(alert, axis=1)

Depending on the use case, you might like to define the dict outside of the function definition as well.
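
A minimal sketch of that variation, assuming a module-level dictionary named ALERT_BY_USED (a hypothetical name). It also shows Series.map with fillna as a vectorized alternative to apply; that part is an addition for illustration, not something the answer itself proposes.

```python
import pandas as pd

# Hypothetical module-level mapping, defined once outside the function.
ALERT_BY_USED = {1.0: 'Full', 0.0: 'Empty'}


def alert(row):
    # Any value not in the mapping falls back to 'Partial'.
    return ALERT_BY_USED.get(row['used'], 'Partial')


df = pd.DataFrame({'portion': [1, 2, 3, 4], 'used': [1.0, 0.3, 0.0, 0.8]})
df['alert'] = df.apply(alert, axis=1)

# Alternative: map the column through the dict; unmapped values become
# NaN, which fillna then replaces with 'Partial'.
via_map = df['used'].map(ALERT_BY_USED).fillna('Partial')
print(df)
```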

Answer 5

Answered by user1857373

df['TaxStatus'] = np.where(df.Public == 1, True, np.where(df.Public == 2, False))

This looks like it should work, but it raises ValueError: either both or neither of x and y should be given
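
The error comes from the inner np.where call, which is given a condition and only one value argument. np.where accepts either the condition alone or the condition together with both a "true" value and a "false" value. A minimal sketch of one way to fix it follows. The sample data and the choice of None as the fallback for values other than 1 and 2 are assumptions, since the answer does not say what those rows should contain.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the Public column and its 1/2 coding are assumed.
df = pd.DataFrame({'Public': [1, 2, 1, 3]})

# The inner call now supplies both value arguments (False and None),
# so np.where no longer raises ValueError.
df['TaxStatus'] = np.where(
    df.Public == 1, True,
    np.where(df.Public == 2, False, None),
)
print(df)
```

If Public only ever holds 1 or 2, the simpler df['TaxStatus'] = df.Public == 1 gives a plain boolean column.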
