Python: using a conditional to generate a new column in a pandas DataFrame
原文地址: http://stackoverflow.com/questions/27041724/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me):
StackOverflow
Using conditional to generate new column in pandas dataframe
Asked by user3786999
I have a pandas dataframe that looks like this:
   portion  used
0        1   1.0
1        2   0.3
2        3   0.0
3        4   0.8
I'd like to create a new column based on the used column, so that the df looks like this:
   portion  used    alert
0        1   1.0     Full
1        2   0.3  Partial
2        3   0.0    Empty
3        4   0.8  Partial
- Create a new alert column based on:
  - If used is 1.0, alert should be Full.
  - If used is 0.0, alert should be Empty.
  - Otherwise, alert should be Partial.
What's the best way to do that?
Answered by Ffisegydd
You can define a function that returns your different states ("Full", "Partial", "Empty", etc.) and then use df.apply to apply the function to each row. Note that you have to pass the keyword argument axis=1 to ensure that the function is applied to rows.
import pandas as pd

def alert(c):
    # c is a single row of the DataFrame, because apply is called with axis=1
    if c['used'] == 1.0:
        return 'Full'
    elif c['used'] == 0.0:
        return 'Empty'
    elif 0.0 < c['used'] < 1.0:
        return 'Partial'
    else:
        return 'Undefined'

df = pd.DataFrame(data={'portion': [1, 2, 3, 4], 'used': [1.0, 0.3, 0.0, 0.8]})
df['alert'] = df.apply(alert, axis=1)
#    portion  used    alert
# 0        1   1.0     Full
# 1        2   0.3  Partial
# 2        3   0.0    Empty
# 3        4   0.8  Partial
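Since the function only reads the used column, one variation is to apply it to that column alone, which avoids the need for axis=1. The following is a minimal sketch of that idea, not part of the original answer; the helper name classify_used is hypothetical.

```python
import pandas as pd

def classify_used(value):
    """Map a single 'used' value to an alert label (hypothetical helper)."""
    if value == 1.0:
        return 'Full'
    if value == 0.0:
        return 'Empty'
    if 0.0 < value < 1.0:
        return 'Partial'
    return 'Undefined'

df = pd.DataFrame({'portion': [1, 2, 3, 4], 'used': [1.0, 0.3, 0.0, 0.8]})

# Series.apply passes each scalar value to the function, so no axis argument is needed.
df['alert'] = df['used'].apply(classify_used)
print(df)
```

The function receives a plain number instead of a whole row, so it can also be tested on its own, for example classify_used(0.3).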
Answered by Primer
Alternatively, you could do:
import pandas as pd
import numpy as np
df = pd.DataFrame(data={'portion':np.arange(10000), 'used':np.random.rand(10000)})
%%timeit
df.loc[df['used'] == 1.0, 'alert'] = 'Full'
df.loc[df['used'] == 0.0, 'alert'] = 'Empty'
df.loc[(df['used'] >0.0) & (df['used'] < 1.0), 'alert'] = 'Partial'
This gives the same output but runs about 100 times faster on 10,000 rows:
100 loops, best of 3: 2.91 ms per loop
For comparison, using apply:
%timeit df['alert'] = df.apply(alert, axis=1)
1 loops, best of 3: 287 ms per loop
I guess the choice depends on how big your dataframe is.
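One difference from the apply version is worth noting: with the boolean-mask approach, a row that matches none of the masks is left as NaN rather than labelled 'Undefined'. The sketch below illustrates this and shows one possible way to handle it, which is to set a default first and then overwrite the matching rows. The extra row with used = 1.5 is made-up data added only to trigger the fallback.

```python
import pandas as pd

# Sample data from the question plus one hypothetical out-of-range row (used = 1.5).
data = {'portion': [1, 2, 3, 4, 5], 'used': [1.0, 0.3, 0.0, 0.8, 1.5]}

# Masks only: the row that matches no mask ends up as NaN.
masked = pd.DataFrame(data)
masked.loc[masked['used'] == 1.0, 'alert'] = 'Full'
masked.loc[masked['used'] == 0.0, 'alert'] = 'Empty'
masked.loc[(masked['used'] > 0.0) & (masked['used'] < 1.0), 'alert'] = 'Partial'

# Default first, then overwrite: unmatched rows keep the default label.
defaulted = pd.DataFrame(data)
defaulted['alert'] = 'Undefined'
defaulted.loc[defaulted['used'] == 1.0, 'alert'] = 'Full'
defaulted.loc[defaulted['used'] == 0.0, 'alert'] = 'Empty'
defaulted.loc[(defaulted['used'] > 0.0) & (defaulted['used'] < 1.0), 'alert'] = 'Partial'

print(masked)
print(defaulted)
```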
Answered by Zero
Use np.where, which is usually fast:
In [845]: df['alert'] = np.where(df.used == 1, 'Full',
                                 np.where(df.used == 0, 'Empty', 'Partial'))
In [846]: df
Out[846]:
   portion  used    alert
0        1   1.0     Full
1        2   0.3  Partial
2        3   0.0    Empty
3        4   0.8  Partial
Timings
In [848]: df.shape
Out[848]: (100000, 3)
In [849]: %timeit df['alert'] = np.where(df.used == 1, 'Full', np.where(df.used == 0, 'Empty', 'Partial'))
100 loops, best of 3: 6.17 ms per loop
In [850]: %%timeit
...: df.loc[df['used'] == 1.0, 'alert'] = 'Full'
...: df.loc[df['used'] == 0.0, 'alert'] = 'Empty'
...: df.loc[(df['used'] >0.0) & (df['used'] < 1.0), 'alert'] = 'Partial'
...:
10 loops, best of 3: 21.9 ms per loop
In [851]: %timeit df['alert'] = df.apply(alert, axis=1)
1 loop, best of 3: 2.79 s per loop
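Nested np.where calls become hard to read once there are more than two or three conditions. One option that the answer above does not mention is np.select, which takes a list of conditions, a matching list of choices, and a default. The following is a minimal sketch using the sample data from the question; it has not been timed against the figures above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'portion': [1, 2, 3, 4], 'used': [1.0, 0.3, 0.0, 0.8]})

# Conditions are checked in order; the first one that is True picks the label.
conditions = [df['used'] == 1.0, df['used'] == 0.0]
choices = ['Full', 'Empty']

# Rows that match no condition receive the default.
df['alert'] = np.select(conditions, choices, default='Partial')
print(df)
```

Adding another state then means appending one entry to each list rather than nesting another call.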
Answered by Spcogg the second
Can't comment, so making a new answer: improving on Ffisegydd's approach, you can use a dictionary and the dict.get() method to make the function passed to .apply() easier to manage:
import pandas as pd

def alert(c):
    # Exact matches come from the dictionary; anything else falls back to 'Partial'
    mapping = {1.0: 'Full', 0.0: 'Empty'}
    return mapping.get(c['used'], 'Partial')

df = pd.DataFrame(data={'portion': [1, 2, 3, 4], 'used': [1.0, 0.3, 0.0, 0.8]})
df['alert'] = df.apply(alert, axis=1)
Depending on the use case, you might like to define the dict outside of the function definition as well.
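The same dictionary can also be used without a row-wise function, by passing it to Series.map and filling the unmatched values afterwards. This is a sketch of that variation, not part of the original answer. One assumption to be aware of: any NaN already present in used would also be labelled 'Partial' by the fillna step.

```python
import pandas as pd

# Dictionary defined outside any function, as the answer suggests may be useful.
mapping = {1.0: 'Full', 0.0: 'Empty'}

df = pd.DataFrame({'portion': [1, 2, 3, 4], 'used': [1.0, 0.3, 0.0, 0.8]})

# map() returns NaN for values that are not keys of the dictionary;
# fillna() then supplies the fallback label for those rows.
df['alert'] = df['used'].map(mapping).fillna('Partial')
print(df)
```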
Answered by user1857373
df['TaxStatus'] = np.where(df.Public == 1, True, np.where(df.Public == 2, False))
This would appear to work, except that it raises a ValueError: either both or neither of x and y should be given
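The error comes from the inner np.where call, which is given a condition and a value for the True case but nothing for the False case. np.where needs either the condition alone or the condition plus both values. A possible fix is to supply a fallback for rows where Public is neither 1 nor 2. The sketch below uses None as that fallback, which is an assumption; the sample data is made up, and only the column name Public and the codes 1 and 2 come from the answer.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: 1 and 2 are the known codes, 3 is an unexpected one.
df = pd.DataFrame({'Public': [1, 2, 3]})

# The inner call now has both an x (False) and a y (None), so no ValueError is raised.
df['TaxStatus'] = np.where(df['Public'] == 1, True,
                           np.where(df['Public'] == 2, False, None))
print(df)
```

Because the result mixes booleans with None, the new column has object dtype rather than bool.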

