Notice: this page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/19913659/

Time: 2020-08-19 14:57:01  Source: igfitidea

Pandas conditional creation of a series/dataframe column

Tags: python, pandas, numpy, dataframe

Asked by user7289

I have a dataframe along the lines of the below:

    Type       Set
1    A          Z
2    B          Z           
3    B          X
4    C          Y

I want to add another column to the dataframe (or generate a series) of the same length as the dataframe (i.e. the same number of records/rows) which sets the colour to 'green' if Set == 'Z' and to 'red' otherwise.

What's the best way to do this?

Accepted answer by unutbu

If you only have two choices to select from:

df['color'] = np.where(df['Set']=='Z', 'green', 'red')

For example,

import pandas as pd
import numpy as np

df = pd.DataFrame({'Type':list('ABBC'), 'Set':list('ZZXY')})
df['color'] = np.where(df['Set']=='Z', 'green', 'red')
print(df)

yields

  Set Type  color
0   Z    A  green
1   Z    B  green
2   X    B    red
3   Y    C    red


If you have more than two conditions then use np.select. For example, if you want color to be

  • yellow when (df['Set'] == 'Z') & (df['Type'] == 'A')
  • otherwise blue when (df['Set'] == 'Z') & (df['Type'] == 'B')
  • otherwise purple when (df['Type'] == 'B')
  • otherwise black,

then use

df = pd.DataFrame({'Type':list('ABBC'), 'Set':list('ZZXY')})
conditions = [
    (df['Set'] == 'Z') & (df['Type'] == 'A'),
    (df['Set'] == 'Z') & (df['Type'] == 'B'),
    (df['Type'] == 'B')]
choices = ['yellow', 'blue', 'purple']
df['color'] = np.select(conditions, choices, default='black')
print(df)

which yields

  Set Type   color
0   Z    A  yellow
1   Z    B    blue
2   X    B  purple
3   Y    C   black
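
One detail worth spelling out is that np.select evaluates the conditions in order and takes the choice of the first condition that is True, which is why the more specific conditions are listed before the broad one. The following is a minimal sketch of that behaviour on the same example frame; the variable names specific_first and broad_first are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Type': list('ABBC'), 'Set': list('ZZXY')})

# Conditions are checked in order; the first True one decides the choice.
specific_first = np.select(
    [(df['Set'] == 'Z') & (df['Type'] == 'B'), df['Type'] == 'B'],
    ['blue', 'purple'],
    default='black',
)

# With the broad condition first, the more specific one is never reached.
broad_first = np.select(
    [df['Type'] == 'B', (df['Set'] == 'Z') & (df['Type'] == 'B')],
    ['purple', 'blue'],
    default='black',
)

print(specific_first.tolist())  # ['black', 'blue', 'purple', 'black']
print(broad_first.tolist())     # ['black', 'purple', 'purple', 'black']
```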

Answer by acharuva

Another way in which this could be achieved is

# 'green' for 'Z' and 'red' otherwise, as the question asks
df['color'] = df['Set'].map(lambda x: 'green' if x == 'Z' else 'red')
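
A self-contained version of the map approach, as a minimal sketch. It reuses the example frame from the accepted answer and assigns the colours the way the question asks ('green' for 'Z', 'red' otherwise).

```python
import pandas as pd

df = pd.DataFrame({'Type': list('ABBC'), 'Set': list('ZZXY')})

# Series.map calls the lambda once per element of the 'Set' column.
df['color'] = df['Set'].map(lambda x: 'green' if x == 'Z' else 'red')

print(df['color'].tolist())  # ['green', 'green', 'red', 'red']
```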

Answer by cheekybastard

List comprehension is another way to create another column conditionally. If you are working with object dtypes in columns, like in your example, list comprehensions typically outperform most other methods.

Example list comprehension:

# 'green' for 'Z' and 'red' otherwise, as the question asks
df['color'] = ['green' if x == 'Z' else 'red' for x in df['Set']]

%timeit tests:

import pandas as pd
import numpy as np

df = pd.DataFrame({'Type':list('ABBC'), 'Set':list('ZZXY')})
%timeit df['color'] = ['red' if x == 'Z' else 'green' for x in df['Set']]
%timeit df['color'] = np.where(df['Set']=='Z', 'green', 'red')
%timeit df['color'] = df.Set.map( lambda x: 'red' if x == 'Z' else 'green')

1000 loops, best of 3: 239 μs per loop
1000 loops, best of 3: 523 μs per loop
1000 loops, best of 3: 263 μs per loop
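
The timings above were taken on a four-row frame, and the ranking may well change with the size of the data. The sketch below shows one way to repeat the comparison outside IPython with the standard timeit module; the 100,000-row frame big and the candidates dictionary are assumptions made for illustration, and no output is shown because the numbers depend on the machine and library versions.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical larger frame: the four-row example repeated 25,000 times.
big = pd.DataFrame({'Type': list('ABBC') * 25_000, 'Set': list('ZZXY') * 25_000})

candidates = {
    'list comprehension': lambda: ['green' if x == 'Z' else 'red' for x in big['Set']],
    'np.where': lambda: np.where(big['Set'] == 'Z', 'green', 'red'),
    'Series.map': lambda: big['Set'].map(lambda x: 'green' if x == 'Z' else 'red'),
}

for name, func in candidates.items():
    # Average wall-clock time over 10 calls of each candidate.
    seconds = timeit.timeit(func, number=10) / 10
    print(f'{name}: {seconds * 1e3:.2f} ms per call')
```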

Answer by bli

The following is slower than the approaches timed here, but we can compute the extra column based on the contents of more than one column, and more than two values can be computed for the extra column.

Simple example using just the "Set" column:

def set_color(row):
    if row["Set"] == "Z":
        return "red"
    else:
        return "green"

df = df.assign(color=df.apply(set_color, axis=1))

print(df)
  Set Type  color
0   Z    A    red
1   Z    B    red
2   X    B  green
3   Y    C  green

Example with more colours and more columns taken into account:

def set_color(row):
    if row["Set"] == "Z":
        return "red"
    elif row["Type"] == "C":
        return "blue"
    else:
        return "green"

df = df.assign(color=df.apply(set_color, axis=1))

print(df)
  Set Type  color
0   Z    A    red
1   Z    B    red
2   X    B  green
3   Y    C   blue
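
For comparison, the same two-column rule can also be written without a row-wise apply. This is a sketch rather than part of the answer above: it borrows np.select from the accepted answer and checks that both versions agree on the example frame.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Type': list('ABBC'), 'Set': list('ZZXY')})

def set_color(row):
    if row['Set'] == 'Z':
        return 'red'
    elif row['Type'] == 'C':
        return 'blue'
    else:
        return 'green'

# Row-wise version: set_color is called once per row.
row_wise = df.apply(set_color, axis=1)

# np.select checks the conditions in order, like the if/elif chain.
vectorized = np.select(
    [df['Set'] == 'Z', df['Type'] == 'C'],
    ['red', 'blue'],
    default='green',
)

print(row_wise.tolist())    # ['red', 'red', 'green', 'blue']
print(vectorized.tolist())  # ['red', 'red', 'green', 'blue']
```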

Edit (21/06/2019): Using plydata

It is also possible to use plydata to do this kind of thing (this seems even slower than using assign and apply, though).

from plydata import define, if_else

Simple if_else:

df = define(df, color=if_else('Set=="Z"', '"red"', '"green"'))

print(df)
  Set Type  color
0   Z    A    red
1   Z    B    red
2   X    B  green
3   Y    C  green

Nested if_else:

df = define(df, color=if_else(
    'Set=="Z"',
    '"red"',
    if_else('Type=="C"', '"green"', '"blue"')))

print(df)                            
  Set Type  color
0   Z    A    red
1   Z    B    red
2   X    B   blue
3   Y    C  green

Answer by blacksite

Here's yet another way to skin this cat, using a dictionary to map new values onto the keys in the list:

def map_values(row, values_dict):
    return values_dict[row]

values_dict = {'A': 1, 'B': 2, 'C': 3, 'D': 4}

df = pd.DataFrame({'INDICATOR': ['A', 'B', 'C', 'D'], 'VALUE': [10, 9, 8, 7]})

df['NEW_VALUE'] = df['INDICATOR'].apply(map_values, args = (values_dict,))

Here is what it looks like:

df
Out[2]: 
  INDICATOR  VALUE  NEW_VALUE
0         A     10          1
1         B      9          2
2         C      8          3
3         D      7          4

This approach can be very powerful when you have many if-else-type statements to make (i.e. many unique values to replace).

And of course you could always do this:

df['NEW_VALUE'] = df['INDICATOR'].map(values_dict)

But that approach is more than three times as slow as the apply approach from above, on my machine.

And you could also do this, using dict.get:

df['NEW_VALUE'] = [values_dict.get(v, None) for v in df['INDICATOR']]
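
The three variants differ in how they treat a value that has no entry in the dictionary. A small sketch of that difference follows; the indicator 'E' and the fallback value -1 are hypothetical and chosen only for illustration.

```python
import pandas as pd

values_dict = {'A': 1, 'B': 2, 'C': 3, 'D': 4}
# 'E' is a hypothetical indicator with no entry in values_dict.
df = pd.DataFrame({'INDICATOR': ['A', 'B', 'E']})

# Series.map leaves NaN where the key is missing (the result becomes float).
mapped = df['INDICATOR'].map(values_dict)

# dict.get lets you pick the fallback value yourself.
with_default = [values_dict.get(v, -1) for v in df['INDICATOR']]

# Plain indexing inside apply raises KeyError for the missing key.
try:
    df['INDICATOR'].apply(lambda v: values_dict[v])
    raised = False
except KeyError:
    raised = True

print(mapped.tolist())  # [1.0, 2.0, nan]
print(with_default)     # [1, 2, -1]
print(raised)           # True
```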

Answer by Hossein

This may only have become possible with newer versions of Pandas, but I think the following is the shortest and maybe the best answer to the question so far. You can use .loc with one condition or several, depending on your needs.

Code Summary:

df=pd.DataFrame(dict(Type='A B B C'.split(), Set='Z Z X Y'.split()))
df['Color'] = "red"
df.loc[(df['Set']=="Z"), 'Color'] = "green"

#practice!
df.loc[(df['Set']=="Z")&(df['Type']=="B")|(df['Type']=="C"), 'Color'] = "purple"

Explanation:

df=pd.DataFrame(dict(Type='A B B C'.split(), Set='Z Z X Y'.split()))

# df so far: 
  Type Set  
0    A   Z 
1    B   Z 
2    B   X 
3    C   Y

Add a 'Color' column and set all values to "red":

df['Color'] = "red"

Apply your single condition:

df.loc[(df['Set']=="Z"), 'Color'] = "green"


# df: 
  Type Set  Color
0    A   Z  green
1    B   Z  green
2    B   X    red
3    C   Y    red

or multiple conditions if you want:

df.loc[(df['Set']=="Z")&(df['Type']=="B")|(df['Type']=="C"), 'Color'] = "purple"

You can read more about Pandas logical operators and conditional selection here: "Logical operators for boolean indexing in Pandas".
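
When several conditions are combined, operator precedence matters: & binds more tightly than |, and both bind more tightly than ==, which is why each comparison is wrapped in its own parentheses. The sketch below shows how the grouping changes the selected rows; the names is_z, is_b, is_c, implicit and regrouped are only illustrative.

```python
import pandas as pd

df = pd.DataFrame(dict(Type='A B B C'.split(), Set='Z Z X Y'.split()))

is_z = df['Set'] == 'Z'
is_b = df['Type'] == 'B'
is_c = df['Type'] == 'C'

# & binds tighter than |, so this is (is_z & is_b) | is_c.
implicit = is_z & is_b | is_c
# Explicit parentheses change which rows are selected.
regrouped = is_z & (is_b | is_c)

print(implicit.tolist())   # [False, True, False, True]
print(regrouped.tolist())  # [False, True, False, False]
```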

Answer by Jaroslav Bezděk

A one-liner using the .apply() method is the following:

df['color'] = df['Set'].apply(lambda set_: 'green' if set_=='Z' else 'red')

After that, the df data frame looks like this:

>>> print(df)
  Type Set  color
0    A   Z  green
1    B   Z  green
2    B   X    red
3    C   Y    red

Answer by Yaakov Bressler

If you're working with massive data, a memoized approach would be best:

# First create a dictionary of manually stored values
color_dict = {'Z':'red'}

# Second, build a dictionary of "other" values
color_dict_other = {x:'green' for x in df['Set'].unique() if x not in color_dict.keys()}

# Next, merge the two
color_dict.update(color_dict_other)

# Finally, map it to your column
df['color'] = df['Set'].map(color_dict)

This approach will be fastest when you have many repeated values. My general rule of thumb is to memoize when data_size > 10**4 and n_distinct < data_size/4.

E.g., memoize in a case with 10,000 rows and 2,500 or fewer distinct values.
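
To make the saving concrete, here is a minimal sketch that counts how often a per-value function runs when it is evaluated once per distinct value and then mapped. The function slow_color and the list call_log are hypothetical stand-ins for an expensive computation, and the colours follow the question ('green' for 'Z').

```python
import pandas as pd

call_log = []

def slow_color(value):
    # Hypothetical stand-in for an expensive per-value computation.
    call_log.append(value)
    return 'green' if value == 'Z' else 'red'

# 10,000 rows but only 3 distinct values in 'Set'.
df = pd.DataFrame({'Set': list('ZZXY') * 2_500})

# Evaluate the function once per distinct value, then look the results up.
lookup = {value: slow_color(value) for value in df['Set'].unique()}
df['color'] = df['Set'].map(lookup)

print(len(call_log))                 # 3
print(df['color'].head(4).tolist())  # ['green', 'green', 'red', 'red']
```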
