Python: Why isn't my Pandas 'apply' function referencing multiple columns working?

Disclaimer: this page reproduces a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must follow the same CC BY-SA license, cite the original address and author information, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/16353729/

Date: 2020-08-18 22:22:43  Source: igfitidea

Why isn't my Pandas 'apply' function referencing multiple columns working?

Tags: python, python-2.7, pandas, dataframe, apply

Asked by Andy

I have some problems with the Pandas apply function when using multiple columns with the following dataframe:

import numpy as np
from pandas import DataFrame

df = DataFrame({'a' : np.random.randn(6),
                'b' : ['foo', 'bar'] * 3,
                'c' : np.random.randn(6)})

and the following function:

def my_test(a, b):
    return a % b

When I try to apply this function with:

df['Value'] = df.apply(lambda row: my_test(row[a], row[c]), axis=1)

I get the error message:

NameError: ("global name 'a' is not defined", u'occurred at index 0')

I do not understand this message; I believe I defined the name properly.

I would highly appreciate any help on this issue.

Update

Thanks for your help. I did indeed make some syntax mistakes in the code: the column names should be put in quotes (''). However, I still get the same issue when using a more complex function such as:

def my_test(a):
    cum_diff = 0
    for ix in df.index():
        cum_diff = cum_diff + (a - df['a'][ix])
    return cum_diff 
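The thread never resolves this follow-up, so the following is only a guess at one likely problem: `df.index` is an attribute rather than a method, so calling `df.index()` raises a TypeError. A minimal sketch, assuming a small hand-made dataframe and two hypothetical helper names (`cum_diff_loop`, `cum_diff_vectorized`) that are not in the original post:

```python
import numpy as np
import pandas as pd

# Small illustrative dataframe (values chosen by hand, not from the question)
df = pd.DataFrame({'a': [1.0, 2.0, 4.0],
                   'b': ['foo', 'bar', 'foo'],
                   'c': [0.5, 1.5, 2.5]})

def cum_diff_loop(a):
    # Same idea as my_test above, but iterating over df.index (no parentheses)
    cum_diff = 0
    for ix in df.index:
        cum_diff = cum_diff + (a - df['a'][ix])
    return cum_diff

def cum_diff_vectorized(col):
    # The sum over i of (a - a_i) equals n * a - sum(a_i)
    return len(col) * col - col.sum()

loop_result = df['a'].apply(cum_diff_loop)
vec_result = cum_diff_vectorized(df['a'])
print(loop_result.tolist())  # [-4.0, -1.0, 5.0]
```

Both versions give the same numbers here; the second avoids the Python-level loop entirely.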

Accepted answer by waitingkuo

It seems you forgot the quotes ('') around your column names.

In [43]: df['Value'] = df.apply(lambda row: my_test(row['a'], row['c']), axis=1)

In [44]: df
Out[44]:
                    a    b         c     Value
          0 -1.674308  foo  0.343801  0.044698
          1 -2.163236  bar -2.046438 -0.116798
          2 -0.199115  foo -0.458050 -0.199115
          3  0.918646  bar -0.007185 -0.001006
          4  1.336830  foo  0.534292  0.268245
          5  0.976844  bar -0.773630 -0.570417
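To make the cause of the NameError concrete, here is a minimal sketch using a small hand-made dataframe (the values are illustrative assumptions, not the question's random data). Without quotes, Python looks for variables named `a` and `c`; with quotes, the strings are used as column labels within each row:

```python
import pandas as pd

df = pd.DataFrame({'a': [5.0, 7.5, -3.0],
                   'b': ['foo', 'bar', 'foo'],
                   'c': [2.0, 2.0, 2.0]})

def my_test(a, b):
    return a % b

# Unquoted: row[a] refers to a variable named a, which is not defined
try:
    df.apply(lambda row: my_test(row[a], row[c]), axis=1)
    error_name = None
except NameError as exc:
    error_name = type(exc).__name__

# Quoted: 'a' and 'c' are looked up as column labels in each row
df['Value'] = df.apply(lambda row: my_test(row['a'], row['c']), axis=1)
print(error_name)            # NameError
print(df['Value'].tolist())  # [1.0, 1.5, 1.0]
```

Note that the result of `%` takes the sign of the divisor, which is why -3.0 % 2.0 gives 1.0.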

BTW, in my opinion, the following way is more elegant:

In [53]: def my_test2(row):
....:     return row['a'] % row['c']
....:     

In [54]: df['Value'] = df.apply(my_test2, axis=1)

Answer by herrfz

If you just want to compute (column a) % (column c), you don't need apply; just do it directly:

In [7]: df['a'] % df['c']
Out[7]:
0   -1.132022
1   -0.939493
2    0.201931
3    0.511374
4   -0.694647
5   -0.023486
Name: a
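As a quick sanity check that the direct form matches the row-wise apply, here is a small sketch. The seeded generator is an assumption added only to make the run repeatable; the question itself uses unseeded `np.random.randn`:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seed chosen arbitrarily for repeatability
df = pd.DataFrame({'a': rng.standard_normal(6),
                   'b': ['foo', 'bar'] * 3,
                   'c': rng.standard_normal(6)})

# Row-wise: the lambda is called once per row
row_wise = df.apply(lambda row: row['a'] % row['c'], axis=1)

# Column-wise: one operation on the two whole columns
column_wise = df['a'] % df['c']

same = np.allclose(row_wise, column_wise)
print(same)
```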

Answer by Mir_Murtaza

Let's say we want to apply a function add5 to columns 'a' and 'b' of a DataFrame df:

def add5(x):
    return x+5

df[['a', 'b']].apply(add5)
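A runnable sketch of this pattern follows. In the question's dataframe column 'b' holds strings, so adding 5 to it would be expected to fail; this example therefore assumes the numeric columns 'a' and 'c' with hand-picked values:

```python
import pandas as pd

df = pd.DataFrame({'a': [1.0, 2.0, 3.0],
                   'b': ['foo', 'bar', 'foo'],
                   'c': [10.0, 20.0, 30.0]})

def add5(x):
    return x + 5

# Without axis=1, apply passes each selected column to add5 as a Series
shifted = df[['a', 'c']].apply(add5)

# The same result without apply
shifted_direct = df[['a', 'c']] + 5
print(shifted)
```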

Answer by Blane

All of the suggestions above work, but if you want your computations to be more efficient, you should take advantage of numpy vector operations.

import pandas as pd
import numpy as np


df = pd.DataFrame ({'a' : np.random.randn(6),
             'b' : ['foo', 'bar'] * 3,
             'c' : np.random.randn(6)})

Example 1: looping with pandas.apply():

%%timeit
def my_test2(row):
    return row['a'] % row['c']

df['Value'] = df.apply(my_test2, axis=1)

The slowest run took 7.49 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 481 μs per loop

Example 2: vectorize using pandas Series:

%%timeit
df['a'] % df['c']

The slowest run took 458.85 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 70.9 μs per loop

Example 3: vectorize using numpy arrays:

%%timeit
df['a'].values % df['c'].values

The slowest run took 7.98 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 6.39 μs per loop

So vectorizing using numpy arrays improved the speed by almost two orders of magnitude.
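The three examples can also be compared outside IPython with the standard `timeit` module. This is a rough sketch only: the wrapper names (`with_apply`, `with_series`, `with_numpy`) and the repeat counts are assumptions, and the absolute numbers will vary with machine and library versions:

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.random.randn(6),
                   'b': ['foo', 'bar'] * 3,
                   'c': np.random.randn(6)})

def with_apply():
    # Example 1: row-wise apply
    return df.apply(lambda row: row['a'] % row['c'], axis=1)

def with_series():
    # Example 2: pandas Series arithmetic
    return df['a'] % df['c']

def with_numpy():
    # Example 3: arithmetic on the underlying numpy arrays
    return df['a'].values % df['c'].values

timings = {}
for func in (with_apply, with_series, with_numpy):
    # Best of 3 repeats, 100 calls each, stored as seconds per call
    best = min(timeit.repeat(func, number=100, repeat=3)) / 100
    timings[func.__name__] = best
    print(f"{func.__name__}: {best * 1e6:.1f} µs per call")
```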

Answer by shaurya airi

This is the same as the previous solution, but I have defined the function inside df.apply itself:

df['Value'] = df.apply(lambda row: row['a']%row['c'], axis=1)

Answer by Gursewak Singh

Here is a timing comparison of all three approaches discussed above.

Using values

%timeit df['value'] = df['a'].values % df['c'].values

139 μs ± 1.91 μs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Without values

%timeit df['value'] = df['a']%df['c'] 

216 μs ± 1.86 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Apply function

%timeit df['Value'] = df.apply(lambda row: row['a']%row['c'], axis=1)

474 μs ± 5.07 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
