Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/16353729/
Why isn't my Pandas 'apply' function referencing multiple columns working?
Asked by Andy
I have some problems with the Pandas apply function when using multiple columns with the following dataframe:
df = DataFrame({'a': np.random.randn(6),
                'b': ['foo', 'bar'] * 3,
                'c': np.random.randn(6)})
and the following function:
def my_test(a, b):
    return a % b
When I try to apply this function with:
df['Value'] = df.apply(lambda row: my_test(row[a], row[c]), axis=1)
I get the error message:
NameError: ("global name 'a' is not defined", u'occurred at index 0')
I do not understand this message; I thought I had defined the name properly.
I would highly appreciate any help on this issue.
Update
Thanks for your help. I did indeed make some syntax mistakes in the code: the column names used as the index should be put in quotes (''). However, I still get the same issue using a more complex function such as:
def my_test(a):
    cum_diff = 0
    for ix in df.index():
        cum_diff = cum_diff + (a - df['a'][ix])
    return cum_diff
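One thing that may be worth checking in the updated function: `df.index` is an attribute, not a method, so calling `df.index()` would be expected to raise a TypeError rather than loop. The sketch below is only an illustration under assumptions: the small frame is made up, and the helper name `cum_diff_from` is hypothetical. It runs the same loop without the parentheses and compares it with an equivalent closed form.

```python
import numpy as np
import pandas as pd

# Made-up data for illustration; not the random frame from the question.
df = pd.DataFrame({'a': [1.0, 2.0, 4.0]})

def cum_diff_from(a, frame):
    """Sum of (a - x) over every x in frame['a'] (hypothetical helper)."""
    cum_diff = 0.0
    for ix in frame.index:              # attribute access, no parentheses
        cum_diff += a - frame['a'].loc[ix]
    return cum_diff

# Apply the scalar function to each value of column 'a'.
looped = df['a'].apply(lambda a: cum_diff_from(a, df))

# The same quantity without a Python loop: n * a - sum(column).
vectorized = len(df) * df['a'] - df['a'].sum()

print(looped.tolist())
print(vectorized.tolist())
```

Passing the frame in as an argument, instead of reading a global `df` inside the function, is a design choice made here so the helper can be tested in isolation.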
Accepted answer by waitingkuo
It seems you forgot the quotes ('') around your strings.
In [43]: df['Value'] = df.apply(lambda row: my_test(row['a'], row['c']), axis=1)
In [44]: df
Out[44]:
          a    b         c     Value
0 -1.674308  foo  0.343801  0.044698
1 -2.163236  bar -2.046438 -0.116798
2 -0.199115  foo -0.458050 -0.199115
3  0.918646  bar -0.007185 -0.001006
4  1.336830  foo  0.534292  0.268245
5  0.976844  bar -0.773630 -0.570417
BTW, in my opinion, the following way is more elegant:
In [53]: def my_test2(row):
   ....:     return row['a'] % row['c']
   ....:
In [54]: df['Value'] = df.apply(my_test2, axis=1)
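To make the cause of the NameError concrete, here is a minimal sketch on a small made-up frame (the values are assumptions chosen so the results are easy to check by hand). Writing `row[a]` asks Python to look up a variable named `a`, whereas `row['a']` looks up the column label.

```python
import pandas as pd

# Made-up data for illustration; not the random frame from the question.
df = pd.DataFrame({'a': [7.0, 8.0, 9.0], 'c': [2.0, 3.0, 4.0]})

def my_test(a, b):
    return a % b

# Unquoted: Python searches for variables named a and c, which do not exist.
caught = False
try:
    df.apply(lambda row: my_test(row[a], row[c]), axis=1)
except NameError as exc:
    caught = True
    print("NameError:", exc)

# Quoted: 'a' and 'c' are column labels looked up on each row.
value = df.apply(lambda row: my_test(row['a'], row['c']), axis=1)
print(value.tolist())  # [1.0, 2.0, 1.0]
```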
Answer by herrfz
If you just want to compute (column a) % (column c), you don't need apply; just do it directly:
In [7]: df['a'] % df['c']
Out[7]:
0 -1.132022
1 -0.939493
2 0.201931
3 0.511374
4 -0.694647
5 -0.023486
Name: a
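As a quick check that the direct expression and the row-wise apply give the same numbers, here is a small sketch on made-up values (an assumption, chosen so the results can be verified by hand). It also shows that the result of `%` takes the sign of the divisor, as in plain Python.

```python
import numpy as np
import pandas as pd

# Made-up data for illustration; 'b' is kept only to mirror the question's layout.
df = pd.DataFrame({'a': [-1.5, 2.5, 7.0],
                   'b': ['foo', 'bar', 'foo'],
                   'c': [1.0, -2.0, 4.0]})

# Column-wise arithmetic: one operation over whole columns.
direct = df['a'] % df['c']

# Row-wise apply: the same arithmetic, one Python call per row.
row_wise = df.apply(lambda row: row['a'] % row['c'], axis=1).astype(float)

print(direct.tolist())  # [0.5, -1.5, 3.0]
print(np.allclose(direct, row_wise))
```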
Answer by Mir_Murtaza
Let's say we want to apply a function add5 to columns 'a' and 'b' of DataFrame df:
def add5(x):
    return x + 5
df[['a', 'b']].apply(add5)
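Note that this pattern assumes the selected columns are numeric. In the question's frame, column 'b' holds strings, so adding 5 to it would be expected to fail. A minimal sketch on made-up data, selecting the numeric columns 'a' and 'c' instead (that substitution is an assumption made for illustration):

```python
import pandas as pd

# Made-up data for illustration, laid out like the question's frame.
df = pd.DataFrame({'a': [1.0, 2.0],
                   'b': ['foo', 'bar'],
                   'c': [10.0, 20.0]})

def add5(x):
    return x + 5

# With the default axis=0, add5 receives each selected column as a Series.
shifted = df[['a', 'c']].apply(add5)

print(shifted['a'].tolist())  # [6.0, 7.0]
print(shifted['c'].tolist())  # [15.0, 25.0]
```

For a function this simple, `df[['a', 'c']] + 5` gives the same result without apply.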
Answer by Blane
All of the suggestions above work, but if you want your computations to be more efficient, you should take advantage of numpy vector operations (as pointed out here).
import pandas as pd
import numpy as np
df = pd.DataFrame({'a': np.random.randn(6),
                   'b': ['foo', 'bar'] * 3,
                   'c': np.random.randn(6)})
Example 1: looping with pandas.apply():
%%timeit
def my_test2(row):
    return row['a'] % row['c']
df['Value'] = df.apply(my_test2, axis=1)
The slowest run took 7.49 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 481 μs per loop
Example 2: vectorize using pandas Series:
%%timeit
df['a'] % df['c']
The slowest run took 458.85 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 70.9 μs per loop
Example 3: vectorize using numpy arrays:
%%timeit
df['a'].values % df['c'].values
The slowest run took 7.98 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 6.39 μs per loop
So vectorizing using numpy arrays improved the speed by almost two orders of magnitude compared with apply (481 μs versus 6.39 μs per loop, roughly 75 times faster).
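The `%%timeit` cells above only run in IPython or Jupyter. As a rough equivalent for a plain script, here is a sketch using the standard-library `timeit` module. The function names (`with_apply`, `with_series`, `with_arrays`), the seed, and the repeat count are assumptions made for illustration, and the absolute timings will differ by machine and library version.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the data is reproducible
df = pd.DataFrame({'a': rng.standard_normal(6),
                   'b': ['foo', 'bar'] * 3,
                   'c': rng.standard_normal(6)})

def with_apply():
    # Row-wise: one Python-level call per row.
    return df.apply(lambda row: row['a'] % row['c'], axis=1)

def with_series():
    # Vectorized on pandas Series (includes index alignment overhead).
    return df['a'] % df['c']

def with_arrays():
    # Vectorized on the underlying numpy arrays.
    return df['a'].values % df['c'].values

# All three should agree numerically before their speed is compared.
assert np.allclose(with_apply().astype(float), with_series())
assert np.allclose(with_series(), with_arrays())

repeats = 200
for fn in (with_apply, with_series, with_arrays):
    seconds = timeit.timeit(fn, number=repeats) / repeats
    print(f"{fn.__name__}: {seconds * 1e6:.1f} us per call")
```

With only six rows, fixed per-call overhead dominates these measurements, so the gap between the approaches may look different on larger frames.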
Answer by shaurya airi
This is the same as the previous solution, but I have defined the function inside df.apply itself:
df['Value'] = df.apply(lambda row: row['a']%row['c'], axis=1)
Answer by Gursewak Singh
Here is a timing comparison of all three approaches discussed above.
Using values
%timeit df['value'] = df['a'].values % df['c'].values
139 μs ± 1.91 μs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Without values
%timeit df['value'] = df['a']%df['c']
216 μs ± 1.86 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Apply function
%timeit df['Value'] = df.apply(lambda row: row['a']%row['c'], axis=1)
474 μs ± 5.07 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

