Original source: http://stackoverflow.com/questions/18641148/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
pandas: fill a column with some numpy arrays
Asked by Nic
I am using Python 2.7 and pandas 0.11.0.
I am trying to fill a column of a DataFrame using DataFrame.apply(func). The func() function is supposed to return a NumPy array (1x3).
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(4, 3), columns=list('ABC'))
print(df)
          A         B         C
0  0.910142  0.788300  0.114164
1 -0.603282 -0.625895  2.843130
2  1.823752 -0.091736 -0.107781
3  0.447743 -0.163605  0.514052
The function used for testing purposes:
def test(row):
    # some complex calc here
    # based on the values from different columns
    return np.array((1, 2, 3))

df['D'] = df.apply(test, axis=1)
[...]
ValueError: Wrong number of items passed 1, indices imply 3
The funny thing is that when I create the DataFrame from scratch, it works fine and returns the expected result:
dic = {'A': {0: 0.9, 1: -0.6, 2: 1.8, 3: 0.4},
       'C': {0: 0.1, 1: 2.8, 2: -0.1, 3: 0.5},
       'B': {0: 0.7, 1: -0.6, 2: -0.1, 3: -0.1},
       'D': {0: np.array((1, 2, 3)),
             1: np.array((1, 2, 3)),
             2: np.array((1, 2, 3)),
             3: np.array((1, 2, 3))}}
df = pd.DataFrame(dic)
print(df)
     A    B    C          D
0  0.9  0.7  0.1  [1, 2, 3]
1 -0.6 -0.6  2.8  [1, 2, 3]
2  1.8 -0.1 -0.1  [1, 2, 3]
3  0.4 -0.1  0.5  [1, 2, 3]
Thanks in advance
Answered by Viktor Kerkez
If the function passed to apply returns multiple values, and the DataFrame you call apply on has the same number of items along the axis (in this case, columns) as the number of values returned, pandas will create a DataFrame from the return values, using the same labels as the original DataFrame. You can see this if you run:
>>> def test(row):
...     return [1, 2, 3]
...
>>> df = pd.DataFrame(np.random.randn(4, 3), columns=list('ABC'))
>>> df.apply(test, axis=1)
   A  B  C
0  1  2  3
1  1  2  3
2  1  2  3
3  1  2  3
And that is why you get the error: you cannot assign a DataFrame to a single DataFrame column.
If you return any other number of values, apply returns just a Series object, which can be assigned:
>>> def test(row):
...     return [1, 2]
...
>>> df = pd.DataFrame(np.random.randn(4, 3), columns=list('ABC'))
>>> df.apply(test, axis=1)
0    [1, 2]
1    [1, 2]
2    [1, 2]
3    [1, 2]
>>> df['D'] = df.apply(test, axis=1)
>>> df
          A         B         C       D
0  0.333535  0.209745 -0.972413  [1, 2]
1  0.469590  0.107491 -1.248670  [1, 2]
2  0.234444  0.093290 -0.853348  [1, 2]
3  1.021356  0.092704 -0.406727  [1, 2]
I'm not sure why pandas does this, or why it does so only when the return value is a list or an ndarray; it does not happen if you return a tuple:
>>> def test(row):
...     return (1, 2, 3)
...
>>> df = pd.DataFrame(np.random.randn(4, 3), columns=list('ABC'))
>>> df['D'] = df.apply(test, axis=1)
>>> df
          A         B         C          D
0  0.121136  0.541198 -0.281972  (1, 2, 3)
1  0.569091  0.944344  0.861057  (1, 2, 3)
2 -1.742484 -0.077317  0.181656  (1, 2, 3)
3 -1.541244  0.174428  0.660123  (1, 2, 3)

