Python Pandas: Find the maximum for each row in a dataframe column containing a numpy array

Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/41108859/

Time: 2020-09-14 02:36:39  Source: igfitidea

Tags: python, pandas, numpy

Asked by Jannick

I have a Pandas DataFrame that looks like this:

      values                                      max_val_idx
0    np.array([-0.649626, -0.662434, -0.611351])            2
1    np.array([-0.994942, -0.990448, -1.01574])             1
2    np.array([-1.012, -1.01034, -1.02732])                 0

df['values'] contains numpy arrays with a fixed length of 3 elements
df['max_val_idx'] contains the index of the maximum value of the corresponding array

Since the index of the maximum element of each array is already given, what is the most efficient way to extract the maximum for each entry?
I know the data is stored in a somewhat silly way, but I didn't create it myself. And since I have a lot of data to process (roughly 50 GB, stored as several hundred pickled databases in a similar layout), I'd like to know the most time-efficient method.

So far I have tried looping over each element of df['max_val_idx'] and using it as an index into the corresponding array in df['values']:

max_val = []
for idx, values in enumerate(df['values']):
    # use the precomputed index of the maximum for this row
    max_val.append(values[int(df['max_val_idx'].iloc[idx])])

Is there any faster alternative to this?

Answered by JohnE

I would just forget the 'max_val_idx' column. I don't think it saves time, and it actually makes the syntax more painful. Sample data:

df = pd.DataFrame({ 'x': range(3) }).applymap( lambda x: np.random.randn(3) )

                                                   x
0  [-1.17106202376, -1.61211460669, 0.0198122724315]
1    [0.806819945736, 1.49139051675, -0.21434675401]
2  [-0.427272615966, 0.0939459129359, 0.496474566...

You could extract the max like this:

df.applymap( lambda x: x.max() )

          x  
0  0.019812
1  1.491391
2  0.496475
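
As a possible alternative to calling a Python lambda per cell, the arrays can be stacked into one 2-D array and reduced row-wise in NumPy. This is only a sketch: the sample data, the seed, and the names `stacked` and `row_max` are assumptions for illustration, and it relies on every array having the same length, as stated in the question.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: one column 'x' whose cells are length-3 arrays.
rng = np.random.default_rng(0)
df = pd.DataFrame({'x': [rng.standard_normal(3) for _ in range(3)]})

# Stack the per-row arrays into a single (n_rows, 3) array.
stacked = np.stack(df['x'].tolist())

# Reduce along axis 1 to get one maximum per row, keeping the original index.
row_max = pd.Series(stacked.max(axis=1), index=df.index, name='x_max')
print(row_max)
```

Whether this is actually faster than applymap depends on the data size, so it is worth timing both on a representative file.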

But generally speaking, life is easier if you have one number per cell. If each cell has an array of length 3, you could rearrange like this:

# copy element i of every array into its own column
for i, v in enumerate('abc'):
    df[v] = df.x.map(lambda x: x[i])
df = df[list('abc')]

          a         b         c
0 -1.171062 -1.612115  0.019812
1  0.806820  1.491391 -0.214347
2 -0.427273  0.093946  0.496475

And then do a standard pandas operation:

df.apply( max, axis=1 )

0    0.019812
1    1.491391
2    0.496475

Admittedly, this is not much easier than above, but overall the data will be much easier to work with in this form.
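
The rearrangement described above can also be sketched in one step by building a new DataFrame from the list of arrays. The rounded sample values and the names `wide` and `row_max` are assumptions for illustration, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: each cell of 'x' holds a length-3 array.
df = pd.DataFrame({'x': [np.array([-1.17, -1.61, 0.02]),
                         np.array([0.81, 1.49, -0.21]),
                         np.array([-0.43, 0.09, 0.50])]})

# Expand the arrays into three numeric columns, one number per cell.
wide = pd.DataFrame(df['x'].tolist(), columns=list('abc'), index=df.index)

# With one number per cell, the built-in row-wise reduction applies directly.
row_max = wide.max(axis=1)
print(wide)
print(row_max)
```

Using wide.max(axis=1) instead of apply(max, axis=1) keeps the reduction inside pandas rather than calling Python's max once per row.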

Answered by Scott Colby

I don't know how the speed of this will compare, since I'm constructing a 2D matrix of all the rows, but here's a possible solution:

>>> np.choose(df['max_val_idx'], np.array(df['values'].tolist()).T)
0   -0.611351
1   -0.990448
2   -1.012000
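
A similar lookup can be sketched with NumPy integer indexing instead of np.choose. This is a swapped-in technique, not the answerer's method. The data below is copied from the question, and the names `stacked`, `idx`, `picked` and `result` are assumptions for illustration. The lookup simply trusts the stored indices and returns whichever element each one points to.

```python
import numpy as np
import pandas as pd

# Data as shown in the question.
df = pd.DataFrame({
    'values': [np.array([-0.649626, -0.662434, -0.611351]),
               np.array([-0.994942, -0.990448, -1.01574]),
               np.array([-1.012, -1.01034, -1.02732])],
    'max_val_idx': [2, 1, 0],
})

# Build an (n_rows, 3) array from the column of fixed-length arrays.
stacked = np.array(df['values'].tolist())

# Pair each row number with its stored column index and pick those elements.
idx = df['max_val_idx'].to_numpy().astype(int)
picked = stacked[np.arange(len(df)), idx]

result = pd.Series(picked, index=df.index)
print(result)
```

Like the np.choose version, this builds the full 2-D array in memory first, so for very large files it may be worth processing one pickled file at a time.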