Python: calculating the mean of selected rows for selected columns in a pandas DataFrame

Notice: this page reproduces a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/36454494/

Time: 2020-08-19 17:53:38  Source: igfitidea

Calculate mean for selected rows for selected columns in pandas data frame

python pandas

Asked by impossible

I have a pandas df with, say, 100 rows and 10 columns (the actual data is huge). I also have a row_index list that specifies which rows to include when taking the mean. I want to calculate the mean on, say, columns 2, 5, 6, 7 and 8. Can this be done with some function of the DataFrame object?

What I know how to do is write a for loop, get the row values for each element in row_index, and keep a running mean. Is there a direct function to which we can pass row_list, column_list and axis, for example df.meanAdvance(row_list,column_list,axis=0)?

I have seen DataFrame.mean(), but I don't think it helps here.

  a b c d q 
0 1 2 3 0 5
1 1 2 3 4 5
2 1 1 1 6 1
3 1 0 0 0 0

I want the mean of rows 0, 2, 3 for each of columns a, b, d:

  a b d
0 1 1 2

Accepted answer by PdevG

To select rows of your dataframe by position you can use iloc. You can then select the columns you want using square brackets.

For example:

import pandas as pd
df = pd.DataFrame(data=[[1, 2, 3]] * 5, index=range(3, 8), columns=['a', 'b', 'c'])

This gives the following dataframe:

   a  b  c
3  1  2  3
4  1  2  3
5  1  2  3
6  1  2  3
7  1  2  3

To select only the third and fifth rows, you can do:

df.iloc[[2,4]]

which returns:

   a  b  c
5  1  2  3
7  1  2  3
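
Note that iloc selects by position, not by index label, which is why positions 2 and 4 return the rows labelled 5 and 7. A minimal sketch of the difference on the same example frame, using loc as the label-based counterpart (the variable names by_position and by_label are just illustrative):

```python
import pandas as pd

df = pd.DataFrame(data=[[1, 2, 3]] * 5, index=range(3, 8), columns=['a', 'b', 'c'])

# Position-based: the 3rd and 5th rows, whatever their labels are
by_position = df.iloc[[2, 4]]

# Label-based: the rows whose index labels are 5 and 7
by_label = df.loc[[5, 7]]

# Both selections pick the same two rows of this frame
print(list(by_position.index), list(by_label.index))
```

If your row_index list holds index labels rather than positions, loc is probably the accessor you want.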

If you then want to select only columns b and c, use the following command:

df[['b', 'c']].iloc[[2,4]]

which yields:

   b  c
5  2  3
7  2  3

To get the mean of this subset of your dataframe, you can use the df.mean function. If you want the mean of each column, specify axis=0; if you want the mean of each row, specify axis=1.

Thus:

df[['b', 'c']].iloc[[2,4]].mean(axis=0)

returns:

b    2.0
c    3.0
dtype: float64

As we should expect from the input dataframe.

For your code you can then do:

 df[column_list].iloc[row_index_list].mean(axis=0)
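
If you want something shaped like the df.meanAdvance(row_list, column_list, axis) call imagined in the question, the one-liner can be wrapped in a small function. This is only a sketch: mean_of_selection is a hypothetical helper name, not a pandas function, and it assumes the rows are given as positions and the columns as names.

```python
import pandas as pd

def mean_of_selection(df, row_positions, columns, axis=0):
    """Mean over the given row positions and column names (hypothetical helper)."""
    return df[columns].iloc[row_positions].mean(axis=axis)

# Data from the question
df = pd.DataFrame(
    [[1, 2, 3, 0, 5], [1, 2, 3, 4, 5], [1, 1, 1, 6, 1], [1, 0, 0, 0, 0]],
    columns=['a', 'b', 'c', 'd', 'q'],
)

# Mean of rows 0, 2, 3 for columns a, b, d
result = mean_of_selection(df, [0, 2, 3], ['a', 'b', 'd'])
print(result)
```

Passing axis=1 instead would give one mean per selected row, computed across the chosen columns.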

EDIT after comment. New question from the comments: I have to store these means in another df/matrix. I have lists L1, L2, L3, L4...LX, each of which gives the row indices whose mean I need for the columns C[1, 2, 3]. For example, L1 = [0, 2, 3] means I need the mean of rows 0, 2 and 3, stored in the 1st row of a new df/matrix. Then for L2 = [1, 4] I again calculate the mean and store it in the 2nd row of the new df/matrix, and so on up to LX. I want the new df to have X rows and len(C) columns; the columns stay the same for L1..LX. Could you help me with this?

Answer:

If I understand correctly, the following code should do the trick (same df as above; as columns I took 'a' and 'b').

First, loop over all the lists of rows, collecting each set of means as a pd.Series. Then concatenate the resulting list of Series along axis=1 and take the transpose to get the right shape:

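# L holds the row-position lists, e.g. L1 and L2 from the comment
L = [[0, 2, 3], [1, 4]]

dfs = []
for rows in L:
    # Column means of 'a' and 'b' over the selected rows (a Series)
    dfs.append(df[['a', 'b']].iloc[rows].mean(axis=0))

# Each Series becomes a column; transpose so each row list gives one row
mean_matrix = pd.concat(dfs, axis=1).T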
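
To keep track of which row of the result belongs to which list, the keys argument of pd.concat can label each Series before the transpose. A sketch under assumptions: the dictionary row_lists and the labels 'L1' and 'L2' are illustrative names taken from the comment, and the frame is the example df from this answer.

```python
import pandas as pd

df = pd.DataFrame(data=[[1, 2, 3]] * 5, index=range(3, 8), columns=['a', 'b', 'c'])

# Hypothetical container for the row-position lists from the comment
row_lists = {'L1': [0, 2, 3], 'L2': [1, 4]}

# One Series of column means per row list
means = [df[['a', 'b']].iloc[rows].mean(axis=0) for rows in row_lists.values()]

# keys= labels each Series; after the transpose they become the row labels
mean_matrix = pd.concat(means, axis=1, keys=list(row_lists.keys())).T
print(mean_matrix)
```

The result has one row per list and one column per selected column, i.e. X rows and len(C) columns as requested.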

Answer by mfitzp

You can select specific columns from a DataFrame by passing a list of indices to .iloc, for example:

df.iloc[:, [2,5,6,7,8]]

This will return a DataFrame containing those numbered columns (note: this uses 0-based indexing, so 2 refers to the 3rd column).

To take the mean down each of those columns, you could use:

# Mean along 0 (vertical) axis: return mean for specified columns, calculated across all rows
df.iloc[:, [2,5,6,7,8]].mean(axis=0) 

To take the mean across those columns (one value per row), you could use:

# Mean along 1 (horizontal) axis: return mean for each row, calculated across specified columns
df.iloc[:, [2,5,6,7,8]].mean(axis=1)
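
As a quick sanity check of the two axis settings, here is a sketch on a made-up two-column frame (the column names x and y and the values are illustrative assumptions, not data from the question):

```python
import pandas as pd

small = pd.DataFrame({'x': [1, 2, 3], 'y': [10, 20, 30]})

# axis=0 collapses the rows: one mean per column (x -> 2.0, y -> 20.0)
col_means = small.mean(axis=0)

# axis=1 collapses the columns: one mean per row (5.5, 11.0, 16.5)
row_means = small.mean(axis=1)

print(col_means)
print(row_means)
```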

You can also supply specific indices for both axes to return a subset of the table:

df.iloc[[1,2,3,4], [2,5,6,7,8]]

For your specific example, you would do:

import pandas as pd
import numpy as np

df = pd.DataFrame( 
np.array([[1,2,3,0,5],[1,2,3,4,5],[1,1,1,6,1],[1,0,0,0,0]]),
columns=["a","b","c","d","q"],
index = [0,1,2,3]
)

# I want the mean of rows 0, 2, 3 for each of columns a, b, d:
#   a b d
# 0 1 1 2

df.iloc[ [0,2,3], [0,1,3] ].mean(axis=0)

Which outputs:

a    1.0
b    1.0
d    2.0
dtype: float64

Alternatively, to access via column names, first select on those:

df[ ['a','b','d'] ].iloc[ [0,2,3] ].mean(axis=0)
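
Another option, if the rows are identified by index label rather than position, is a single .loc call that takes the row labels and the column names together. A minimal sketch on the question's data, assuming the default 0..3 index so that labels and positions coincide:

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 2, 3, 0, 5], [1, 2, 3, 4, 5], [1, 1, 1, 6, 1], [1, 0, 0, 0, 0]],
    columns=['a', 'b', 'c', 'd', 'q'],
)

# .loc takes row labels and column names in one call; the row labels here
# equal the row positions only because the index is the default 0..3
means = df.loc[[0, 2, 3], ['a', 'b', 'd']].mean(axis=0)
print(means)
```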

To answer the second part of your question (from the comments), you can join multiple results together using pd.concat. It is faster to accumulate them in a list and then pass the list to pd.concat in one go, e.g.

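idxs = [[0, 2, 3], [1, 3]]  # hypothetical row-position lists, one per output row

dfs = []
for ix in idxs:
    # The rows vary per list; the column positions stay fixed
    dfm = df.iloc[ix, [0, 1, 3]].mean(axis=0)
    dfs.append(dfm)

# Stack the Series side by side, then transpose so each list becomes one row
dfm_summary = pd.concat(dfs, axis=1).T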