pandas: mean of certain columns

Note: this content comes from a popular Stack Overflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): Stack Overflow. Original URL: http://stackoverflow.com/questions/39317702/

Time: 2020-09-14 01:57:29  Source: igfitidea

Pandas Mean for Certain Column

python pandas numpy

Asked by Keithx

I have a pandas DataFrame like this:

[image of the DataFrame from the original question; not preserved]

How can I calculate the mean (or min/max, median) of a specific column for the rows where Cluster==1 or Cluster==2?

Thanks!

Answered by Yaron

You can create a new DataFrame that contains only the relevant rows:

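# keep only the rows whose cluster is 1 or 2
newdf = df[df['cluster'].isin([1, 2])]

# column-wise means (axis=0); axis=1 would average across each row instead
newdf.mean(axis=0)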

To compute the mean of a specific column:

newdf["page"].mean()
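
As a minimal sketch of this filter-then-aggregate approach, the snippet below builds a small made-up DataFrame. The column names `cluster` and `page` follow this answer; the values are invented purely for illustration.

```python
import pandas as pd

# Hypothetical data; only the column names come from the answer above.
df = pd.DataFrame({
    "cluster": [1, 1, 2, 2, 3, 3],
    "page": [10, 20, 30, 40, 50, 60],
})

# Keep only the rows whose cluster is 1 or 2.
newdf = df[df["cluster"].isin([1, 2])]

# Statistics of a single column over the filtered rows.
page_mean = newdf["page"].mean()
page_min = newdf["page"].min()
page_max = newdf["page"].max()
page_median = newdf["page"].median()
```

Calling `.mean()` on a single column (a Series) takes no `axis` argument here, since a Series has only one axis.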

Answered by sparc_spread

If you meant taking the mean only where Cluster is 1 or 2, then the other answers here address your issue. If you meant taking a separate mean for each value of Cluster, you can use pandas' aggregation functions, including groupby and agg:

df.groupby("Cluster").mean()

is the simplest form and takes the mean of every column, grouped by Cluster.

df.groupby("Cluster").agg({"duration" : np.mean}) 

is an example where you are taking the mean of just one specific column, grouped by cluster. You can also use np.min, np.max, np.median, etc.

The groupby method produces a GroupBy object, which resembles a DataFrame but is not one. Think of it as the DataFrame, grouped and waiting for an aggregation to be applied to it. The GroupBy object has simple built-in aggregation functions that apply to all columns (the mean() in the first example), and also a more general aggregation function (the agg() in the second example) that you can use to apply specific functions in a variety of ways. One way of using it is to pass a dict that maps column names to functions, so that specific functions can be applied to specific columns.
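
The two styles of aggregation described here can be sketched as follows. The data is invented, and the `page` column is an assumption added for illustration; only `Cluster` and `duration` appear in this answer. The sketch passes function names as strings ('mean', 'min', ...), which pandas accepts as an alternative to NumPy functions such as np.mean.

```python
import pandas as pd

# Hypothetical data; the values and the "page" column are invented.
df = pd.DataFrame({
    "Cluster": [1, 1, 2, 2, 3],
    "duration": [2.0, 4.0, 6.0, 10.0, 5.0],
    "page": [1, 3, 5, 7, 9],
})

# A GroupBy object: the rows are grouped, but nothing is aggregated yet.
grouped = df.groupby("Cluster")

# Built-in aggregation applied to every column.
all_means = grouped.mean()

# Dict form of agg: a separate list of functions for each column.
per_column = grouped.agg({
    "duration": ["mean", "median"],
    "page": ["min", "max"],
})
```

Because each dict value is a list, the columns of `per_column` form a two-level index, so a single result is read with a tuple such as `per_column.loc[2, ("duration", "mean")]`.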

Answered by evan54

Simple intuitive answer

First pick the rows of interest, then average, then pick the columns of interest.

clusters_of_interest = [1, 2]
columns_of_interest = ['page']

# rows of interest
newdf = df[df.CLUSTER.isin(clusters_of_interest)]
# average and pick columns of interest
newdf.mean(axis=0)[columns_of_interest]

More advanced

# Create a GroupBy object according to the value in the 'CLUSTER' column
grp = df.groupby('CLUSTER')
# apply the functions of interest to every cluster grouping
data_agg = grp.agg(['mean', 'max', 'min'])

This is also a good link which describes aggregation techniques. Note that the "simple answer" computes one average over clusters 1 AND 2 together (or whatever is specified in clusters_of_interest), while the .agg function averages separately over each group of rows that share the same CLUSTER value.
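
To make that difference concrete, here is a small sketch on invented data. The column names `CLUSTER` and `page` follow this answer; the values are chosen only so that the two results differ visibly.

```python
import pandas as pd

# Hypothetical data: cluster 1 has three rows, clusters 2 and 3 have one each.
df = pd.DataFrame({
    "CLUSTER": [1, 1, 1, 2, 3],
    "page": [1.0, 2.0, 3.0, 10.0, 100.0],
})

# "Simple answer": a single mean pooled over all rows in clusters 1 and 2.
pooled = df[df.CLUSTER.isin([1, 2])]["page"].mean()

# "More advanced": one mean per distinct CLUSTER value.
per_group = df.groupby("CLUSTER")["page"].mean()
```

Here `pooled` is a single number, while `per_group` is a Series indexed by CLUSTER with one mean per group, including cluster 3.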

Answered by jotasi

You can do it in one line, using boolean indexing. For example you can do something like:

import numpy as np
import pandas as pd

# This will just produce an example DataFrame
df = pd.DataFrame({'a': np.arange(30), 'Cluster': np.ones(30, dtype=int)})  # np.int was removed from NumPy; plain int works
df.loc[10:19, "Cluster"] *= 2
df.loc[20:,   "Cluster"] *= 3

# This line is all you need
df.loc[(df['Cluster']==1)|(df['Cluster']==2), 'a'].mean()

The boolean indexing array is True for the rows in the desired clusters. a is just the name of the column to compute the mean over.
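
The same boolean-indexing idea extends to the other statistics in the question. The sketch below is one possible variant, not part of the original answer: it builds the mask with `isin` instead of chaining `==` comparisons, and uses `Series.agg` with a list of function names to get several statistics at once. The names `mask` and `stats` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Same layout as the example above: 30 rows, 10 rows each in clusters 1, 2 and 3.
df = pd.DataFrame({"a": np.arange(30), "Cluster": np.repeat([1, 2, 3], 10)})

# Boolean mask: True for the rows in cluster 1 or 2.
mask = df["Cluster"].isin([1, 2])

# Several statistics of column "a" over the selected rows in one call.
stats = df.loc[mask, "a"].agg(["mean", "min", "max", "median"])
```

`stats` is a Series indexed by the function names, so an individual value is read as, for example, `stats["median"]`.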
