Conditional mean over a Pandas DataFrame
Original source: http://stackoverflow.com/questions/44787916/
This content is from Stack Overflow and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors.
Asked by Oliver G
I have a dataset from which I want a few averages of multiple variables I created.
I started off with:
data2['socialIdeology2'].mean()
data2['econIdeology'].mean()
^ that works perfectly, and gives me the averages I'm looking for.
Now, I'm trying to do a conditional mean, that is, the mean only for a select group within the data set. (I want the ideologies broken down by whom respondents voted for in the 2016 election.) In Stata, the code would be similar to: mean(variable) if voteChoice == 'Clinton'
Looking into it, I came to the conclusion a conditional mean just isn't a thing (although hopefully I am wrong?), so I was writing my own function for it.
This is me just starting out with a 'mean' function, to create a foundation for a conditional mean function:
def mean():
    sum = 0.0
    count = 0
    for index in range(0, len(data2['socialIdeology2'])):
        sum = sum + (data2['socialIdeology2'][index])
        print(data2['socialIdeology2'][index])
        count = count + 1
    return sum / count

print(mean())
Yet I keep getting 'nan' as the result. Printing data2['socialIdeology2'][index] within the loop prints nan over and over again.
So my question is: if the data stored within the socialIdeology2 variable really is a nan (which I don't understand how it could be), why is it that the .mean() function works with it?
And how can I generate means by category?
Answered by Brad Solomon
Conditional mean is indeed a thing in pandas. You can use DataFrame.groupby():
means = data2.groupby('voteChoice').mean()
or maybe, in your case, the following would be more efficient:
means = data2.groupby('voteChoice')['socialIdeology2'].mean()
to drill down to the mean you're looking for. (The first case will calculate means for all columns.) This assumes that voteChoice is the name of the column you want to condition on.
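A minimal sketch of both calls, run on a made-up stand-in for data2. Only the column names come from the question; the rows and values are invented for illustration.

```python
import pandas as pd

# Hypothetical stand-in for data2; the values are invented for illustration.
data2 = pd.DataFrame({
    "voteChoice": ["Clinton", "Trump", "Clinton", "Trump", "Other"],
    "socialIdeology2": [2.0, 6.0, 3.0, 5.0, 4.0],
    "econIdeology": [1.0, 7.0, 3.0, 5.0, 4.0],
})

# Means of every remaining column, one row per voteChoice value.
all_means = data2.groupby("voteChoice").mean()

# Mean of a single column, one value per voteChoice value.
social_means = data2.groupby("voteChoice")["socialIdeology2"].mean()

print(all_means)
print(social_means)
```

The first result is a DataFrame indexed by voteChoice; the second is a Series with the same index, so a single group can be looked up with something like social_means["Clinton"].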
Answered by ali_m
If you're only interested in the mean for a single group (e.g. Clinton voters) then you could create a boolean series that is True for members of that group, then use this to index into the rows of the DataFrame before taking the mean:
voted_for_clinton = data2['voteChoice'] == 'Clinton'
mean_for_clinton_voters = data2.loc[voted_for_clinton, 'socialIdeology2'].mean()
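To make the intermediate boolean series visible, here is a small sketch on an invented data2 (the values are made up; only the column names come from the question):

```python
import pandas as pd

# Hypothetical stand-in for data2; the values are invented for illustration.
data2 = pd.DataFrame({
    "voteChoice": ["Clinton", "Trump", "Clinton", "Trump"],
    "socialIdeology2": [2.0, 6.0, 3.0, 5.0],
})

# One boolean per row: True where the row belongs to the group of interest.
voted_for_clinton = data2["voteChoice"] == "Clinton"
print(voted_for_clinton.tolist())  # [True, False, True, False]

# .loc keeps only the True rows and the named column, then the mean is taken.
mean_for_clinton_voters = data2.loc[voted_for_clinton, "socialIdeology2"].mean()
print(mean_for_clinton_voters)  # 2.5
```

This mirrors the Stata-style "mean(variable) if voteChoice == 'Clinton'" from the question: the condition becomes the row selector and the variable becomes the column selector.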
If you want to get the means for multiple groups simultaneously then you can use groupby, as in Brad's answer. However, I would do it like this:
means_by_vote_choice = data2.groupby('voteChoice')['socialIdeology2'].mean()
Placing the ['socialIdeology2'] index before the .mean() means that you only compute the mean over the column you're interested in, whereas if you place the indexing expression after the .mean() (i.e. data2.groupby('voteChoice').mean()['socialIdeology2']) this computes the means over all columns and then selects only the 'socialIdeology2' column from the result, which is less efficient.
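As a rough check that the two orderings differ only in how much work is done, not in the answer, here is a sketch on an invented data2 (values made up for illustration):

```python
import pandas as pd

# Hypothetical stand-in for data2; the values are invented for illustration.
data2 = pd.DataFrame({
    "voteChoice": ["Clinton", "Trump", "Clinton", "Trump"],
    "socialIdeology2": [2.0, 6.0, 3.0, 5.0],
    "econIdeology": [1.0, 7.0, 3.0, 5.0],
})

# Select the column first: only socialIdeology2 is averaged.
select_then_mean = data2.groupby("voteChoice")["socialIdeology2"].mean()

# Average first: every column is averaged, then one is kept.
mean_then_select = data2.groupby("voteChoice").mean()["socialIdeology2"]

# Both orderings should agree on the values for each group.
print(select_then_mean.equals(mean_then_select))
```

If data2 also held non-numeric columns, the all-columns form may additionally need numeric_only=True, depending on the pandas version, which is another reason to select the column first.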
See the pandas documentation for more info on indexing DataFrames using .loc and on groupby.
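On the nan part of the question, one possible explanation (an assumption, since the data itself is not shown) is that socialIdeology2 contains missing values. Series.mean() skips missing values by default (skipna=True), while a hand-written running sum does not, so a single NaN turns the whole total into NaN. A minimal sketch with invented values:

```python
import numpy as np
import pandas as pd

# Invented values; the NaN stands in for a missing survey response.
values = pd.Series([2.0, np.nan, 4.0], name="socialIdeology2")

# Series.mean() ignores NaN by default (skipna=True).
print(values.mean())

# A manual running sum does not: once NaN is added, the total stays NaN.
total = 0.0
for v in values:
    total = total + v
print(total / len(values))

# Dropping the missing values first reproduces what .mean() reports.
clean = values.dropna()
print(clean.sum() / len(clean))
```

Under that assumption, the built-in .mean() and the groupby-based means above already handle the missing values, so no custom function is needed.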