pandas group by with mode as aggregator
Note: this page reproduces a popular Stack Overflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and authors, and attribute it to the original authors (not me): Stack Overflow
Original URL: http://stackoverflow.com/questions/36508487/
Asked by Josh
I've got a set of survey responses that I'm trying to analyze with pandas. My goal is to find (for this example) the most common gender in each county in the US, so I use the following code:
import pandas as pd
from scipy import stats
file['sex'].groupby(file['county']).agg([('modeSex', stats.mode)])
The output is:
How can I unpack this to only get the mode value and not the second value that tells how often the mode occurs?
Here is a sample of the data frame:
county | sex
-------------
079    | 1
079    | 2
079    | 2
075    | 1
075    | 1
075    | 1
075    | 2
Desired output is:
county | modeSex
-----------------
079    | 2
075    | 1
Accepted answer by ayhan
Pandas complains about the returned array when you use stats.mode(x)[0] (I guess a pandas cell cannot hold a NumPy array), so you can convert it to a list or a tuple:
import numpy as np
df = pd.DataFrame({"C1": np.random.randint(10, size=100), "C2": np.random.choice(["X", "Y", "Z"], size=100)})
print(df.groupby(['C2']).agg(lambda x: tuple(stats.mode(x)[0])))
Out:
C1
C2
X (0,)
Y (4,)
Z (3,)
Since there can be multiple modes, you'll need tuples or lists if you want to keep all of them. If you only want the first mode, you can extract it:
df.groupby(['C2']).agg(lambda x: stats.mode(x)[0][0])
Out:
C1
C2
X 0
Y 4
Z 3
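As a possible alternative that avoids SciPy, here is a minimal sketch using pandas' own Series.mode on the sample data from the question. The DataFrame is rebuilt by hand here, and the names mode_sex and all_modes are illustrative assumptions rather than part of the original answer. Series.mode returns every mode in sorted order, so taking the first element picks the smallest value when there is a tie.

```python
import pandas as pd

# Sample data from the question; county is kept as a string to preserve leading zeros
df = pd.DataFrame({
    "county": ["079", "079", "079", "075", "075", "075", "075"],
    "sex": [1, 2, 2, 1, 1, 1, 2],
})

# Series.mode() returns all modes sorted ascending; .iloc[0] keeps only the first one
mode_sex = (
    df.groupby("county")["sex"]
    .agg(lambda s: s.mode().iloc[0])
    .rename("modeSex")
)

# To keep every mode (relevant when there are ties), collect them into a list per group
all_modes = df.groupby("county")["sex"].apply(lambda s: s.mode().tolist())

print(mode_sex)
print(all_modes)
```

Picking the first mode silently breaks ties in favor of the smallest value, so the list form may be the safer choice when ties matter.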
Answer by sid
scipy.stats.mode returns two things: the array of modal values and the array of counts for each mode. So we can use stats.mode(a)[0] to get only the modal values and drop the counts.
Here is the code:
import pandas as pd
from scipy import stats
# sample data frame
df2 = pd.DataFrame({'X' : ['B', 'B', 'A', 'A'], 'Y' : [1, 2, 3, 4]})
# use lambda functions
print(df2.groupby(['X']).agg({'Y': lambda x: stats.mode(x)[0]}))
Output:
Y
X
A 3
B 1
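Depending on the installed SciPy version, stats.mode may return either length-one arrays or plain scalars for one-dimensional input, so indexing such as [0][0] may not behave the same everywhere. The sketch below is one way to make the unpacking tolerant of both cases; first_mode is a hypothetical helper name, not something from the answers above.

```python
import numpy as np
import pandas as pd
from scipy import stats


def first_mode(values):
    """Return only the modal value of a group, discarding the count."""
    result = stats.mode(np.asarray(values))
    # result.mode may be an array or a scalar depending on the SciPy version;
    # np.atleast_1d normalizes both cases so that [0] is always valid
    return np.atleast_1d(result.mode)[0]


# Sample data frame from the answer above
df2 = pd.DataFrame({"X": ["B", "B", "A", "A"], "Y": [1, 2, 3, 4]})

modes = df2.groupby("X")["Y"].agg(first_mode)
print(modes)
```

Every value in this sample occurs once per group, so stats.mode falls back to the smallest value in each group.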