Original URL: http://stackoverflow.com/questions/35268817/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me):
StackOverflow
unique combinations of values in selected columns in pandas data frame and count
Asked by Ratchainant Thammasudjarit
I have my data in pandas data frame as follows:
import pandas as pd

df1 = pd.DataFrame({'A':['yes','yes','yes','yes','no','no','yes','yes','yes','no'],
                    'B':['yes','no','no','no','yes','yes','no','yes','yes','no']})
So, my data looks like this
----------------------------
index A B
0 yes yes
1 yes no
2 yes no
3 yes no
4 no yes
5 no yes
6 yes no
7 yes yes
8 yes yes
9 no no
-----------------------------
I would like to transform it to another data frame. The expected output can be shown in the following python script:
output = pd.DataFrame({'A':['no','no','yes','yes'],'B':['no','yes','no','yes'],'count':[1,2,4,3]})
So, my expected output looks like this
--------------------------------------------
index A B count
--------------------------------------------
0 no no 1
1 no yes 2
2 yes no 4
3 yes yes 3
--------------------------------------------
Actually, I can find all combinations and count them with the following command: mytable = df1.groupby(['A','B']).size()
However, the combinations end up together in the index rather than in separate columns. I would like to put each value of a combination into its own column and add one more column for the count. Is it possible to do that? May I have your suggestions? Thank you in advance.
Accepted answer by EdChum
You can groupby on columns 'A' and 'B', call size, then reset_index and rename the generated column:
In [26]:
df1.groupby(['A','B']).size().reset_index().rename(columns={0:'count'})
Out[26]:
A B count
0 no no 1
1 no yes 2
2 yes no 4
3 yes yes 3
Update
A little explanation: grouping on the 2 columns groups the rows where the A and B values are the same. We then call size, which returns the number of rows in each group:
In[202]:
df1.groupby(['A','B']).size()
Out[202]:
A    B
no   no     1
     yes    2
yes  no     4
     yes    3
dtype: int64
So now, to restore the grouped columns, we call reset_index:
In[203]:
df1.groupby(['A','B']).size().reset_index()
Out[203]:
A B 0
0 no no 1
1 no yes 2
2 yes no 4
3 yes yes 3
This turns the index levels back into columns, but the size aggregation becomes a generated column named 0, so we have to rename it:
In[204]:
df1.groupby(['A','B']).size().reset_index().rename(columns={0:'count'})
Out[204]:
A B count
0 no no 1
1 no yes 2
2 yes no 4
3 yes yes 3
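The three steps above can also be run as one self-contained script. This is a minimal sketch, not part of the original answer: it uses the name argument of Series.reset_index to label the count column directly, which should give the same result as the reset_index().rename(columns={0:'count'}) chain.

```python
import pandas as pd

df1 = pd.DataFrame({
    'A': ['yes', 'yes', 'yes', 'yes', 'no', 'no', 'yes', 'yes', 'yes', 'no'],
    'B': ['yes', 'no', 'no', 'no', 'yes', 'yes', 'no', 'yes', 'yes', 'no'],
})

# size() returns a Series indexed by the (A, B) combinations.
# reset_index(name=...) moves the index levels back into columns and
# names the value column in the same call, so no rename step is needed.
counts = df1.groupby(['A', 'B']).size().reset_index(name='count')
print(counts)
```

Passing name avoids depending on the auto-generated column label 0.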
groupby does accept the arg as_index, which we could have set to False so that it doesn't make the grouped columns the index, but in the pandas version shown here this still generates a Series, and you'd still have to restore the indices and so on:
In[205]:
df1.groupby(['A','B'], as_index=False).size()
Out[205]:
A    B
no   no     1
     yes    2
yes  no     4
     yes    3
dtype: int64
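The Series output above reflects the pandas release the answer was written against. From pandas 1.1 onward, size() on a groupby created with as_index=False returns a DataFrame whose count column is named size. The sketch below is an illustration rather than part of the original answer; it handles both cases so the result has a count column either way.

```python
import pandas as pd

df1 = pd.DataFrame({
    'A': ['yes', 'yes', 'yes', 'yes', 'no', 'no', 'yes', 'yes', 'yes', 'no'],
    'B': ['yes', 'no', 'no', 'no', 'yes', 'yes', 'no', 'yes', 'yes', 'no'],
})

out = df1.groupby(['A', 'B'], as_index=False).size()

if isinstance(out, pd.Series):
    # Older pandas: still a Series indexed by (A, B), as in Out[205].
    out = out.reset_index(name='count')
else:
    # pandas >= 1.1: already a DataFrame with columns A, B, size.
    out = out.rename(columns={'size': 'count'})

print(out)
```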
Answer by Martin Alexandersson
Slightly related, I was looking for the unique combinations and I came up with this method:
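def unique_columns(df, columns):
    # Returns a Series aligned with df.index: True where the row's
    # combination of values in `columns` occurs exactly once.
    result = pd.Series(index=df.index, dtype=object)
    groups = df.groupby(by=columns)  # was meta_data_csv, an undefined name
    for name, group in groups:
        is_unique = len(group) == 1
        result.loc[group.index] = is_unique
    # Fails if any row was never assigned (e.g. NaN in a grouping column)
    assert not result.isnull().any()
    return result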
And if you only want to assert that all combinations are unique:
df1.set_index(['A','B']).index.is_unique
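A vectorized alternative to the loop above, offered as a sketch rather than as this answer's method, is DataFrame.duplicated with keep=False, which marks every row whose combination appears more than once. The small frame df and the column names here are made up for illustration.

```python
import pandas as pd

# Hypothetical example data: the first two rows share the same (A, B) pair.
df = pd.DataFrame({'A': ['x', 'x', 'y', 'z'], 'B': [1, 1, 2, 3]})

# keep=False flags all members of a duplicated combination, so negating it
# leaves True only for rows whose combination occurs exactly once.
is_unique_row = ~df.duplicated(subset=['A', 'B'], keep=False)

# Single boolean: are all (A, B) combinations distinct?
all_unique = not df.duplicated(subset=['A', 'B']).any()

print(is_unique_row.tolist())
print(all_unique)
```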
Answer by Paul Rougieux
Placing @EdChum's very nice answer into a function count_unique_index.
The unique method only works on pandas series, not on data frames.
The function below reproduces the behavior of the unique function in R:
unique returns a vector, data frame or array like x but with duplicate elements/rows removed.
It also adds a count of the occurrences, as requested by the OP.
df1 = pd.DataFrame({'A':['yes','yes','yes','yes','no','no','yes','yes','yes','no'],
                    'B':['yes','no','no','no','yes','yes','no','yes','yes','no']})
def count_unique_index(df, by):
    return df.groupby(by).size().reset_index().rename(columns={0:'count'})
count_unique_index(df1, ['A','B'])
A B count
0 no no 1
1 no yes 2
2 yes no 4
3 yes yes 3
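For comparison, pandas has two built-in methods that cover the same ground. This is a sketch under the assumption of pandas 1.1 or later, where DataFrame.value_counts is available: drop_duplicates is the closest analogue of R's unique for data frames, and value_counts counts the combinations directly.

```python
import pandas as pd

df1 = pd.DataFrame({
    'A': ['yes', 'yes', 'yes', 'yes', 'no', 'no', 'yes', 'yes', 'yes', 'no'],
    'B': ['yes', 'no', 'no', 'no', 'yes', 'yes', 'no', 'yes', 'yes', 'no'],
})

# Distinct (A, B) rows, in order of first appearance (no counts).
distinct = df1[['A', 'B']].drop_duplicates()

# value_counts sorts by frequency by default; sort_index restores the
# alphabetical (A, B) order used in the groupby output above.
counts = (
    df1.value_counts(subset=['A', 'B'])
       .sort_index()
       .reset_index(name='count')
)
print(counts)
```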