Python pandas: unique combinations of values in selected columns of a data frame, and their counts

Question

Asked by Ratchainant Thammasudjarit

I have my data in a pandas data frame as follows:

import pandas as pd

df1 = pd.DataFrame({'A':['yes','yes','yes','yes','no','no','yes','yes','yes','no'],
                    'B':['yes','no','no','no','yes','yes','no','yes','yes','no']})

So my data looks like this:

----------------------------
index         A        B
0           yes      yes
1           yes       no
2           yes       no
3           yes       no
4            no      yes
5            no      yes
6           yes       no
7           yes      yes
8           yes      yes
9            no       no
-----------------------------

I would like to transform it into another data frame. The expected output can be built with the following Python script:

output = pd.DataFrame({'A':['no','no','yes','yes'],'B':['no','yes','no','yes'],'count':[1,2,4,3]})

So my expected output looks like this:

--------------------------------------------
index      A       B       count
--------------------------------------------
0         no       no        1
1         no      yes        2
2        yes       no        4
3        yes      yes        3
--------------------------------------------

I can already find all combinations and count them with the following command: mytable = df1.groupby(['A','B']).size()

However, the combinations end up in a single column. I would like to split each value of a combination into its own column and add one more column holding the count. Is it possible to do that? May I have your suggestions? Thank you in advance.

Answer 1

Accepted answer by EdChum

You can groupby on columns 'A' and 'B', call size, then reset_index and rename the generated column:

In [26]:

df1.groupby(['A','B']).size().reset_index().rename(columns={0:'count'})
Out[26]:
     A    B  count
0   no   no      1
1   no  yes      2
2  yes   no      4
3  yes  yes      3

Update

A little explanation: grouping on the two columns collects the rows whose A and B values are the same. Calling size then returns the number of rows in each group:

In[202]:
df1.groupby(['A','B']).size()

Out[202]: 
A    B  
no   no     1
     yes    2
yes  no     4
     yes    3
dtype: int64

So now to restore the grouped columns, we call reset_index:

In[203]:
df1.groupby(['A','B']).size().reset_index()

Out[203]: 
     A    B  0
0   no   no  1
1   no  yes  2
2  yes   no  4
3  yes  yes  3

This restores the indices, but the size aggregation ends up in a generated column named 0, so we have to rename it:

In[204]:
df1.groupby(['A','B']).size().reset_index().rename(columns={0:'count'})

Out[204]: 
     A    B  count
0   no   no      1
1   no  yes      2
2  yes   no      4
3  yes  yes      3
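The reset and rename steps above can also be collapsed into a single call. This is a minimal sketch, assuming the same df1 as in the question; it relies on the name argument of Series.reset_index, which labels the column that holds the sizes.

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['yes', 'yes', 'yes', 'yes', 'no', 'no', 'yes', 'yes', 'yes', 'no'],
                    'B': ['yes', 'no', 'no', 'no', 'yes', 'yes', 'no', 'yes', 'yes', 'no']})

# size() returns a Series indexed by (A, B). reset_index(name=...) moves the
# index levels back into columns and names the value column in the same step,
# so no separate rename is needed.
counts = df1.groupby(['A', 'B']).size().reset_index(name='count')
print(counts)
```

The result should have the same three columns (A, B, count) as the renamed frame in Out[204].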

groupby does accept the argument as_index, which we could have set to False so that the grouped columns do not become the index. In the pandas version used here, however, size still returns a Series, so you would still have to restore the indices and so on:

In[205]:
df1.groupby(['A','B'], as_index=False).size()

Out[205]: 
A    B  
no   no     1
     yes    2
yes  no     4
     yes    3
dtype: int64

Answer 2

Answer by Martin Alexandersson

Slightly related: I was looking for the unique combinations, and I came up with this method:

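def unique_columns(df, columns):
    # One entry per row of df; filled in below with True/False.
    result = pd.Series(index=df.index, dtype=object)

    # Group rows by their combination of values in `columns`.
    groups = df.groupby(by=columns)
    for name, group in groups:
        # A combination is unique if exactly one row carries it.
        is_unique = len(group) == 1
        result.loc[group.index] = is_unique

    # Every row should have been visited by some group.
    assert not result.isnull().any()

    return result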
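The same per-row flag can probably be obtained without an explicit loop. The sketch below is an alternative technique rather than the answerer's method: it uses DataFrame.duplicated with keep=False, and the helper name unique_rows is hypothetical.

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['yes', 'yes', 'yes', 'yes', 'no', 'no', 'yes', 'yes', 'yes', 'no'],
                    'B': ['yes', 'no', 'no', 'no', 'yes', 'yes', 'no', 'yes', 'yes', 'no']})


def unique_rows(df, columns):
    """Hypothetical helper: True where the row's combination occurs exactly once."""
    # keep=False marks every member of a repeated combination as a duplicate,
    # so negating the mask leaves only combinations that appear a single time.
    return ~df.duplicated(subset=columns, keep=False)


flags = unique_rows(df1, ['A', 'B'])
print(flags)
```

For df1, only the last row (no, no) has a combination that occurs once, so it is the only row flagged True.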

And if you only want to assert that all combinations are unique:

df1.set_index(['A','B']).index.is_unique

Answer 3

Answer by Paul Rougieux

Placing @EdChum's very nice answer into a function count_unique_index. The unique method only works on pandas series, not on data frames. The function below reproduces the behavior of the unique function in R:

unique returns a vector, data frame or array like x but with duplicate elements/rows removed.

And adds a count of the occurrences as requested by the OP.

df1 = pd.DataFrame({'A':['yes','yes','yes','yes','no','no','yes','yes','yes','no'],
                    'B':['yes','no','no','no','yes','yes','no','yes','yes','no']})

def count_unique_index(df, by):
    return df.groupby(by).size().reset_index().rename(columns={0:'count'})

count_unique_index(df1, ['A','B'])
     A    B  count
0   no   no      1
1   no  yes      2
2  yes   no      4
3  yes  yes      3
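Newer pandas releases also offer DataFrame.value_counts, which counts unique rows directly. The following is a sketch, assuming pandas 1.1 or later (where this method is available); the sort step is only there to reproduce the ordering of the expected output.

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['yes', 'yes', 'yes', 'yes', 'no', 'no', 'yes', 'yes', 'yes', 'no'],
                    'B': ['yes', 'no', 'no', 'no', 'yes', 'yes', 'no', 'yes', 'yes', 'no']})

counts = (
    df1.value_counts(subset=['A', 'B'])  # Series of counts indexed by (A, B)
       .reset_index(name='count')        # index levels back into columns
       .sort_values(['A', 'B'])          # value_counts orders by count, not by key
       .reset_index(drop=True)           # renumber rows 0..n-1 after sorting
)
print(counts)
```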
