pandas: Get count of all unique rows in a pandas DataFrame

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/34255882/


Get count of all unique rows in pandas dataframe

python numpy pandas

Asked by Yashu Seth

I have a Pandas DataFrame -

>>> import numpy as np
>>> import pandas as pd
>>> data = pd.DataFrame(np.random.randint(low=0, high=2,size=(5,3)),
...                       columns=['A', 'B', 'C'])
>>> data
   A  B  C
0  0  1  0
1  1  0  1
2  1  0  1
3  0  1  1
4  1  1  0

Now I use this to get the count of rows for column A only:

>>> data.loc[:, 'A'].value_counts()   # .ix is removed in recent pandas; .loc is equivalent here
1    3
0    2
dtype: int64

What is the most efficient way to get the count of rows for columns A and B, i.e. something like the following output -

0    0    0
0    1    2
1    0    2
1    1    1

And then, finally, how can I convert it into a numpy array such as -

array([[0, 2],
       [2, 1]])

Please give a solution that is also consistent with

>>> data = pd.DataFrame(np.random.randint(low=0, high=2, size=(5,2)),
...                       columns=['A', 'B'])

Answered by Andy Hayden

You can use groupby size and then unstack:

In [11]: data.groupby(["A","B"]).size()
Out[11]:
A  B
0  1    2
1  0    2
   1    1
dtype: int64

In [12]: data.groupby(["A","B"]).size().unstack("B")
Out[12]:
B   0  1
A
0 NaN  2
1   2  1

In [13]: data.groupby(["A","B"]).size().unstack("B").fillna(0)
Out[13]:
B  0  1
A
0  0  2
1  2  1
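
Recent pandas versions let you fold the fillna step into unstack via its fill_value argument, and to_numpy() (or .values on older versions) then gives the array the question asks for. A minimal sketch, assuming the data frame from the question and a reasonably recent pandas:

# unstack accepts fill_value, so the separate fillna call is not needed
table = data.groupby(["A", "B"]).size().unstack("B", fill_value=0)

# to_numpy() yields the requested array, e.g. array([[0, 2], [2, 1]]) for the data above
arr = table.to_numpy()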

However, whenever you do a groupby followed by an unstack you should think: pivot_table:

In [21]: data.pivot_table(index="A", columns="B", aggfunc="count", fill_value=0)
Out[21]:
   C
B  0  1
A
0  0  2
1  2  1

This will be the most efficient solution as well as being the most direct.

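One caveat: with aggfunc="count", pivot_table counts the remaining value columns (C here), so for the two-column frame at the end of the question there is nothing left to count. A hedged sketch of two standard alternatives, aggfunc="size" and pd.crosstab, assuming the same imports as above:

data = pd.DataFrame(np.random.randint(low=0, high=2, size=(5, 2)),
                    columns=['A', 'B'])

# "size" counts rows per (A, B) group, so no extra value column is required
table = data.pivot_table(index="A", columns="B", aggfunc="size", fill_value=0)

# pd.crosstab(data['A'], data['B']) is an equivalent shortcut for this contingency count
arr = table.to_numpy()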

回答by Anton Protopopov

You could use groupby on the A and B columns and then call count on the result. But with that you'll get only the value combinations that actually appear in your original dataframe; in your case you won't have a 0 0 count. After that you can call the values method to get a numpy array:

In [52]: df
Out[52]: 
   A  B  C
0  0  1  0
1  1  0  1
2  1  0  1
3  0  1  1
4  1  1  0

In [56]: df.groupby(['A', 'B'], as_index=False).count()
Out[56]: 
   A  B  C
0  0  1  2
1  1  0  2
2  1  1  1

In [57]: df.groupby(['A', 'B'], as_index=False).count().C.values
Out[57]: array([2, 2, 1])

Then you could use the reshape method of the numpy array.

For a dataframe in which all value combinations are present:

In [71]: df
Out[71]: 
   A  B  C
0  1  0  1
1  1  1  1
2  1  0  1
3  1  1  0
4  0  1  1
5  0  0  1
6  1  1  1
7  0  0  1
8  0  1  0
9  1  1  0

In [73]: df.groupby(['A', 'B'], as_index=False).count()
Out[73]: 
   A  B  C
0  0  0  2
1  0  1  2
2  1  0  2
3  1  1  4


In [75]: df.groupby(['A', 'B'], as_index=False).count().C.values.reshape(2,2)
Out[75]: 
array([[2, 2],
       [2, 4]])
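
As noted above, combinations that never occur (0 0 in the five-row example) are simply absent from the groupby result, so a blind reshape can misalign or fail. A minimal sketch, using the df from this answer and assuming A and B are binary, that reindexes against every combination before reshaping:

counts = df.groupby(['A', 'B']).size()

# build the complete set of (A, B) combinations and fill the missing ones with 0
full_index = pd.MultiIndex.from_product([[0, 1], [0, 1]], names=['A', 'B'])
counts = counts.reindex(full_index, fill_value=0)

arr = counts.values.reshape(2, 2)  # rows correspond to A, columns to B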

Answered by Alexander

Assuming that all of your data is binary, you can just sum the columns. To be safe, you can then use count to get the total of all non-null values in each column (the difference between this count and the previous sum is the number of zeros).

>>> s = data[['A', 'B']].sum().values
>>> np.matrix([s, data[['A', 'B']].count().values - s])
matrix([[3, 3],
        [2, 2]])

If you are sure that there are no null values, you can save some computational time by just taking the number of rows from the first shape parameter.

如果您确定没有空值,则可以通过仅从第一个形状参数中获取行数来节省一些计算时间。

>>> np.matrix([s, data.shape[0] - s])
matrix([[3, 3],
        [2, 2]])
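
np.matrix is discouraged in current NumPy, so the same idea can be written with a plain ndarray. A minimal sketch, assuming the five-row data frame from the question and no null values:

ones = data[['A', 'B']].sum().values   # number of 1s in each column
zeros = data.shape[0] - ones           # number of 0s per column, assuming no nulls
result = np.array([ones, zeros])       # e.g. array([[3, 3], [2, 2]]) for the data above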