Note: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me): Stack Overflow
Original source: http://stackoverflow.com/questions/34255882/
Get count of all unique rows in pandas dataframe
Asked by Yashu Seth
I have a Pandas DataFrame -
>>> import numpy as np
>>> import pandas as pd
>>> data = pd.DataFrame(np.random.randint(low=0, high=2,size=(5,3)),
... columns=['A', 'B', 'C'])
>>> data
A B C
0 0 1 0
1 1 0 1
2 1 0 1
3 0 1 1
4 1 1 0
Now I use this to get the row counts for column A only:
>>> data.loc[:, 'A'].value_counts()
1 3
0 2
dtype: int64
What is the most efficient way to get the row counts for each combination of values in columns A and B, i.e., something like the following output:
0 0 0
0 1 2
1 0 2
1 1 1
And finally, how can I convert that into a numpy array such as:
array([[0, 2],
[2, 1]])
Please give a solution that also works for a two-column DataFrame such as:
>>> data = pd.DataFrame(np.random.randint(low=0, high=2,size=(5,2)),
... columns=['A', 'B'])
Answered by Andy Hayden
You can use groupby with size, and then unstack:
In [11]: data.groupby(["A","B"]).size()
Out[11]:
A B
0 1 2
1 0 2
1 1
dtype: int64
In [12]: data.groupby(["A","B"]).size().unstack("B")
Out[12]:
B 0 1
A
0 NaN 2
1 2 1
In [13]: data.groupby(["A","B"]).size().unstack("B").fillna(0)
Out[13]:
B 0 1
A
0 0 2
1 2 1
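As a minimal sketch of the same steps carried through to the numpy array the question asks for: the frame below is rebuilt by hand from the question's printed example (an assumption, since the original was random), and `fill_value=0` is passed to `unstack` directly rather than calling `fillna` afterwards, so the counts stay integers.

```python
import numpy as np
import pandas as pd

# Same five rows as the question's example frame (hard-coded here,
# because the original was generated with np.random.randint).
data = pd.DataFrame({"A": [0, 1, 1, 0, 1],
                     "B": [1, 0, 0, 1, 1],
                     "C": [0, 1, 1, 1, 0]})

# fill_value=0 fills absent (A, B) pairs during the unstack itself,
# so no NaN is introduced and the dtype is not upcast to float.
counts = data.groupby(["A", "B"]).size().unstack("B", fill_value=0)

# Rows follow the sorted values of A, columns the sorted values of B.
arr = counts.to_numpy()
print(arr)
```

This only fills pairs whose A value and B value each occur somewhere in the data; a value that never appears at all gets no row or column.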
However, whenever you do a groupby followed by an unstack you should think: pivot_table:
In [21]: data.pivot_table(index="A", columns="B", aggfunc="count", fill_value=0)
Out[21]:
C
B 0 1
A
0 0 2
1 2 1
This will be the most efficient solution as well as being the most direct.
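The pivot_table call above counts the non-null entries of the remaining column C, so it relies on a third column being present. For the two-column frame mentioned in the question, one possible alternative (not part of the original answer) is `pd.crosstab`, sketched here on hand-written data that mirrors columns A and B of the example:

```python
import pandas as pd

# Two-column frame; the values mirror columns A and B of the question's example.
data = pd.DataFrame({"A": [0, 1, 1, 0, 1],
                     "B": [1, 0, 0, 1, 1]})

# crosstab tabulates how often each (A, B) pair occurs;
# pairs that never occur are reported as 0.
table = pd.crosstab(data["A"], data["B"])

arr2 = table.to_numpy()
print(table)
print(arr2)
```

No claim is made here about which of the approaches is fastest; that would need to be timed on the data at hand.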
Answered by Anton Protopopov
You could use groupby on the A and B columns and then call count on the result. That only returns the combinations that actually occur in your original DataFrame, so in your case there will be no count for 0 0. After that you can use the values attribute to get a numpy array:
In [52]: df
Out[52]:
A B C
0 0 1 0
1 1 0 1
2 1 0 1
3 0 1 1
4 1 1 0
In [56]: df.groupby(['A', 'B'], as_index=False).count()
Out[56]:
A B C
0 0 1 2
1 1 0 2
2 1 1 1
In [57]: df.groupby(['A', 'B'], as_index=False).count().C.values
Out[57]: array([2, 2, 1])
Then you could use the reshape method of the numpy array.
For a DataFrame in which every combination of values occurs:
In [71]: df
Out[71]:
A B C
0 1 0 1
1 1 1 1
2 1 0 1
3 1 1 0
4 0 1 1
5 0 0 1
6 1 1 1
7 0 0 1
8 0 1 0
9 1 1 0
In [73]: df.groupby(['A', 'B'], as_index=False).count()
Out[73]:
A B C
0 0 0 2
1 0 1 2
2 1 0 2
3 1 1 4
In [75]: df.groupby(['A', 'B'], as_index=False).count().C.values.reshape(2,2)
Out[75]:
array([[2, 2],
[2, 4]])
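The reshape(2, 2) step needs exactly four counts, so it cannot be applied to the first example, where the 0 0 combination is missing and only three counts come back. A rough sketch of one way around that, assuming the possible values of A and B are known in advance (0 and 1 here), is to reindex the grouped sizes onto the full set of combinations before reshaping. The names `full_index` and `grid` are illustrative only.

```python
import pandas as pd

# First example frame from this answer, hard-coded.
df = pd.DataFrame({"A": [0, 1, 1, 0, 1],
                   "B": [1, 0, 0, 1, 1],
                   "C": [0, 1, 1, 1, 0]})

# One count per (A, B) pair that actually occurs: three entries here.
sizes = df.groupby(["A", "B"]).size()

# Assumption: A and B can only take the values 0 and 1.
full_index = pd.MultiIndex.from_product([[0, 1], [0, 1]], names=["A", "B"])

# Missing pairs get a count of 0, so there are always four entries to reshape.
grid = sizes.reindex(full_index, fill_value=0).to_numpy().reshape(2, 2)
print(grid)
```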
Answered by Alexander
Assuming that all of your data is binary, you can just sum the columns to count the ones. To be safe, you can then use count to get the number of non-null values in each column (the difference between this count and the sum is the number of zeros).
>>> s = data[['A', 'B']].sum().values
>>> np.matrix([s, data[['A', 'B']].count().values - s])
matrix([[3, 3],
[2, 2]])
If you are sure that there are no null values, you can save some computation by taking the number of rows from the first element of shape instead.
>>> np.matrix([s, data.shape[0] - s])
matrix([[3, 3],
[2, 2]])
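The same sum-and-count idea can be sketched with a plain ndarray instead of np.matrix, which the NumPy documentation no longer recommends. The frame is hard-coded from the question's example and the variable names are illustrative. Note that the result holds per-column totals (ones in the first row, zeros in the second), not counts of (A, B) combinations.

```python
import numpy as np
import pandas as pd

# Question's example frame, hard-coded; binary data is assumed.
data = pd.DataFrame({"A": [0, 1, 1, 0, 1],
                     "B": [1, 0, 0, 1, 1],
                     "C": [0, 1, 1, 1, 0]})

cols = data[["A", "B"]]

ones = cols.sum().to_numpy()        # number of 1s per column
non_null = cols.count().to_numpy()  # non-null entries per column
zeros = non_null - ones             # remaining entries are 0s

# First row: ones per column; second row: zeros per column.
result = np.vstack([ones, zeros])
print(result)
```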