Pandas: Counting unique values in a dataframe
Source: http://stackoverflow.com/questions/21633580/
Note: this page reproduces a popular Stack Overflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep the same license and attribute it to the original authors on Stack Overflow.
Asked by jeffalstott
We have a DataFrame that looks like this:
> df.loc[:2, :10]
     0    1    2    3    4    5    6    7    8    9   10
0  NaN  NaN  NaN  NaN    6    5  NaN  NaN    4  NaN    5
1  NaN  NaN  NaN  NaN    8  NaN  NaN    7  NaN  NaN    5
2  NaN  NaN  NaN  NaN  NaN    1  NaN  NaN  NaN  NaN  NaN
We simply want the counts of all unique values in the DataFrame. A simple solution is:
df.stack().value_counts()
However:
1. It looks like stack returns a copy, not a view, which is memory prohibitive in this case. Is this correct?
2. I want to group the DataFrame by rows, and then get the different histograms for each grouping. If we ignore the memory issues with stack and use it for now, how does one do the grouping correctly?
import numpy as np
import pandas as pd
from numpy import nan

d = pd.DataFrame([[nan, 1, nan, 2, 3],
                  [nan, 1, 1, 1, 3],
                  [nan, 1, nan, 2, 3],
                  [nan, 2, 2, 2, 3]])
len(d.stack())  # 14
d.stack().groupby(np.arange(4))
AssertionError: Grouper and axis must be same length
The stacked DataFrame has a MultiIndex, with a length of some number less than n_rows*n_columns, because the nans are removed.
0  1    1
   3    2
   4    3
1  0    1
   1    1
   2    1
   3    1
   4    3
....
This means we don't easily know how to build our grouping. It would be much better to just operate on the first level, but then I'm stuck on how to then apply the grouping I actually want.
d.stack().groupby(level=0).groupby(list('aabb'))
KeyError: 'a'
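One possible way to apply the row grouping to the stacked Series is to translate its first index level into group labels and group by that array. This is only a minimal sketch, not part of the original post: the names row_to_group and group_keys are made up for illustration, and .dropna() is added so the result does not depend on whether the installed pandas version drops NaN inside stack.

```python
import numpy as np
import pandas as pd

d = pd.DataFrame([[np.nan, 1, np.nan, 2, 3],
                  [np.nan, 1, 1, 1, 3],
                  [np.nan, 1, np.nan, 2, 3],
                  [np.nan, 2, 2, 2, 3]])

# Hypothetical mapping from each row label to the group it belongs to.
row_to_group = pd.Series(list("aabb"), index=d.index)

# Long format: one entry per non-NaN cell, indexed by (row, column).
stacked = d.stack().dropna()

# Look up the group of every stacked entry through its row label (level 0).
group_keys = np.asarray(stacked.index.get_level_values(0).map(row_to_group))

# Count each value within each group.
counts = stacked.groupby(group_keys).value_counts()
print(counts)
```

Because the key is derived from the stacked index itself, it always has the same length as the stacked Series, no matter how many NaNs were removed.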
Edit: A solution, which doesn't use stacking:
f = lambda x: pd.value_counts(x.values.ravel())
d.groupby(list('aabb')).apply(f)
a  1    4
   3    2
   2    1
b  2    4
   3    2
   1    1
dtype: int64
Looks clunky, though. If there's a better option I'm happy to hear it.
Edit: Dan's comment revealed I had a typo, though correcting that still doesn't get us to the finish line.
Accepted answer by Andy Hayden
I think you are doing a row/column-wise operation, so you can use apply:
In [11]: d.apply(pd.Series.value_counts, axis=1).fillna(0)
Out[11]:
   1  2  3
0  1  1  1
1  4  0  1
2  1  1  1
3  0  4  1
Note: There is a value_counts DataFrame method in the works for 0.14... which will make this more efficient and more concise.
It's worth noting that the pandas value_counts function also works on a numpy array, so you can pass it the values of the DataFrame (as a 1-d array view using np.ravel):
In [21]: pd.value_counts(d.values.ravel())
Out[21]:
2 6
1 6
3 4
dtype: int64
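Newer pandas releases (2.1 and later) deprecate the top-level pd.value_counts function in favour of the Series method, so the same idea can be written as follows. This is a minimal sketch that assumes d exactly as written in the question (first column all NaN); to_numpy() stands in for .values, and the variable names are only illustrative.

```python
import numpy as np
import pandas as pd

d = pd.DataFrame([[np.nan, 1, np.nan, 2, 3],
                  [np.nan, 1, 1, 1, 3],
                  [np.nan, 1, np.nan, 2, 3],
                  [np.nan, 2, 2, 2, 3]])

# Flatten every cell of the frame into one 1-d array (NaNs included).
flat = d.to_numpy().ravel()

# Series.value_counts drops NaN by default, so only real values are counted.
counts = pd.Series(flat).value_counts()
print(counts.sort_index())
```

With that definition of d, the values 1, 2 and 3 occur 5, 5 and 4 times.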
Also, you were pretty close to getting this correct, but you'd need to stack and unstack:
In [22]: d.stack().groupby(level=0).apply(pd.Series.value_counts).unstack().fillna(0)
Out[22]:
   1  2  3
0  1  1  1
1  4  0  1
2  1  1  1
3  0  4  1
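The same kind of per-row table can also be sketched with SeriesGroupBy.value_counts in place of apply. This assumes d exactly as written in the question (first column all NaN), so row 1 contains three 1s rather than four; the .dropna() call and fill_value=0 are additions for illustration, not something the answer prescribes.

```python
import numpy as np
import pandas as pd

d = pd.DataFrame([[np.nan, 1, np.nan, 2, 3],
                  [np.nan, 1, 1, 1, 3],
                  [np.nan, 1, np.nan, 2, 3],
                  [np.nan, 2, 2, 2, 3]])

per_row = (
    d.stack()
     .dropna()                   # keep only real values, whatever stack() does with NaN
     .groupby(level=0)           # level 0 of the stacked index is the original row label
     .value_counts()             # counts indexed by (row, value)
     .unstack(fill_value=0)      # values become columns; missing combinations become 0
)
print(per_row)
```

Passing fill_value=0 to unstack keeps the counts as integers, whereas a later fillna(0) would leave them as floats.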
This error seems somewhat self-explanatory (4 != 16):
len(d.stack())  # 16
d.stack().groupby(np.arange(4))
AssertionError: Grouper and axis must be same length
Perhaps you wanted to pass:
In [23]: np.repeat(np.arange(4), 4)
Out[23]: array([0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3])
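A positional key like this only lines up with the stacked Series when every row contributes the same number of non-NaN values. The sketch below, which assumes d exactly as written in the question, compares it with a key read from the stacked index; the names positional_key and row_key are hypothetical.

```python
import numpy as np
import pandas as pd

d = pd.DataFrame([[np.nan, 1, np.nan, 2, 3],
                  [np.nan, 1, 1, 1, 3],
                  [np.nan, 1, np.nan, 2, 3],
                  [np.nan, 2, 2, 2, 3]])

stacked = d.stack().dropna()

# Assumes 4 values per row, which is not true once NaNs are dropped.
positional_key = np.repeat(np.arange(4), 4)

# Read the row label of each stacked entry directly from the index.
row_key = stacked.index.get_level_values(0).to_numpy()

print(len(positional_key), len(row_key))

# Number of non-NaN values that each original row contributed.
per_row_sizes = stacked.groupby(row_key).size()
print(per_row_sizes)
```

Here the positional key has 16 entries while the stacked Series has only 14, so grouping by the index level is the safer choice.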
Answer by tegan
Not enough rep to comment, but Andy's answer:
pd.value_counts(d.values.ravel())
is what I have used personally, and seems to me to be by far the most versatile and easily-readable solution. Another advantage is that it is easy to use a subset of the columns:
pd.value_counts(d[[1,3,4,6,7]].values.ravel())
or
pd.value_counts(d[["col_title1","col_title2"]].values.ravel())
Is there any disadvantage to this approach, or any particular reason you want to use stack and groupby?

