pandas: Count occurrences of items in a Series in each row of a DataFrame
Notice: this page is derived from a popular StackOverFlow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverFlow
Original URL: http://stackoverflow.com/questions/24516361/
Asked by sriramn
I have a pandas.DataFrame that looks like this:
COL1 COL2 COL3
C1 None None
C1 C2 None
C1 C1 None
C1 C2 C3
For each row in this DataFrame, I would like to count the occurrences of each of C1, C2 and C3 and append this information as columns to the DataFrame. For instance, the first row has 1 C1, 0 C2 and 0 C3. The final DataFrame should look like this:
COL1 COL2 COL3 C1 C2 C3
C1 None None 1 0 0
C1 C2 None 1 1 0
C1 C1 None 2 0 0
C1 C2 C3 1 1 1
So, I have created a Series with C1, C2 and C3 as the values. One way to count this is to loop over the rows and columns of the DataFrame, then over this Series, and increment a counter on each match. But is there an apply approach that can achieve this in a compact fashion?
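For reference, the loop-based baseline described in the question might look like the sketch below. It assumes the missing entries are stored as the literal string 'None' (as the later answers suggest), and the names items, counts and result are only illustrative.

```python
import pandas as pd

# Example frame from the question; missing entries are assumed to be the string 'None'.
df = pd.DataFrame({
    "COL1": ["C1", "C1", "C1", "C1"],
    "COL2": ["None", "C2", "C1", "C2"],
    "COL3": ["None", "None", "None", "C3"],
})
items = pd.Series(["C1", "C2", "C3"])

# Naive baseline: visit every row, then every item, and count the matching cells.
counts = {item: [] for item in items}
for _, row in df.iterrows():
    for item in items:
        counts[item].append(int((row == item).sum()))

# Append one count column per item to the original frame.
result = df.assign(**counts)
print(result)
```

This is only meant as a point of comparison; the answers that follow replace the explicit loops with vectorized or apply-based calls.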
Answered by Andy Hayden
You could apply value_counts:
In [11]: df.apply(pd.Series.value_counts, axis=1)
Out[11]:
C1 C2 C3 None
0 1 NaN NaN 2
1 1 1 NaN 1
2 2 NaN NaN 1
3 1 1 1 NaN
So you can fill the NaN values and select just the base values you want to append:
In [12]: df.apply(pd.Series.value_counts, axis=1)[['C1', 'C2', 'C3']].fillna(0)
Out[12]:
C1 C2 C3
0 1 0 0
1 1 1 0
2 2 0 0
3 1 1 1
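Putting the two steps together and attaching the counts to the original frame could look like the following sketch. The frame construction, the wanted list and the use of reindex (instead of plain column selection) are assumptions added here for illustration, not part of the original answer.

```python
import pandas as pd

# Example frame from the question; missing entries are assumed to be the string 'None'.
df = pd.DataFrame({
    "COL1": ["C1", "C1", "C1", "C1"],
    "COL2": ["None", "C2", "C1", "C2"],
    "COL3": ["None", "None", "None", "C3"],
})
wanted = ["C1", "C2", "C3"]  # hypothetical list of values to count

counts = (
    df.apply(pd.Series.value_counts, axis=1)  # one row of counts per input row
      .reindex(columns=wanted)                # keep only the wanted values
      .fillna(0)                              # absent values count as zero
      .astype(int)
)

# Append the count columns to the original frame.
result = df.join(counts)
print(result)
```

Using reindex rather than [['C1', 'C2', 'C3']] is a design choice: it still yields a zero-filled column when one of the wanted values never occurs in the data, whereas plain selection would raise a KeyError.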
Note: there's an open issue to have a value_counts method directly for a DataFrame (which I think should be introduced by pandas 0.15).
Answered by Zero
Andy's answer is spot on.
I'm adding this answer for the case where the list C1, C2, ..., Cn is huge and we only want to view a subset of the values.
dff = df.copy()
dff['C1']=(df == 'C1').T.sum()
dff['C2']=(df == 'C2').T.sum()
dff['C3']=(df == 'C3').T.sum()
dff
COL1 COL2 COL3 C1 C2 C3
0 C1 None None 1 0 0
1 C1 C2 None 1 1 0
2 C1 C1 None 2 0 0
3 C1 C2 C3 1 1 1
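When the subset is longer than a few values, the repeated assignments can be folded into a loop. The sketch below is one way to do that; the wanted list is a hypothetical subset, and df.eq(...).sum(axis=1) is used here as an equivalent of (df == ...).T.sum().

```python
import pandas as pd

# Example frame from the question; missing entries are assumed to be the string 'None'.
df = pd.DataFrame({
    "COL1": ["C1", "C1", "C1", "C1"],
    "COL2": ["None", "C2", "C1", "C2"],
    "COL3": ["None", "None", "None", "C3"],
})
wanted = ["C1", "C3"]  # hypothetical subset of the values we care about

dff = df.copy()
for value in wanted:
    # Compare against the original frame and sum the boolean matches across each row.
    dff[value] = df.eq(value).sum(axis=1)

print(dff)
```

Only the values in wanted get a column, so nothing is computed for the rest of a potentially large C1...Cn list.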
Answered by YOBEN_S
Applying a Series function to the whole DataFrame with apply usually slows down the whole process (the original answer links to additional reading on this).
# sum(level=0) was removed in pandas 2.0; groupby(level=0).sum() is the equivalent
df.mask(df.eq('None')).stack().str.get_dummies().groupby(level=0).sum()
Out[165]:
C1 C2 C3
0 1 0 0
1 1 1 0
2 2 0 0
3 1 1 1
Or you can use Counter:
from collections import Counter
pd.DataFrame([Counter(x) for x in df.values]).drop(columns='None')
Out[170]:
C1 C2 C3
0 1 NaN NaN
1 1 1.0 NaN
2 2 NaN NaN
3 1 1.0 1.0
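The Counter result above still contains NaN where a value does not occur in a row. A minimal sketch of cleaning that up, assuming the same example frame with 'None' stored as a string, is shown below; the fillna and astype steps are additions for illustration and not part of the original answer.

```python
from collections import Counter

import pandas as pd

# Example frame from the question; missing entries are assumed to be the string 'None'.
df = pd.DataFrame({
    "COL1": ["C1", "C1", "C1", "C1"],
    "COL2": ["None", "C2", "C1", "C2"],
    "COL3": ["None", "None", "None", "C3"],
})

# One Counter per row; pandas turns the list of mappings into one column per key.
rows = [Counter(row) for row in df.to_numpy()]
counts = (
    pd.DataFrame(rows)
      .drop(columns="None", errors="ignore")  # discard the placeholder value if present
      .fillna(0)                              # values missing from a row count as zero
      .astype(int)
)
print(counts)
```

Passing errors="ignore" keeps the call from failing on data that happens to contain no 'None' entries at all.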

