Count occurrences of items in Series in each row of a DataFrame

Notice: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/24516361/

Tags: python, pandas, apply

Asked by sriramn

I have a pandas.DataFrame that looks like this:

COL1    COL2    COL3
C1      None    None
C1      C2      None
C1      C1      None
C1      C2      C3

For each row in this DataFrame, I would like to count the occurrences of each of C1, C2 and C3 and append the counts as new columns. For instance, the first row has 1 C1, 0 C2 and 0 C3. The final DataFrame should look like this:

COL1    COL2    COL3    C1  C2  C3
C1      None    None    1   0   0
C1      C2      None    1   1   0
C1      C1      None    2   0   0
C1      C2      C3      1   1   1

So far, I have created a Series with C1, C2 and C3 as its values. One way to count the occurrences is to loop over the rows and columns of the DataFrame, then over this Series, and increment a counter on each match. But is there an apply-based approach that achieves this more compactly?

Answered by Andy Hayden

You could apply value_counts:

In [11]: df.apply(pd.Series.value_counts, axis=1)
Out[11]: 
   C1  C2  C3  None
0   1 NaN NaN     2
1   1   1 NaN     1
2   2 NaN NaN     1
3   1   1   1   NaN

So you can fill the NaN values and append just the columns you want:

In [12]: df.apply(pd.Series.value_counts, axis=1)[['C1', 'C2', 'C3']].fillna(0)
Out[12]: 
   C1  C2  C3
0   1   0   0
1   1   1   0
2   2   0   0
3   1   1   1

Note: there's an open issue to have a value_counts method directly for a DataFrame (which I think should be introduced by pandas 0.15).
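To make the steps above reproducible end to end, here is a minimal sketch that rebuilds the example frame and joins the counts back onto it. It assumes the empty cells hold the literal string 'None' (the 'None' column in the output above suggests this); the names wanted, counts and result are only illustrative. It uses reindex instead of column selection so that a value which never occurs still gets a column of zeros.

```python
import pandas as pd

# Assumption: missing cells are the string 'None', not Python's None.
df = pd.DataFrame({
    'COL1': ['C1', 'C1', 'C1', 'C1'],
    'COL2': ['None', 'C2', 'C1', 'C2'],
    'COL3': ['None', 'None', 'None', 'C3'],
})

wanted = ['C1', 'C2', 'C3']  # illustrative name: the values to count

counts = (
    df.apply(pd.Series.value_counts, axis=1)  # one column per distinct value
      .reindex(columns=wanted)                # keep only the wanted values
      .fillna(0)                              # absent value means a count of 0
      .astype(int)
)

# Append the count columns to the original frame.
result = df.join(counts)
print(result)
```

The astype(int) call is there because the NaN placeholders force the counts to float before they are filled.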

Answered by Zero

Andy's answer is spot on.

I'm adding this answer for the case where the list C1, C2, ..., Cn is huge and we only want counts for a subset of the values.

dff = df.copy()
dff['C1']=(df == 'C1').T.sum()
dff['C2']=(df == 'C2').T.sum()
dff['C3']=(df == 'C3').T.sum()
dff
  COL1  COL2  COL3  C1  C2  C3
0   C1  None  None   1   0   0
1   C1    C2  None   1   1   0
2   C1    C1  None   2   0   0
3   C1    C2    C3   1   1   1
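When the subset has more than a few values, the same comparison can be written as a loop. This is a small sketch rather than part of the original answer: the list wanted is a hypothetical subset, and the empty cells are assumed to be the string 'None'. It uses sum(axis=1), which gives the same row totals as .T.sum().

```python
import pandas as pd

# Assumption: missing cells are the string 'None'.
df = pd.DataFrame({
    'COL1': ['C1', 'C1', 'C1', 'C1'],
    'COL2': ['None', 'C2', 'C1', 'C2'],
    'COL3': ['None', 'None', 'None', 'C3'],
})

wanted = ['C1', 'C3']  # hypothetical subset of values to count

dff = df.copy()
for value in wanted:
    # Compare against the original frame so that the count columns
    # added to dff never take part in later comparisons.
    dff[value] = df.eq(value).sum(axis=1)

print(dff)
```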

Answered by YOBEN_S

Using apply with a Series function on a whole DataFrame usually slows the process down (the original answer links to additional reading on this). A vectorized alternative:

# sum(level=0) was removed in pandas 2.0; groupby(level=0).sum() is equivalent
df.mask(df.eq('None')).stack().str.get_dummies().groupby(level=0).sum()
Out[165]: 
   C1  C2  C3
0   1   0   0
1   1   1   0
2   2   0   0
3   1   1   1
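The one-liner above packs several steps together, so here is a step-by-step sketch of the same idea. It assumes the empty cells are the string 'None'; the intermediate names stacked, dummies and counts are only illustrative. The explicit dropna() is an addition that removes the masked cells directly instead of relying on the default behaviour of stack().

```python
import pandas as pd

# Assumption: missing cells are the string 'None'.
df = pd.DataFrame({
    'COL1': ['C1', 'C1', 'C1', 'C1'],
    'COL2': ['None', 'C2', 'C1', 'C2'],
    'COL3': ['None', 'None', 'None', 'C3'],
})

# 1. Replace the placeholder with NaN and reshape to one value per
#    (row, column) pair; dropna() discards the masked cells explicitly.
stacked = df.mask(df.eq('None')).stack().dropna()

# 2. Build one 0/1 indicator column per distinct remaining value.
dummies = stacked.str.get_dummies()

# 3. Add the indicators up per original row (index level 0).
counts = dummies.groupby(level=0).sum()
print(counts)
```

A row that contains only placeholders would disappear from counts after step 1; reindexing with counts.reindex(df.index, fill_value=0) would bring it back.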


Or you can use Counter:

from collections import Counter

# drop(columns=...) replaces the positional axis argument, which pandas 2.0 removed
pd.DataFrame([Counter(x) for x in df.values]).drop(columns='None')
Out[170]: 
   C1   C2   C3
0   1  NaN  NaN
1   1  1.0  NaN
2   2  NaN  NaN
3   1  1.0  1.0
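The Counter output above still holds NaN where a value does not occur in a row. A possible way to get integer counts that match the requested result is sketched below, again assuming the empty cells are the string 'None'. The fillna, astype and errors='ignore' parts are additions to the original answer.

```python
from collections import Counter

import pandas as pd

# Assumption: missing cells are the string 'None'.
df = pd.DataFrame({
    'COL1': ['C1', 'C1', 'C1', 'C1'],
    'COL2': ['None', 'C2', 'C1', 'C2'],
    'COL3': ['None', 'None', 'None', 'C3'],
})

counts = (
    # One Counter per row; its keys become the columns of the new frame.
    pd.DataFrame([Counter(row) for row in df.to_numpy()], index=df.index)
      .drop(columns='None', errors='ignore')  # no error if 'None' never occurs
      .fillna(0)                              # value absent from a row means 0
      .astype(int)
)
print(counts)
```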