What is the most efficient way of counting occurrences in pandas?
Disclaimer: This page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse it, you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow
原文地址: http://stackoverflow.com/questions/20076195/
Asked by tipanverella
I have a large (about 12M rows) DataFrame df with, say:
df.columns = ['word','documents','frequency']
So the following ran in a timely fashion:
word_grouping = df[['word','frequency']].groupby('word')
MaxFrequency_perWord = word_grouping[['frequency']].max().reset_index()
MaxFrequency_perWord.columns = ['word','MaxFrequency']
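The grouping above can be sketched end-to-end on a tiny synthetic frame (the column names follow the question; the data itself is invented for illustration):

```python
import pandas as pd

# Invented sample data mirroring the question's columns
df = pd.DataFrame({
    'word':      ['apple', 'apple', 'banana'],
    'documents': [1, 2, 3],
    'frequency': [5, 9, 2],
})

# Group by word, then take the per-group maximum frequency
word_grouping = df[['word', 'frequency']].groupby('word')
MaxFrequency_perWord = word_grouping[['frequency']].max().reset_index()
MaxFrequency_perWord.columns = ['word', 'MaxFrequency']
```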
However, this is taking an unexpectedly long time to run:
Occurrences_of_Words = word_grouping[['word']].count().reset_index()
What am I doing wrong here? Is there a better way to count occurrences in a large dataframe?
df.word.describe()
ran pretty well, so I really did not expect this Occurrences_of_Words dataframe to take very long to build.
PS: If the answer is obvious and you feel the need to penalize me for asking this question, please include the answer as well. Thank you.
Accepted answer by Dan Allan
I think df['word'].value_counts() should serve. By skipping the groupby machinery, you'll save some time. I'm not sure why count should be much slower than max. Both take some time to avoid missing values. (Compare with size.)
In any case, value_counts has been specifically optimized to handle object dtype, like your words, so I doubt you'll do much better than that.
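A minimal sketch of the suggested value_counts approach (the word list here is invented for illustration):

```python
import pandas as pd

# value_counts on an object (string) column, as the answer suggests;
# the result is sorted by count, descending
words = pd.Series(['apple', 'banana', 'apple', 'cherry', 'apple'])
counts = words.value_counts()
```

The index of the result holds the distinct words and the values hold their counts, so the most frequent word is counts.index[0].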
Answered by Dwaraka Uttarkar
Answered by user2314737
Just an addition to the previous answers. Let's not forget that when dealing with real data there might be null values, so it's useful to also include those in the counting by using the option dropna=False (the default is True).
An example:
>>> df['Embarked'].value_counts(dropna=False)
S 644
C 168
Q 77
NaN 2
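The same dropna behavior can be sketched on a small invented series with a missing value (the 'Embarked'-style labels are just for illustration):

```python
import pandas as pd

# Invented column with one missing value
s = pd.Series(['S', 'C', 'S', None, 'Q', 'S'])

with_nan = s.value_counts(dropna=False)  # NaN gets its own row in the result
without  = s.value_counts()              # NaN is silently dropped (default)
```

With dropna=False the counts sum to the full length of the column, which is a handy sanity check on real data.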
Answered by kztd
I came here just looking to find out whether "value" was present in df.column; this worked for me:
"value" in df["Column"].values
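For completeness, a runnable sketch of this membership check (the frame and column name are invented for illustration):

```python
import pandas as pd

# Invented frame; .values exposes the underlying array,
# so `in` tests membership in the column's data rather than its index
df = pd.DataFrame({'Column': ['alpha', 'beta', 'gamma']})

present = 'beta' in df['Column'].values
absent  = 'delta' in df['Column'].values
```

Note that plain `'beta' in df['Column']` would check the index labels, not the values, which is why the `.values` accessor matters here.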

