What is the most efficient way of counting occurrences in pandas?
Disclaimer: This page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse it, you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow
原文地址: http://stackoverflow.com/questions/20076195/
Asked by tipanverella
I have a large (about 12M rows) DataFrame df with, say:
df.columns = ['word','documents','frequency']
So the following ran in a timely fashion:
word_grouping = df[['word','frequency']].groupby('word')
MaxFrequency_perWord = word_grouping[['frequency']].max().reset_index()
MaxFrequency_perWord.columns = ['word','MaxFrequency']
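The grouping above can be sketched end-to-end on a tiny synthetic frame (the column names follow the question; the data itself is invented for illustration):

```python
import pandas as pd

# Invented sample data mirroring the question's columns
df = pd.DataFrame({
    'word':      ['apple', 'apple', 'banana'],
    'documents': [1, 2, 3],
    'frequency': [5, 9, 2],
})

# Group by word, then take the per-group maximum frequency
word_grouping = df[['word', 'frequency']].groupby('word')
MaxFrequency_perWord = word_grouping[['frequency']].max().reset_index()
MaxFrequency_perWord.columns = ['word', 'MaxFrequency']
```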
However, this is taking an unexpectedly long time to run:
Occurrences_of_Words = word_grouping[['word']].count().reset_index()
What am I doing wrong here? Is there a better way to count occurrences in a large dataframe?
df.word.describe()
ran pretty well, so I really did not expect this Occurrences_of_Words dataframe to take very long to build.
PS: If the answer is obvious and you feel the need to penalize me for asking this question, please include the answer as well. Thank you.
Accepted answer by Dan Allan
I think df['word'].value_counts() should serve. By skipping the groupby machinery, you'll save some time. I'm not sure why count should be much slower than max. Both take some time to avoid missing values. (Compare with size.)
In any case, value_counts has been specifically optimized to handle object dtype, like your words, so I doubt you'll do much better than that.
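A minimal sketch of the suggested value_counts approach (the word list here is invented for illustration):

```python
import pandas as pd

# value_counts on an object (string) column, as the answer suggests;
# the result is sorted by count, descending
words = pd.Series(['apple', 'banana', 'apple', 'cherry', 'apple'])
counts = words.value_counts()
```

The index of the result holds the distinct words and the values hold their counts, so the most frequent word is counts.index[0].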
Answered by Dwaraka Uttarkar
Answered by user2314737
Just an addition to the previous answers. Let's not forget that when dealing with real data there might be null values, so it's useful to also include those in the counting by using the option dropna=False (the default is True).
An example:
>>> df['Embarked'].value_counts(dropna=False)
S 644
C 168
Q 77
NaN 2
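The same dropna behavior can be sketched on a small invented series with a missing value (the 'Embarked'-style labels are just for illustration):

```python
import pandas as pd

# Invented column with one missing value
s = pd.Series(['S', 'C', 'S', None, 'Q', 'S'])

with_nan = s.value_counts(dropna=False)  # NaN gets its own row in the result
without  = s.value_counts()              # NaN is silently dropped (default)
```

With dropna=False the counts sum to the full length of the column, which is a handy sanity check on real data.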
Answered by kztd
I came here just looking to find out whether "value" was present in df.column; this worked for me:
"value" in df["Column"].values
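For completeness, a runnable sketch of this membership check (the frame and column name are invented for illustration):

```python
import pandas as pd

# Invented frame; .values exposes the underlying array,
# so `in` tests membership in the column's data rather than its index
df = pd.DataFrame({'Column': ['alpha', 'beta', 'gamma']})

present = 'beta' in df['Column'].values
absent  = 'delta' in df['Column'].values
```

Note that plain `'beta' in df['Column']` would check the index labels, not the values, which is why the `.values` accessor matters here.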

