Count most frequent 100 words from sentences in Dataframe Pandas

Notice: This page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must follow the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/29903025/



Tags: python, pandas

Asked by swati saoji

I have text reviews in one column of a Pandas DataFrame, and I want to count the N most frequent words with their frequency counts (across the whole column, not within a single cell). One approach is to count the words with a Counter by iterating through each row. Is there a better alternative?


Representative data.


0    a heartening tale of small victories and endu
1    no sophomore slump for director sam mendes  w
2    if you are an actor who can relate to the sea
3    it's this memory-as-identity obviation that g
4    boyd's screenplay ( co-written with guardian

Answered by Joran Beasley

from collections import Counter

# Join every row into one string, split on whitespace, and count the tokens.
Counter(" ".join(df["text"]).split()).most_common(100)

I'm pretty sure this would give you what you want (you might have to remove some non-words from the Counter result before calling most_common).

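One way to drop those non-word tokens up front is to tokenize with a regex instead of str.split(). A minimal sketch, assuming a hypothetical df with a text column like the asker's:

import re
from collections import Counter

import pandas as pd

# Hypothetical stand-in for the asker's reviews column.
df = pd.DataFrame({"text": [
    "a heartening tale of small victories and endu",
    "boyd's screenplay ( co-written with guardian",
]})

# Keep only alphabetic runs (plus apostrophes), so stray punctuation
# such as "(" never reaches the Counter in the first place.
tokens = re.findall(r"[a-z']+", " ".join(df["text"]).lower())
print(Counter(tokens).most_common(100))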

Answered by Zero

Along with @Joran's solution, you could also use series.value_counts for large amounts of text/rows:


pd.Series(' '.join(df['text']).lower().split()).value_counts()[:100]
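
Note that value_counts returns a Series indexed by word, so if you need (word, count) pairs like most_common gives, the conversion is easy. A small sketch with a hypothetical frame:

import pandas as pd

# Hypothetical stand-in for the asker's reviews column.
df = pd.DataFrame({"text": ["a heartening tale", "a tale of two cities"]})

counts = pd.Series(" ".join(df["text"]).lower().split()).value_counts()
top = counts[:100]          # Series: word -> frequency, sorted descending
print(list(top.items()))    # [('a', 2), ('tale', 2), ...] like most_common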

You would find from the benchmarks that series.value_counts seems about twice (2x) as fast as the Counter method.


For a movie-reviews dataset of 3,000 rows, totaling 400K characters and 70K words:


In [448]: %timeit Counter(" ".join(df.text).lower().split()).most_common(100)
10 loops, best of 3: 44.2 ms per loop

In [449]: %timeit pd.Series(' '.join(df.text).lower().split()).value_counts()[:100]
10 loops, best of 3: 27.1 ms per loop
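
Outside IPython, the comparison can be reproduced with the standard timeit module. A minimal sketch, substituting a synthetic repeated-row frame for the movie-reviews dataset (actual numbers will vary with the corpus and machine):

import timeit
from collections import Counter

import pandas as pd

# Synthetic stand-in for the 3,000-row reviews dataset.
df = pd.DataFrame({"text": ["a heartening tale of small victories"] * 3000})

counter_ms = timeit.timeit(
    lambda: Counter(" ".join(df.text).lower().split()).most_common(100),
    number=10) * 100  # total seconds for 10 runs -> average ms per run
vc_ms = timeit.timeit(
    lambda: pd.Series(" ".join(df.text).lower().split()).value_counts()[:100],
    number=10) * 100
print(f"Counter:      {counter_ms:.1f} ms")
print(f"value_counts: {vc_ms:.1f} ms")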