Count of most popular words in a pandas Dataframe

Disclaimer: this page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same license and attribute it to the original authors (not me) at StackOverflow. Original question: http://stackoverflow.com/questions/40206249/


Count of most popular words in a pandas Dataframe

python-3.x, csv, pandas, dataframe

Asked by MaxU

I use a csv data file containing movie data. In this dataset there is a column named plot_keywords. I want to find the 10 or 20 most popular keywords, the number of times they show up, and plot them in a bar chart. To be more specific, I copied 2 instances as they show up when I print the dataframe:


9 blood|book|love|potion|professor


18 blackbeard|captain|pirate|revenge|soldier


I open the csv file as a pandas DataFrame. Here is the code I have so far:


import pandas as pd

data = pd.read_csv('data.csv')
# splits on whitespace only, so a '|'-joined string like
# 'blood|book|love|potion|professor' is counted as a single "word"
pd.Series(' '.join(data['plot_keywords']).lower().split()).value_counts()[:10]

None of the other posts have helped me so far. Thanks in advance.


https://www.kaggle.com/deepmatrix/imdb-5000-movie-dataset/kernels


Answered by MaxU

Here is an NLTK solution, which ignores English stopwords (for example: in, on, of, the, etc.):


import matplotlib
import matplotlib.pyplot as plt
import pandas as pd
import nltk

top_N = 10

df = pd.read_csv(r'/path/to/imdb-5000-movie-dataset.zip',
                 usecols=['movie_title','plot_keywords'])

# join all keywords into one lowercase string, turning the '|' separators into spaces
txt = df.plot_keywords.str.lower().str.replace(r'\|', ' ', regex=True).str.cat(sep=' ')
words = nltk.tokenize.word_tokenize(txt)
word_dist = nltk.FreqDist(words)

stopwords = nltk.corpus.stopwords.words('english')
words_except_stop_dist = nltk.FreqDist(w for w in words if w not in stopwords)

print('All frequencies, including STOPWORDS:')
print('=' * 60)
rslt = pd.DataFrame(word_dist.most_common(top_N),
                    columns=['Word', 'Frequency'])
print(rslt)
print('=' * 60)

# the same counts without stopwords, indexed by word for plotting
rslt = pd.DataFrame(words_except_stop_dist.most_common(top_N),
                    columns=['Word', 'Frequency']).set_index('Word')

matplotlib.style.use('ggplot')

rslt.plot.bar(rot=0)
plt.show()
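
If NLTK has never been used on the machine before, word_tokenize and the stopword list need their data packages, which can be fetched with a one-time download:

import nltk
nltk.download('punkt')      # tokenizer models used by nltk.tokenize.word_tokenize
nltk.download('stopwords')  # the English stopword corpus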

Output:


All frequencies, including STOPWORDS:
============================================================
     Word  Frequency
0      in        339
1  female        301
2   title        289
3  nudity        259
4    love        248
5      on        240
6  school        238
7  friend        228
8      of        222
9     the        212
============================================================

[bar chart of the top 10 word frequencies, excluding stopwords]

Here is a pandas solution, which uses the stopword list from the NLTK module:


from collections import Counter
import matplotlib.pyplot as plt
import pandas as pd
import nltk

top_N = 10

df = pd.read_csv(r'/path/to/imdb-5000-movie-dataset.zip',
                 usecols=['movie_title','plot_keywords'])

stopwords = nltk.corpus.stopwords.words('english')
# RegEx that matches any whole stopword
RE_stopwords = r'\b(?:{})\b'.format('|'.join(stopwords))
# replace '|' --> ' ' and drop all stopwords
words = (df.plot_keywords
           .str.lower()
           .replace([r'\|', RE_stopwords], [' ', ''], regex=True)
           .str.cat(sep=' ')
           .split()
)

# generate a DataFrame out of the Counter
rslt = pd.DataFrame(Counter(words).most_common(top_N),
                    columns=['Word', 'Frequency']).set_index('Word')
print(rslt)

# plot
rslt.plot.bar(rot=0, figsize=(16, 10), width=0.8)
plt.show()
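
Because plot_keywords is a '|'-delimited list of (possibly multi-word) keywords, it can also make sense to count whole keywords rather than individual words, so that a phrase like "female nudity" stays intact. A minimal sketch of that variant, assuming pandas 0.25+ (for Series.explode) and the data.csv layout from the question:

import pandas as pd

df = pd.read_csv('data.csv', usecols=['plot_keywords'])

top_keywords = (df['plot_keywords']
                  .dropna()          # skip movies without keywords
                  .str.lower()
                  .str.split('|')    # one list of keywords per movie
                  .explode()         # one keyword per row
                  .value_counts()
                  .head(10))
print(top_keywords)
top_keywords.plot.bar(rot=45)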

Output:


        Frequency
Word
female        301
title         289
nudity        259
love          248
school        238
friend        228
police        210
male          205
death         195
sex           192