Count of most popular words in a pandas Dataframe

Disclaimer: this page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same license and attribute it to the original authors (not me) at StackOverflow. Original question: http://stackoverflow.com/questions/40206249/


Count of most popular words in a pandas Dataframe

python-3.x, csv, pandas, dataframe

Asked by MaxU

I use a csv data file containing movie data. In this dataset there is a column named plot_keywords. I want to find the 10 or 20 most popular keywords, the number of times they show up, and plot them in a bar chart. To be more specific, I copied 2 instances as they show up when I print the dataframe:


9 blood|book|love|potion|professor


18 blackbeard|captain|pirate|revenge|soldier


I open the csv file as a pandas DataFrame. Here is the code I have so far:


import pandas as pd

data = pd.read_csv('data.csv')
# splits on whitespace only, so a '|'-joined string like
# 'blood|book|love|potion|professor' is counted as a single "word"
pd.Series(' '.join(data['plot_keywords']).lower().split()).value_counts()[:10]

None of the other posts have helped me so far. Thanks in advance.


https://www.kaggle.com/deepmatrix/imdb-5000-movie-dataset/kernels


Answered by MaxU

Here is an NLTK solution, which ignores English stopwords (for example: in, on, of, the, etc.):


import matplotlib
import matplotlib.pyplot as plt
import pandas as pd
import nltk

top_N = 10

df = pd.read_csv(r'/path/to/imdb-5000-movie-dataset.zip',
                 usecols=['movie_title','plot_keywords'])

# join all keywords into one lowercase string, turning the '|' separators into spaces
txt = df.plot_keywords.str.lower().str.replace(r'\|', ' ', regex=True).str.cat(sep=' ')
words = nltk.tokenize.word_tokenize(txt)
word_dist = nltk.FreqDist(words)

stopwords = nltk.corpus.stopwords.words('english')
words_except_stop_dist = nltk.FreqDist(w for w in words if w not in stopwords)

print('All frequencies, including STOPWORDS:')
print('=' * 60)
rslt = pd.DataFrame(word_dist.most_common(top_N),
                    columns=['Word', 'Frequency'])
print(rslt)
print('=' * 60)

# the same counts without stopwords, indexed by word for plotting
rslt = pd.DataFrame(words_except_stop_dist.most_common(top_N),
                    columns=['Word', 'Frequency']).set_index('Word')

matplotlib.style.use('ggplot')

rslt.plot.bar(rot=0)
plt.show()
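
If NLTK has never been used on the machine before, word_tokenize and the stopword list need their data packages, which can be fetched with a one-time download:

import nltk
nltk.download('punkt')      # tokenizer models used by nltk.tokenize.word_tokenize
nltk.download('stopwords')  # the English stopword corpus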

Output:


All frequencies, including STOPWORDS:
============================================================
     Word  Frequency
0      in        339
1  female        301
2   title        289
3  nudity        259
4    love        248
5      on        240
6  school        238
7  friend        228
8      of        222
9     the        212
============================================================

[bar chart of the top 10 word frequencies, excluding stopwords]

Here is a pandas solution, which uses the stopword list from the NLTK module:


from collections import Counter
import matplotlib.pyplot as plt
import pandas as pd
import nltk

top_N = 10

df = pd.read_csv(r'/path/to/imdb-5000-movie-dataset.zip',
                 usecols=['movie_title','plot_keywords'])

stopwords = nltk.corpus.stopwords.words('english')
# RegEx that matches any whole stopword
RE_stopwords = r'\b(?:{})\b'.format('|'.join(stopwords))
# replace '|' --> ' ' and drop all stopwords
words = (df.plot_keywords
           .str.lower()
           .replace([r'\|', RE_stopwords], [' ', ''], regex=True)
           .str.cat(sep=' ')
           .split()
)

# generate a DataFrame out of the Counter
rslt = pd.DataFrame(Counter(words).most_common(top_N),
                    columns=['Word', 'Frequency']).set_index('Word')
print(rslt)

# plot
rslt.plot.bar(rot=0, figsize=(16, 10), width=0.8)
plt.show()
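
Because plot_keywords is a '|'-delimited list of (possibly multi-word) keywords, it can also make sense to count whole keywords rather than individual words, so that a phrase like "female nudity" stays intact. A minimal sketch of that variant, assuming pandas 0.25+ (for Series.explode) and the data.csv layout from the question:

import pandas as pd

df = pd.read_csv('data.csv', usecols=['plot_keywords'])

top_keywords = (df['plot_keywords']
                  .dropna()          # skip movies without keywords
                  .str.lower()
                  .str.split('|')    # one list of keywords per movie
                  .explode()         # one keyword per row
                  .value_counts()
                  .head(10))
print(top_keywords)
top_keywords.plot.bar(rot=45)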

Output:


        Frequency
Word
female        301
title         289
nudity        259
love          248
school        238
friend        228
police        210
male          205
death         195
sex           192