Word frequency with pandas and matplotlib

Question

Asked by DevEx

How can I plot a word frequency histogram (for the author column) from a CSV file using pandas and matplotlib? My CSV looks like this: id, author, title, language. Sometimes the author column contains more than one author, separated by spaces.

import pandas as pd

file = 'c:/books.csv'
df = pd.read_csv(file)
print(df['author'])

Answer 1

Answered by Dr. Jan-Philip Gehrcke

Use collections.Counter to create the histogram data, and follow the linked example for plotting it, i.e.:

from collections import Counter
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Read CSV file, get author names and counts.
df = pd.read_csv("books.csv", index_col="id")
counter = Counter(df['author'])
author_names = list(counter.keys())
author_counts = list(counter.values())

# Plot the counts using matplotlib bar(). Bars are centered on their
# x positions by default (matplotlib 2.0+), so the ticks go there too.
indexes = np.arange(len(author_names))
width = 0.7
plt.bar(indexes, author_counts, width)
plt.xticks(indexes, author_names)
plt.show()

With this test file:

$ cat books.csv 
id,author,title,language
1,peter,t1,de
2,peter,t2,de
3,bob,t3,en
4,bob,t4,de
5,peter,t5,en
6,marianne,t6,jp

the code above creates a bar chart with one bar per author: peter (3), bob (2), and marianne (1).
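Counter.keys() yields names in first-seen order, so the bars are not guaranteed to be sorted by frequency. A minimal sketch of one way to sort them, assuming Counter.most_common() for the ordering and a matplotlib version that accepts string x values in bar() (2.1 or later). The inlined author list, the headless "Agg" backend, and the output file name are assumptions made only to keep the example self-contained:

```python
from collections import Counter

import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, so no display is needed
import matplotlib.pyplot as plt

# Author column of the books.csv test file, inlined for self-containment.
authors = ["peter", "peter", "bob", "bob", "peter", "marianne"]

# most_common() returns (name, count) pairs sorted by descending count.
pairs = Counter(authors).most_common()
author_names = [name for name, _ in pairs]
author_counts = [count for _, count in pairs]

fig, ax = plt.subplots()
ax.bar(author_names, author_counts, width=0.7)  # string x values become tick labels
ax.set_ylabel("Number of books")
fig.savefig("author_counts.png")  # hypothetical output file name
```

In an interactive session, plt.show() can replace the savefig() call.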

Edit:

You added a secondary condition, where the author column might contain multiple space-separated names. The following code handles this:

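from collections import Counter
from itertools import chain
import pandas as pd

# Read CSV file, split each author field on whitespace, and count every name.
df = pd.read_csv("books2.csv", index_col="id")
authors_notflat = [a.split() for a in df['author']]
counter = Counter(chain.from_iterable(authors_notflat))
print(counter)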

For this example:

$ cat books2.csv 
id,author,title,language
1,peter harald,t1,de
2,peter harald,t2,de
3,bob,t3,en
4,bob,t4,de
5,peter,t5,en
6,marianne,t6,jp

it prints:

$ python test.py 
Counter({'peter': 3, 'bob': 2, 'harald': 2, 'marianne': 1})

Note that this code only works because strings are iterable.

This code is essentially free of pandas, except for the CSV-parsing part that creates the DataFrame df. If you need the default plot styling of pandas, there is also a suggestion in the thread mentioned above.
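For comparison, the same multi-author count can also be done with pandas alone. This is a sketch rather than part of the original answer; it assumes pandas 0.25 or later for Series.explode(), and it inlines the books2.csv contents via io.StringIO so that it runs without a file on disk:

```python
import io

import pandas as pd

# Contents of books2.csv, inlined so the sketch is self-contained.
csv_text = """id,author,title,language
1,peter harald,t1,de
2,peter harald,t2,de
3,bob,t3,en
4,bob,t4,de
5,peter,t5,en
6,marianne,t6,jp
"""
df = pd.read_csv(io.StringIO(csv_text), index_col="id")

# Split each cell on whitespace into a list, then give every name its own row.
names = df["author"].str.split().explode()
counts = names.value_counts()
print(counts)

# With matplotlib installed, pandas' plotting wrapper can draw the bars:
# counts.plot(kind="bar")
```

The resulting Series holds the same totals as the Counter: peter 3, harald 2, bob 2, marianne 1.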

Answer 2

Answered by Andy Hayden

You can count up the number of occurrences of each name using value_counts:

In [11]: df['author'].value_counts()
Out[11]: 
peter       3
bob         2
marianne    1
dtype: int64

Series (and DataFrames) have a hist method for drawing histograms:

In [12]: df['author'].value_counts().hist()
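Note that hist() bins the values of the Series it is called on, which here are the counts themselves, so the plot shows how many authors share each count rather than one bar per author. If one bar per author is wanted, a bar plot of the same Series is a possible alternative. A minimal sketch, assuming matplotlib is installed; the inlined CSV text, the "Agg" backend, and the output file name are assumptions for self-containment:

```python
import io

import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, so no display is needed
import pandas as pd

# Contents of the books.csv test file, inlined so the sketch is self-contained.
csv_text = """id,author,title,language
1,peter,t1,de
2,peter,t2,de
3,bob,t3,en
4,bob,t4,de
5,peter,t5,en
6,marianne,t6,jp
"""
df = pd.read_csv(io.StringIO(csv_text), index_col="id")

# value_counts() sorts by descending count; plot one bar per author.
counts = df["author"].value_counts()
ax = counts.plot(kind="bar")
ax.set_ylabel("Number of books")
ax.figure.savefig("author_value_counts.png")  # hypothetical output file name
```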
