Python: plotting a 2D graph of document TF-IDF

Question

Asked by jxn

I would like to plot a 2D graph for my list of sentences, with terms on the x-axis and TF-IDF score (or document id) on the y-axis. I used scikit-learn's fit_transform() to get a scipy sparse matrix, but I do not know how to use that matrix to plot the graph. I want the plot to show how well my sentences can be clustered using k-means.

Here is the output of fit_transform(sentence_list):

(document id, term number) tfidf score

(0, 1023)   0.209291711271
(0, 924)    0.174405532933
(0, 914)    0.174405532933
(0, 821)    0.15579574484
(0, 770)    0.174405532933
(0, 763)    0.159719994016
(0, 689)    0.135518787598

Here is my code:

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

sentence_list = ["Hi how are you", "Good morning", ...]  # remaining sentences elided
vectorizer = TfidfVectorizer(min_df=1, stop_words='english', decode_error='ignore')
vectorized = vectorizer.fit_transform(sentence_list)  # sparse matrix: documents x terms
num_samples, num_features = vectorized.shape
print("num_samples:  %d, num_features: %d" % (num_samples, num_features))
num_clusters = 10
km = KMeans(n_clusters=num_clusters, init='k-means++', n_init=10, verbose=1)
km.fit(vectorized)
print(km.labels_)   # cluster label (0 to num_clusters - 1) for each sentence

Thanks,

Answer 1

Accepted answer by elyase

When you use Bag of Words, each of your sentences is represented as a vector in a high-dimensional space whose dimension equals the size of the vocabulary. If you want to represent this in 2D you need to reduce the dimension, for example using PCA with two components:

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
import matplotlib.pyplot as plt

newsgroups_train = fetch_20newsgroups(subset='train',
                                      categories=['alt.atheism', 'sci.space'])
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
])
# Densify the sparse TF-IDF matrix (toarray() returns an ndarray,
# whereas todense() returns an np.matrix)
X = pipeline.fit_transform(newsgroups_train.data).toarray()

# Project the documents onto the first two principal components
pca = PCA(n_components=2).fit(X)
data2D = pca.transform(X)
# Color each point by its true newsgroup label
plt.scatter(data2D[:,0], data2D[:,1], c=newsgroups_train.target)
plt.show()              # not required if using ipython notebook

[Figure: scatter plot of data2D]
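For a large vocabulary, converting the TF-IDF matrix to a dense array can use a lot of memory. One possible alternative, not part of the original answer, is scikit-learn's TruncatedSVD, which accepts the sparse matrix directly. The sketch below uses a small made-up corpus (`docs` is purely illustrative); unlike PCA, TruncatedSVD does not center the data, so the projection will differ somewhat.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, used only for illustration.
docs = [
    "the rocket reached orbit",
    "the shuttle launch was delayed",
    "god and religion were debated",
    "atheism and belief in god",
]

# fit_transform returns a scipy sparse matrix (documents x terms).
X_sparse = TfidfVectorizer().fit_transform(docs)

# TruncatedSVD works on sparse input, so no dense conversion is needed.
svd = TruncatedSVD(n_components=2, random_state=0)
points2D = svd.fit_transform(X_sparse)

print(points2D.shape)  # one 2D point per document
```

The resulting `points2D` array can be passed to plt.scatter in the same way as data2D above.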

Now you can, for example, calculate the cluster centers and plot them on this data:

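from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=2).fit(X)
# Project the cluster centers into the same 2D PCA space
centers2D = pca.transform(kmeans.cluster_centers_)

# Redraw the documents, then overlay the centers on the same axes
# (plt.hold was removed from matplotlib; overlaying is now the default)
plt.scatter(data2D[:,0], data2D[:,1], c=newsgroups_train.target)
plt.scatter(centers2D[:,0], centers2D[:,1],
            marker='x', s=200, linewidths=3, c='r')
plt.show()              # not required if using ipython notebook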

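If what you want is literally the plot described in the question (terms on the x-axis, TF-IDF score on the y-axis) for a single document, a rough sketch could look like the following. It is not from the original answer: the sentences and the choice of `doc_id` are made up for illustration, and it relies on the fitted vectorizer's `vocabulary_` attribute to map term numbers back to terms.

```python
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical sentences, used only for illustration.
sentence_list = [
    "Hi how are you",
    "Good morning",
    "Good morning, how are you today",
]

vectorizer = TfidfVectorizer()
vectorized = vectorizer.fit_transform(sentence_list)

# vocabulary_ maps each term to its column number; sort terms by column.
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

doc_id = 0  # which document to plot (arbitrary choice)
row = vectorized[doc_id].toarray().ravel()   # dense 1-D vector for that document
nonzero = row.nonzero()[0]                   # columns of terms present in it
doc_terms = [terms[j] for j in nonzero]
doc_scores = row[nonzero]

fig, ax = plt.subplots()
ax.bar(doc_terms, doc_scores)
ax.set_xlabel("term")
ax.set_ylabel("TF-IDF score")
ax.set_title("Document %d" % doc_id)
# Call plt.show() (or fig.savefig(...)) to view the figure.
```

This kind of plot is readable only for one document (or a few) at a time, which is why the 2D projection above is usually more useful for judging clusters.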

Answer 2

Answer by beto

Just assign the labels to a variable and use that to set the color. For example: km = KMeans().fit(X), then clusters = km.labels_.tolist(), then pass c=clusters to plt.scatter.
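Put together, the idea might look like the sketch below. The sentences, the choice of two clusters, and the use of PCA for the 2D projection are assumptions made for illustration, not part of the original answer.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical sentences, used only for illustration.
sentences = [
    "the rocket reached orbit",
    "the shuttle launch was delayed",
    "a satellite orbits the earth",
    "good morning how are you",
    "hi how are you today",
    "good evening to you",
]

# Dense TF-IDF matrix: one row per sentence.
X = TfidfVectorizer().fit_transform(sentences).toarray()

# Cluster in the full TF-IDF space and keep one label per sentence.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
clusters = km.labels_.tolist()

# Project to 2D only for plotting, then color each point by its cluster.
data2D = PCA(n_components=2).fit_transform(X)
fig, ax = plt.subplots()
ax.scatter(data2D[:, 0], data2D[:, 1], c=clusters)
# Call plt.show() (or fig.savefig(...)) to view the figure.
```

Coloring by `clusters` shows the k-means assignment, whereas coloring by known labels (as in the accepted answer) shows the ground truth; comparing the two plots gives a visual impression of how well the clustering matches.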
