Python 绘制文档 tfidf 二维图

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/28160335/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-08-19 02:49:28  来源:igfitidea点击:

plot a document tfidf 2D graph

pythonnumpyscipyscikit-learnk-means

提问by jxn

I would like to plot a 2d graph with the x-axis as term and y-axis as TFIDF score (or document id) for my list of sentences. I used scikit learn's fit_transform() to get the scipy matrix but i do not know how to use that matrix to plot the graph. I am trying to get a plot to see how well my sentences can be classified using kmeans.

我想为我的句子列表绘制一个二维图,其中 x 轴作为术语,y 轴作为 TFIDF 分数(或文档 ID)。我使用 scikit learn 的 fit_transform() 来获取 scipy 矩阵,但我不知道如何使用该矩阵来绘制图形。我正在尝试绘制一个图,以查看使用 kmeans 对我的句子进行分类的程度。

Here is the output of fit_transform(sentence_list):

这是输出fit_transform(sentence_list)

(document id, term number) tfidf score

(文档 ID,术语编号)tfidf 分数

(0, 1023)   0.209291711271
(0, 924)    0.174405532933
(0, 914)    0.174405532933
(0, 821)    0.15579574484
(0, 770)    0.174405532933
(0, 763)    0.159719994016
(0, 689)    0.135518787598

Here is my code:

这是我的代码:

sentence_list=["Hi how are you", "Good morning" ...]
vectorizer=TfidfVectorizer(min_df=1, stop_words='english', decode_error='ignore')
vectorized=vectorizer.fit_transform(sentence_list)
num_samples, num_features=vectorized.shape
print "num_samples:  %d, num_features: %d" %(num_samples,num_features)
num_clusters=10
km=KMeans(n_clusters=num_clusters, init='k-means++',n_init=10, verbose=1)
km.fit(vectorized)
PRINT km.labels_   # Returns a list of clusters ranging 0 to 10 

Thanks,

谢谢,

采纳答案by elyase

When you use Bag of Words, each of your sentences gets represented in a high dimensional space of length equal to the vocabulary. If you want to represent this in 2D you need to reduce the dimension, for example using PCA with two components:

当您使用 Bag of Words 时,您的每个句子都会在长度等于词汇量的高维空间中表示。如果要在 2D 中表示它,则需要降低维度,例如使用具有两个分量的 PCA:

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
import matplotlib.pyplot as plt

newsgroups_train = fetch_20newsgroups(subset='train', 
                                      categories=['alt.atheism', 'sci.space'])
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
])        
X = pipeline.fit_transform(newsgroups_train.data).todense()

pca = PCA(n_components=2).fit(X)
data2D = pca.transform(X)
plt.scatter(data2D[:,0], data2D[:,1], c=data.target)
plt.show()              #not required if using ipython notebook

data2d

数据2d

Now you can for example calculate and plot the cluster enters on this data:

例如,现在您可以计算并绘制集群在此数据上的输入:

from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=2).fit(X)
centers2D = pca.transform(kmeans.cluster_centers_)

plt.hold(True)
plt.scatter(centers2D[:,0], centers2D[:,1], 
            marker='x', s=200, linewidths=3, c='r')
plt.show()              #not required if using ipython notebook

enter image description here

在此处输入图片说明

回答by beto

Just assign a variable to the labels and use that to denote color. ex km = Kmeans().fit(X) clusters = km.labels_.tolist()then c=clusters

只需为标签分配一个变量并使用它来表示颜色。前 km = Kmeans().fit(X) clusters = km.labels_.tolist()然后c=clusters