Disclaimer: this content comes from a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/27889873/
Clustering text documents using scikit-learn kmeans in Python
Asked by Nabila Shahid
I need to implement scikit-learn's KMeans for clustering text documents. The example code works fine as it is, but it takes 20newsgroups data as input. I want to use the same code for clustering a list of documents as shown below:
documents = ["Human machine interface for lab abc computer applications",
"A survey of user opinion of computer system response time",
"The EPS user interface management system",
"System and human system engineering testing of EPS",
"Relation of user perceived response time to error measurement",
"The generation of random binary unordered trees",
"The intersection graph of paths in trees",
"Graph minors IV Widths of trees and well quasi ordering",
"Graph minors A survey"]
What changes do I need to make in the kMeans example code to use this list as input? (Simply setting 'dataset = documents' doesn't work.)
Accepted answer by elyase
This is a simpler example:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
documents = ["Human machine interface for lab abc computer applications",
"A survey of user opinion of computer system response time",
"The EPS user interface management system",
"System and human system engineering testing of EPS",
"Relation of user perceived response time to error measurement",
"The generation of random binary unordered trees",
"The intersection graph of paths in trees",
"Graph minors IV Widths of trees and well quasi ordering",
"Graph minors A survey"]
# vectorize the text, i.e. convert the strings to numeric features
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(documents)
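# a quick sanity check (not in the original answer): X is a sparse matrix
# with one row per document and one column per vocabulary term
print(X.shape)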
# cluster the documents
true_k = 2
model = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1)
model.fit(X)
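# a minimal follow-up sketch (not in the original answer): a fitted model can
# assign new documents to the learned clusters; new_doc is a hypothetical example
new_doc = ["human computer interaction and user response time"]
print(model.predict(vectorizer.transform(new_doc)))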
# print the top terms per cluster
print("Top terms per cluster:")
order_centroids = model.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names_out()  # use get_feature_names() on scikit-learn < 1.0
for i in range(true_k):
    print("Cluster %d:" % i, end='')
    for ind in order_centroids[i, :10]:
        print(' %s' % terms[ind], end='')
    print()
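The adjusted_rand_score imported above goes unused; if you have ground-truth labels for the documents, it quantifies how well the clustering matches them. A minimal sketch, assuming a hypothetical two-topic labelling of the nine documents (the first five about human-computer interaction, the last four about graphs and trees):

true_labels = [0, 0, 0, 0, 0, 1, 1, 1, 1]  # hypothetical ground truth, for illustration only
print(adjusted_rand_score(true_labels, model.labels_))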
If you want a more visual idea of what this looks like, see this answer.
Answered by Kathirmani Sukumar
I found this article very useful for document clustering using K-Means: http://brandonrose.org/clustering.
For understanding the algorithm, you can check out this article as well: https://datasciencelab.wordpress.com/2013/12/12/clustering-with-k-means-in-python/
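For intuition, here is a minimal NumPy sketch of the two alternating steps k-means repeats (assign each point to its nearest centroid, then move each centroid to the mean of its assigned points). This is an illustrative toy, not scikit-learn's implementation:

import numpy as np

def toy_kmeans(points, k, n_iter=100, seed=0):
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    # initialise centroids with k randomly chosen points
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # assignment step: index of the nearest centroid for each point
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update step: recompute each centroid as the mean of its points
        for j in range(k):
            if (labels == j).any():
                centroids[j] = points[labels == j].mean(axis=0)
    return labels, centroids

# toy usage: two well-separated blobs should get two different labels
pts = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
labels, centroids = toy_kmeans(pts, k=2)
print(labels)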