Notice: this page is derived from a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/44783360/
Date: 2020-09-14 03:53:41  Source: igfitidea

K-means clustering on 3 dimensions with sklearn

Tags: python, pandas, multidimensional-array, scikit-learn, k-means

Asked by P Gresh

I'm trying to cluster data using lat/lon as the X/Y axes and DaysUntilDueDate as the Z axis. I also want to retain the index column ('PM') so that I can create a schedule later using this clustering analysis. The tutorial I found here has been wonderful, but I don't know whether it takes the Z axis into account, and my poking around hasn't resulted in anything but errors. I think the essential point in the code is the arguments to the iloc bit of this line:

kmeans_model = KMeans(n_clusters=k, random_state=1).fit(A.iloc[:, :])

I tried changing this part to iloc[1:4] (to work only on columns 1-3), but that resulted in the following error:

ValueError: n_samples=3 should be >= n_clusters=4

So my question is: How can I set up my code to run clustering analysis on 3-dimensions while retaining the index ('PM') column?

Here's my Python file, thanks for your help:

from sklearn.cluster import KMeans
import csv
import pandas as pd

# Import csv file with data in following columns:
#    [PM (index)] [Longitude] [Latitude] [DaysUntilDueDate]

df = pd.read_csv('point_data_test.csv',index_col=['PM'])

numProjects = len(df)
K = numProjects // 3    # Around three projects can be worked per day


print("Number of projects: ", numProjects)
print("K-clusters: ", K)

for k in range(1, K):
    # Create a kmeans model on our data, using k clusters.
    #   Random_state helps ensure that the algorithm returns the
    #   same results each time.
    kmeans_model = KMeans(n_clusters=k, random_state=1).fit(df.iloc[:, :])

    # These are our fitted labels for clusters --
    #   the first cluster has label 0, and the second has label 1.
    labels = kmeans_model.labels_

    # Sum of squared distances of samples to their closest cluster center
    SSE = kmeans_model.inertia_

    print("k:", k, " SSE:", SSE)

# Add labels to df
df['Labels'] = labels
#print(df)

df.to_csv('test_KMeans_out.csv')

Answered by Grr

It seems the issue is with the syntax of iloc[1:4].

From your question it appears you changed:

kmeans_model = KMeans(n_clusters=k, random_state=1).fit(df.iloc[:, :])

to:

kmeans_model = KMeans(n_clusters=k, random_state=1).fit(df.iloc[1:4])

It seems to me that either you have a typo or you don't understand how iloc works. So I will explain.

You should start by reading Indexing and Selecting Data from the pandas documentation.

But in short, .iloc is an integer-based indexing method for selecting data by position.

Let's say you have the dataframe:

 A    B    C
 1    2    3
 4    5    6
 7    8    9
10   11   12  

The iloc call in the example you provided, iloc[:, :], selects all rows and all columns and so returns the entire dataframe. In case you aren't familiar with Python's slice notation, take a look at the question Explain slice notation or the docs for An Informal Introduction to Python. The example you said caused your error, iloc[1:4], selects the rows at positions 1-3. This would result in:

 A    B    C
 4    5    6
 7    8    9
10   11   12 
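
The two selections can be checked directly on the example frame. This is a minimal sketch; the variable names (`example`, `everything`, `rows_only`) are illustrative and not part of the original code.

```python
import pandas as pd

# The example frame from above (the name `example` is illustrative).
example = pd.DataFrame(
    {"A": [1, 4, 7, 10], "B": [2, 5, 8, 11], "C": [3, 6, 9, 12]}
)

# One slice per axis: every row and every column.
everything = example.iloc[:, :]

# A single slice applies to rows only: positions 1, 2 and 3.
rows_only = example.iloc[1:4]

print(everything.shape)  # (4, 3)
print(rows_only.shape)   # (3, 3)
print(rows_only)
```

The shape of the selection is a quick way to see whether a slice was applied to rows or to columns.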

Now, if you think about what you are trying to do and the error you received, you will realize that you have selected fewer samples from your data than the number of clusters you are asking for: 3 samples (rows 1, 2, 3), but you're telling KMeans to find 4 clusters, which just isn't possible.
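
The failure can be reproduced in isolation. The sketch below uses three made-up samples (the values are illustrative only) and asks for four clusters; the exact wording of the message may differ between scikit-learn versions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Three made-up samples with three features each (illustrative values only).
samples = np.array([[4.0, 5.0, 6.0], [7.0, 8.0, 9.0], [10.0, 11.0, 12.0]])

try:
    # Four clusters cannot be formed from three samples.
    KMeans(n_clusters=4, n_init=10, random_state=1).fit(samples)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```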

What you really intended to do (as I understand it) was to select all rows and the columns 1-3 that correspond to your lat, lng, and z values. To do this, just add a colon as the first argument to iloc, like so:

df.iloc[:, 1:4]

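Now you will have selected all of your samples and the columns at positions 1, 2, and 3. One caveat: because PM was loaded with index_col, it is the index and is not counted among the positional columns. Assuming the CSV holds only the four columns listed in the question, Longitude, Latitude, and DaysUntilDueDate sit at positions 0-2, so df.iloc[:, 0:3] (or the original df.iloc[:, :]) covers all three, while df.iloc[:, 1:4] would drop Longitude. With the right columns selected and enough samples, KMeans should work as you intended.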
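
As a rough end-to-end sketch, the snippet below clusters on three columns while keeping PM as the index. The data is made up to stand in for point_data_test.csv, and `n_clusters=2` is chosen only for illustration; neither comes from the question.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Made-up stand-in for point_data_test.csv; the values are illustrative only.
df = pd.DataFrame(
    {
        "PM": ["p1", "p2", "p3", "p4", "p5", "p6"],
        "Longitude": [-122.4, -122.5, -121.9, -121.8, -122.0, -122.3],
        "Latitude": [37.7, 37.8, 37.3, 37.4, 37.5, 37.6],
        "DaysUntilDueDate": [3, 5, 30, 28, 14, 4],
    }
).set_index("PM")  # PM becomes the index, much like index_col in read_csv

# The index is not a positional column, so the features sit at positions 0-2.
features = df.iloc[:, 0:3]

# With PM as the index, a 1:4 slice would start at Latitude and skip Longitude.
shifted_columns = list(df.iloc[:, 1:4].columns)

model = KMeans(n_clusters=2, n_init=10, random_state=1).fit(features)

# labels_ follows the row order of `features`, so it lines up with the PM index.
df["Labels"] = model.labels_
print(df)
```

Because the labels are assigned back onto the same frame, each PM value stays attached to its cluster label when the frame is written out with to_csv.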