Note: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/28017091/

Time: 2020-08-19 02:37:12  Source: igfitidea

Will pandas dataframe object work with sklearn kmeans clustering?

Tags: python, pandas, scikit-learn, cluster-analysis, k-means

Asked by Dark Knight

dataset is a pandas DataFrame. This is how I use sklearn.cluster.KMeans:

 from sklearn.cluster import KMeans

 # n_Clusters holds the desired number of clusters
 km = KMeans(n_clusters=n_Clusters)
 km.fit(dataset)
 prediction = km.predict(dataset)

This is how I decide which entity belongs to which cluster:

 # Map each index label to its predicted cluster
 cluster_fit_dict = {}
 for i in range(len(prediction)):
     cluster_fit_dict[dataset.index[i]] = prediction[i]

This is how dataset looks:

 A 1 2 3 4 5 6
 B 2 3 4 5 6 7
 C 1 4 2 7 8 1
 ...

where A, B, C are the index labels.

Is this the correct way of using k-means?

Accepted answer by ogrisel

To check whether your dataframe dataset has suitable content, you can explicitly convert it to a numpy array:

dataset_array = dataset.values
print(dataset_array.dtype)
print(dataset_array)

If the array has a homogeneous numerical dtype (typically numpy.float64), then it should be fine for scikit-learn 0.15.2 and later. You might still need to normalize the data, for instance with sklearn.preprocessing.StandardScaler.
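
As a rough illustration of that normalization step, here is a minimal sketch. The sample values mirror the rows shown in the question, while n_clusters=2, n_init=10 and random_state=0 are arbitrary assumptions made only so the example runs on three rows:

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical data in the shape shown in the question: A, B, C are index labels.
dataset = pd.DataFrame(
    [[1, 2, 3, 4, 5, 6], [2, 3, 4, 5, 6, 7], [1, 4, 2, 7, 8, 1]],
    index=["A", "B", "C"],
)

# Rescale every column to zero mean and unit variance.
scaled = StandardScaler().fit_transform(dataset.values)

# Cluster the scaled array; the parameter values are assumptions for this sketch.
km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(scaled)
```

Scaling matters here because k-means relies on Euclidean distances, so a column with a much larger range would otherwise dominate the result.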

If your data frame is heterogeneously typed, the dtype of the corresponding numpy array will be object, which is not suitable for scikit-learn. You need to extract a numerical representation for all the relevant features (for instance by extracting dummy variables for categorical features) and drop the columns that are not suitable features (e.g. sample identifiers).
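
One possible way to do this with pandas is sketched below. The frame and its column names (sample_id, color, height) are made up for illustration and are not from the original question:

```python
import pandas as pd

# Hypothetical mixed-type frame; all column names here are assumptions.
raw = pd.DataFrame({
    "sample_id": ["s1", "s2", "s3"],
    "color": ["red", "blue", "red"],
    "height": [1.0, 2.5, 3.0],
})
print(raw.values.dtype)  # object, so not usable as-is

# Drop the identifier column, which is not a feature.
features = raw.drop(columns=["sample_id"])

# Replace the categorical column with dummy (one-hot) variables.
features = pd.get_dummies(features, columns=["color"], dtype=float)
print(features.values.dtype)  # float64
```

The resulting features.values array can then be passed to KMeans, possibly after scaling.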

Answer by user666

Assuming all the values in the dataframe are numeric,

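import pandas
import sklearn.cluster

# Convert the DataFrame to a NumPy array
mat = dataset.values
# Cluster with scikit-learn (n_clusters=5 is an example value)
km = sklearn.cluster.KMeans(n_clusters=5)
km.fit(mat)
# Get the cluster assignment label of each row
labels = km.labels_
# Format the results as a DataFrame: one row per (index label, cluster label)
results = pandas.DataFrame([dataset.index, labels]).T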
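
A possible variation on the last step, not part of the original answer, is to keep the labels in a Series that shares the DataFrame's index, which also yields the index-to-cluster dictionary from the question. The sample data, n_clusters=2, n_init=10 and random_state=0 are assumptions for this sketch:

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical data in the shape shown in the question.
dataset = pd.DataFrame(
    [[1, 2, 3, 4, 5, 6], [2, 3, 4, 5, 6, 7], [1, 4, 2, 7, 8, 1]],
    index=["A", "B", "C"],
)

# Fit on the underlying array; parameter values are arbitrary for this sketch.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(dataset.values)

# Align the labels with the original index labels.
clusters = pd.Series(km.labels_, index=dataset.index, name="cluster")

# Same mapping as the loop in the question: index label -> cluster id.
cluster_fit_dict = clusters.to_dict()
```

Because km.labels_ follows the row order of the input, pairing it with dataset.index keeps each label attached to the right row.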

Alternatively, you could try KMeans++ for Pandas.
