Python: Can a pandas DataFrame object be used with sklearn KMeans clustering?

Question

Asked by Dark Knight

My dataset is a pandas DataFrame, and I am clustering it with sklearn.cluster.KMeans:

 from sklearn.cluster import KMeans

 km = KMeans(n_clusters=n_Clusters)
 km.fit(dataset)
 prediction = km.predict(dataset)

This is how I decide which entity belongs to which cluster:

 cluster_fit_dict = {}
 for i in range(len(prediction)):
     cluster_fit_dict[dataset.index[i]] = prediction[i]

This is what the dataset looks like:

 A 1 2 3 4 5 6
 B 2 3 4 5 6 7
 C 1 4 2 7 8 1
 ...

where A, B, C are the index labels.

Is this the correct way of using k-means?

Answer 1

Accepted answer by ogrisel

To check whether your dataframe dataset has suitable content, you can explicitly convert it to a numpy array:

dataset_array = dataset.values
print(dataset_array.dtype)
print(dataset_array)

If the array has a homogeneous numerical dtype (typically numpy.float64), it should be fine for scikit-learn 0.15.2 and later. You might still need to normalize the data, for instance with sklearn.preprocessing.StandardScaler.
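
As a rough illustration of the normalization step, here is a minimal sketch that scales a DataFrame before clustering. The toy data, the column names f1..f6, and the KMeans settings (n_clusters=2, n_init=10, random_state=0) are assumptions made up for this example, not values from the question.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the question's dataset: numeric columns, labelled rows.
dataset = pd.DataFrame(
    [[1, 2, 3, 4, 5, 6], [2, 3, 4, 5, 6, 7], [1, 4, 2, 7, 8, 1], [9, 9, 8, 9, 9, 8]],
    index=["A", "B", "C", "D"],
    columns=["f1", "f2", "f3", "f4", "f5", "f6"],
)

# Rescale every column to zero mean and unit variance; the result is a float64 array.
scaled = StandardScaler().fit_transform(dataset)
print(scaled.dtype)

# Cluster the scaled array; the settings here are arbitrary choices for the toy data.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(scaled)
print(km.labels_)
```

Scaling matters because k-means relies on Euclidean distances, so a column with a much larger range would otherwise dominate the result.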

If your data frame is heterogeneously typed, the dtype of the corresponding numpy array will be object, which is not suitable for scikit-learn. You need to extract a numerical representation for all the relevant features (for instance by extracting dummy variables for categorical features) and drop the columns that are not suitable features (e.g. sample identifiers).
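
One possible way to do this with pandas is sketched below. The frame and its column names (sample_id, color, height, weight) are hypothetical, and pandas.get_dummies is only one of several ways to encode categorical features.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical mixed-type frame: an identifier, a categorical column, two numeric columns.
df = pd.DataFrame({
    "sample_id": ["s1", "s2", "s3", "s4"],
    "color": ["red", "blue", "red", "green"],
    "height": [1.0, 2.5, 1.2, 8.0],
    "weight": [10.0, 12.0, 11.0, 40.0],
})
print(df.values.dtype)  # object: not usable by scikit-learn as-is

# Drop the identifier column, then expand the categorical column into dummy variables.
features = df.drop(columns=["sample_id"])
features = pd.get_dummies(features, columns=["color"], dtype=float)
print(features.values.dtype)  # float64

# The numeric frame can now be clustered; the KMeans settings are arbitrary here.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(features)
print(km.labels_)
```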

Answer 2

Answer by user666

Assuming all the values in the dataframe are numeric,

import pandas
import sklearn.cluster

# Convert DataFrame to a numpy array
mat = dataset.values
# Using sklearn
km = sklearn.cluster.KMeans(n_clusters=5)
km.fit(mat)
# Get cluster assignment labels
labels = km.labels_
# Format results as a DataFrame (one row per entity: index label, cluster label)
results = pandas.DataFrame([dataset.index, labels]).T
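
As a variant of the snippet above, the following sketch passes the DataFrame to KMeans directly and keeps the row labels by building a pandas Series, which also yields the index-to-cluster dictionary from the question. The three-row toy data, the column names, and the KMeans settings are assumptions for illustration only.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical data shaped like the question's example, with rows labelled A, B, C.
dataset = pd.DataFrame(
    [[1, 2, 3, 4, 5, 6], [2, 3, 4, 5, 6, 7], [1, 4, 2, 7, 8, 1]],
    index=["A", "B", "C"],
    columns=["f1", "f2", "f3", "f4", "f5", "f6"],
)

# fit_predict fits the model and returns one cluster label per row, in row order.
km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(dataset)

# Re-attach the index so each entity maps to its cluster.
clusters = pd.Series(labels, index=dataset.index, name="cluster")
cluster_fit_dict = clusters.to_dict()
print(cluster_fit_dict)
```

Because the labels come back in the same order as the rows, pairing them with dataset.index is enough to recover which entity belongs to which cluster.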

Alternatively, you could try KMeans++ for Pandas.
