Will a pandas DataFrame object work with sklearn KMeans clustering?
Note: this page reproduces a popular Stack Overflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors: Stack Overflow
Original source: http://stackoverflow.com/questions/28017091/
Asked by Dark Knight
dataset is a pandas DataFrame, and I am clustering it with sklearn.cluster.KMeans:
from sklearn.cluster import KMeans
km = KMeans(n_clusters=n_Clusters)
km.fit(dataset)
prediction = km.predict(dataset)
This is how I decide which entity belongs to which cluster:
cluster_fit_dict = {}
for i in range(len(prediction)):
    cluster_fit_dict[dataset.index[i]] = prediction[i]
This is how dataset looks:
A 1 2 3 4 5 6
B 2 3 4 5 6 7
C 1 4 2 7 8 1
...
where A, B, C are the index labels.
Is this the correct way of using k-means?
Accepted answer by ogrisel
To check whether your dataframe dataset has suitable content, you can explicitly convert it to a numpy array:
dataset_array = dataset.values
print(dataset_array.dtype)
print(dataset_array)
If the array has a homogeneous numerical dtype (typically numpy.float64), then it should be fine for scikit-learn 0.15.2 and later. You might still need to normalize the data, for instance with sklearn.preprocessing.StandardScaler.
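As a minimal sketch of that normalization step (not part of the original answer), the scaler and the clusterer can be chained in a pipeline. The toy values mirror the question's example rows; n_clusters=2, n_init=10 and random_state=0 are assumptions chosen only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy data shaped like the question's example (illustrative values).
dataset = pd.DataFrame(
    [[1, 2, 3, 4, 5, 6], [2, 3, 4, 5, 6, 7], [1, 4, 2, 7, 8, 1]],
    index=["A", "B", "C"],
    dtype=np.float64,
)

# Scale every column to zero mean and unit variance, then run k-means.
model = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=2, n_init=10, random_state=0),
)
labels = model.fit_predict(dataset.values)

# Map each index label to the cluster it was assigned to.
cluster_of = dict(zip(dataset.index, labels))
print(cluster_of)
```

Scaling matters here because k-means relies on Euclidean distances, so a column with a much larger range would otherwise dominate the clustering.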
If your data frame is heterogeneously typed, the dtype of the corresponding numpy array will be object, which is not suitable for scikit-learn. You need to extract a numerical representation for all the relevant features (for instance by extracting dummy variables for categorical features) and drop the columns that are not suitable features (e.g. sample identifiers).
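One possible way to do this with pandas is sketched below. The frame and its column names (sample_id, color, weight) are hypothetical and exist only to show the idea: drop the identifier, then expand the categorical column into dummy variables.

```python
import numpy as np
import pandas as pd

# Hypothetical mixed-type frame: an identifier, a categorical
# feature and a numeric feature.
raw = pd.DataFrame(
    {
        "sample_id": ["s1", "s2", "s3"],
        "color": ["red", "blue", "red"],
        "weight": [1.0, 2.5, 3.0],
    }
)

# Mixed column types collapse to a single object dtype.
print(raw.values.dtype)

# Drop the identifier, which is not a feature.
features = raw.drop(columns=["sample_id"])

# Replace the categorical column with one 0/1 column per category.
features = pd.get_dummies(features, columns=["color"], dtype=float)

feature_array = features.values
print(feature_array.dtype)
print(feature_array)
```

The resulting array has a single floating-point dtype and can be passed to KMeans in the same way as the all-numeric case.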
Answer by user666
Assuming all the values in the dataframe are numeric:
import pandas
import sklearn.cluster
# Convert DataFrame to matrix
mat = dataset.values
# Using sklearn
km = sklearn.cluster.KMeans(n_clusters=5)
km.fit(mat)
# Get cluster assignment labels
labels = km.labels_
# Format results as a DataFrame
results = pandas.DataFrame([dataset.index, labels]).T
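A variation on the formatting step, sketched here rather than taken from the answer, is to store the labels in a pandas Series that reuses the DataFrame's index. This keeps each label attached to its row and also gives the dictionary the question builds by hand. The toy data and n_clusters=2 are assumptions for illustration.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Toy data shaped like the question's example (illustrative values).
dataset = pd.DataFrame(
    [[1, 2, 3, 4, 5, 6], [2, 3, 4, 5, 6, 7], [1, 4, 2, 7, 8, 1]],
    index=["A", "B", "C"],
)

km = KMeans(n_clusters=2, n_init=10, random_state=0)
km.fit(dataset.values)

# km.labels_ holds one cluster id per training row, in row order,
# so it lines up with dataset.index.
clusters = pd.Series(km.labels_, index=dataset.index, name="cluster")

# Same mapping as the question's cluster_fit_dict, without a loop.
cluster_fit_dict = clusters.to_dict()
print(clusters)
```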
Alternatively, you could try KMeans++ for Pandas.