Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same license and attribute it to the original authors (not me): StackOverflow.
Original: http://stackoverflow.com/questions/48051800/
Why does DBSCAN clustering return a single cluster on the MovieLens dataset?
Asked by T3J45
The Scenario:
I'm performing clustering on the MovieLens dataset, which I have in two formats:
OLD FORMAT:
uid iid rat
941 1 5
941 7 4
941 15 4
941 117 5
941 124 5
941 147 4
941 181 5
941 222 2
941 257 4
941 258 4
941 273 3
941 294 4
NEW FORMAT:
uid 1 2 3 4
1 5 3 4 3
2 4 3.6185548023 3.646073985 3.9238342172
3 2.8978348799 2.6692556753 2.7693015618 2.8973463681
4 4.3320762062 4.3407749532 4.3111995162 4.3411425423
940 3.7996234581 3.4979386925 3.5707888503 2
941 5 NaN NaN NaN
942 4.5762594612 4.2752554573 4.2522440019 4.3761477591
943 3.8252406362 5 3.3748860659 3.8487417604
over which I need to perform clustering using KMeans, DBSCAN and HDBSCAN. With KMeans I'm able to set the number of clusters and get results.
The Problem
The problem occurs only with DBSCAN & HDBSCAN: I'm unable to get a reasonable number of clusters (I do know the cluster count cannot be set manually).
Techniques Tried:
- Tried this with the IRIS dataset, where I found that Species wasn't included. Clearly that column is a string and, besides, is the value to be predicted, so everything works fine with that dataset (Snippet 1).
- Tried the MovieLens 100K dataset in OLD FORMAT (with and without UID), since I tried the analogy UID == Species and hence tried without it (Snippet 2).
- Tried the same with NEW FORMAT (with and without UID), yet the results ended up the same way.
Snippet 1:
from sklearn.datasets import load_iris
from sklearn.cluster import DBSCAN
import pandas as pd

print "\n\n FOR IRIS DATA-SET:"
iris = load_iris()
dbscan = DBSCAN()  # default eps=0.5, min_samples=5
d = pd.DataFrame(iris.data)
dbscan.fit(d)
print "Clusters", set(dbscan.labels_)
Snippet 1 (Output):
FOR IRIS DATA-SET:
Clusters set([0, 1, -1])
Out[30]:
array([ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, -1, 0, 0, 0, 0, 0, 0, 0, 0, 1,
1, 1, 1, 1, 1, 1, -1, 1, 1, -1, 1, 1, 1, 1, 1, 1, 1,
-1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, -1, 1, 1, 1, 1, 1, -1, 1, 1, 1, 1, -1, 1, 1, 1,
1, 1, 1, -1, -1, 1, -1, -1, 1, 1, 1, 1, 1, 1, 1, -1, -1,
1, 1, 1, -1, 1, 1, 1, 1, 1, 1, 1, 1, -1, 1, 1, -1, -1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
Snippet 2:
import pandas as pd
from sklearn.cluster import DBSCAN

data_set = pd.DataFrame()  # empty placeholder; filled below
ch = int(input("Extended Cluster Methods for:\n1. Main Matrix IBCF \n2. Main Matrix UBCF\nCh:"))
if ch == 1:  # use ==, not "is", for value comparison
    data_set = pd.read_csv("MainMatrix_IBCF.csv")
    data_set = data_set.iloc[:, 1:]  # drop the uid column
    data_set = data_set.dropna()
elif ch == 2:
    data_set = pd.read_csv("MainMatrix_UBCF.csv")
    data_set = data_set.iloc[:, 1:]
    data_set = data_set.dropna()
else:
    print "Enter Proper choice!"

print "Starting with DBSCAN for Clustering on\n", data_set.info()
db_cluster = DBSCAN()
db_cluster.fit(data_set)
print "Clusters assigned are:", set(db_cluster.labels_)
Snippet 2 (Output):
Extended Cluster Methods for:
1. Main Matrix IBCF
2. Main Matrix UBCF
Ch:>? 1
Starting with DBSCAN for Clustering on
<class 'pandas.core.frame.DataFrame'>
Int64Index: 942 entries, 0 to 942
Columns: 1682 entries, 1 to 1682
dtypes: float64(1682)
memory usage: 12.1 MB
None
Clusters assigned are: set([-1])
As seen, it returns only one cluster. I'd like to hear what I'm doing wrong.
Accepted answer by T3J45
As pointed out by @faraway and @Anony-Mousse, the solution is more a mathematical question about the dataset than a programming one.
I could finally figure out the clusters. Here's how:
import numpy as np

db_cluster = DBSCAN(eps=9.7, min_samples=2, algorithm='ball_tree', metric='minkowski', leaf_size=90, p=2)
arr = db_cluster.fit_predict(data_set)
print "Clusters assigned are:", set(db_cluster.labels_)
uni, counts = np.unique(arr, return_counts=True)
d = dict(zip(uni, counts))
print d
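A common way to find a workable eps like the one above (my own addition, not part of the original answer) is the k-distance plot: compute every point's distance to its k-th nearest neighbor, sort those distances, and look for the "elbow". A minimal sketch on synthetic data standing in for the ratings matrix:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Synthetic stand-in for the ratings matrix loaded from CSV above
rng = np.random.RandomState(0)
data = rng.rand(200, 10)

k = 4  # a common heuristic: k = minPts
# n_neighbors is k + 1 because each point counts itself as its nearest neighbour
nn = NearestNeighbors(n_neighbors=k + 1).fit(data)
dists, _ = nn.kneighbors(data)
k_dist = np.sort(dists[:, k])  # distance to the k-th real neighbour, ascending

# The "elbow" of this sorted curve is a reasonable eps; here we just print quantiles
print("k-distance quantiles:", np.percentile(k_dist, [25, 50, 75, 90]))
```

Plotting `k_dist` and reading off the bend gives a data-driven starting point instead of guessing eps by trial and error.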
The epsilon and outlier concepts became much clearer to me after this.
Answered by Has QUIT--Anony-Mousse
You need to choose appropriate parameters. With a too-small epsilon, everything becomes noise. sklearn shouldn't have a default value for this parameter; it needs to be chosen differently for each data set.
You also need to preprocess your data.
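As an illustration of the preprocessing point (a sketch of my own, assuming standardization is appropriate for the data at hand; other scalings may suit a ratings matrix better):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN

# Stand-in for the ratings matrix (the real one is read from CSV in the question)
rng = np.random.RandomState(0)
data = rng.rand(100, 5)

# Scale each column to zero mean and unit variance so no feature dominates
# the Euclidean distances that DBSCAN relies on
scaled = StandardScaler().fit_transform(data)
labels = DBSCAN(eps=1.5, min_samples=4).fit_predict(scaled)
print("Labels produced:", set(labels))
```

Without scaling, a column with a larger numeric range can swamp the distance computation and push every point into noise or a single cluster.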
It's trivial to get "clusters" with k-means that are meaningless...
Don't just call random functions. You need to understand what you are doing, or you are just wasting your time.
Answered by faraway
Firstly, you need to preprocess your data, removing any useless attributes such as IDs, and any incomplete instances (in case your chosen distance measure can't handle them).
It's important to understand that these algorithms come from two different paradigms: centroid-based (KMeans) and density-based (DBSCAN & HDBSCAN*). While centroid-based algorithms usually take the number of clusters as an input parameter, density-based algorithms need the number of neighbors (minPts) and the radius of the neighborhood (eps).
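The parameter contrast between the two paradigms can be made concrete with a toy sketch (my own illustration on synthetic blobs, not the poster's data):

```python
from sklearn.cluster import KMeans, DBSCAN
from sklearn.datasets import make_blobs

# Toy data with a known structure of 3 well-separated blobs
X, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.4, random_state=1)

# Centroid-based: you fix the number of clusters k up front
km_labels = KMeans(n_clusters=3, n_init=10, random_state=1).fit_predict(X)

# Density-based: you fix eps and minPts; the cluster count falls out of the data
db_labels = DBSCAN(eps=0.5, min_samples=4).fit_predict(X)

print("KMeans label values:", sorted(set(km_labels)))
print("DBSCAN label values:", sorted(set(db_labels)))  # -1, if present, marks noise
```

KMeans will always report exactly the k you asked for, meaningful or not; DBSCAN reports however many dense regions its eps/minPts setting detects.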
Normally in the literature the number of neighbors (minPts) is set to 4, and the radius (eps) is found by analyzing different values. You may find HDBSCAN* easier to use, as you only need to specify the number of neighbors (minPts).
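One way to "analyze different values" of eps with minPts fixed at 4, as suggested above (a sketch on toy data of my own, not the original ratings matrix):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Toy data with a known cluster structure
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=42)

for eps in (0.1, 0.3, 0.5, 1.0, 2.0):
    labels = DBSCAN(eps=eps, min_samples=4).fit_predict(X)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)  # ignore the noise label
    n_noise = int((labels == -1).sum())
    print("eps=%.1f -> %d clusters, %d noise points" % (eps, n_clusters, n_noise))
```

Sweeping eps like this makes the question's symptom visible: a too-small eps labels everything noise, a too-large eps merges everything into one cluster, and the useful range sits in between.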
If, after trying different configurations, you are still getting useless clusterings, maybe your data has no clusters at all and the KMeans output is meaningless.