Hierarchical clustering of time series in Python scipy/numpy/pandas?

Question

Asked by luca

I have a DataFrame with some time series. I created a correlation matrix from those time series, and I'd like to build a hierarchical clustering from this correlation matrix. How can I do that?

import numpy as np
import pandas as pd

# Let's pretend this DataFrame contains some time series
# (10 observations of 15 series, one series per column)
df = pd.DataFrame(np.random.randn(150).reshape(10, 15))

          0         1         2  ...        13        14
0  0.369746  0.093882 -0.656211  ... -0.596936  0.095960
1  0.641457  1.120405 -0.468639  ... -2.070802 -1.254159
2  0.360756 -0.222554  0.367893  ...  0.566299  0.932898
3  0.733130  0.666270 -0.624351  ... -0.377017  0.340360
4 -0.263967  1.143818  0.554947  ...  0.220406 -0.585353
5  0.082964 -0.311667  1.323161  ... -1.190672 -0.828039
6  0.173685  0.719818 -0.881854  ... -1.048066 -1.388395
7  0.118301 -0.268945  0.909022  ...  0.094301  1.111376
8 -1.341381  0.599435 -0.318425  ...  1.053272 -0.763416
9 -1.146692  0.453125  0.150241  ...  0.454584  1.506249

#
# I can create a correlation matrix like this 
#
correlation_matrix = df.corr(method='spearman')

          0         1  ...          13         14 
0   1.000000 -0.139394 ...    0.090909   0.309091 
1  -0.139394  1.000000 ...   -0.636364   0.115152 
2   0.175758  0.733333 ...   -0.515152  -0.163636 
3   0.309091  0.163636 ...   -0.248485  -0.127273 
4   0.600000 -0.103030 ...    0.151515   0.175758 
5  -0.078788  0.054545 ...   -0.296970  -0.187879 
6  -0.175758 -0.272727 ...    0.151515  -0.139394 
7   0.163636 -0.042424 ...    0.187879   0.248485 
8   0.030303  0.915152 ...   -0.430303   0.296970 
9  -0.696970  0.321212 ...   -0.236364  -0.151515 
10  0.163636  0.115152 ...   -0.163636   0.381818 
11  0.321212 -0.236364 ...   -0.127273  -0.224242 
12 -0.054545 -0.200000 ...    0.078788   0.236364 
13  0.090909 -0.636364 ...    1.000000   0.381818 
14  0.309091  0.115152 ...    0.381818   1.000000

Now, how can I build the hierarchical clustering on this matrix?

Answer 1

Answered by luca

Here is a step-by-step guide to building the hierarchical clustering and dendrogram from our time series using SciPy. Note that scikit-learn (a powerful data analysis library built on top of SciPy) also implements many other clustering algorithms.

First we build some synthetic time series to work with. We'll build 6 groups of correlated time series, and we expect the hierarchical clustering to detect those six groups.

import numpy as np
import seaborn as sns
import pandas as pd
from scipy import stats
import scipy.cluster.hierarchy as hac
import matplotlib.pyplot as plt

#
# build 6 time series groups for testing, called: a, b, c, d, e, f
#

num_samples = 61
group_size = 10

#
# create the main time series for each group
#

x = np.linspace(0, 5, num_samples)
scale = 4

a = scale * np.sin(x)
b = scale * (np.cos(1+x*3) + np.linspace(0, 1, num_samples))
c = scale * (np.sin(2+x*6) + np.linspace(0, -1, num_samples))
d = scale * (np.cos(3+x*9) + np.linspace(0, 4, num_samples))
e = scale * (np.sin(4+x*12) + np.linspace(0, -4, num_samples))
f = scale * np.cos(x)

#
# from each main series build 'group_size' series
#

frames = []
ax = None
for arr in [a, b, c, d, e, f]:
    # Add per-sample noise plus a random vertical offset to each series
    arr = arr + np.random.rand(group_size, num_samples) + np.random.randn(group_size, 1)
    df = pd.DataFrame(arr)
    frames.append(df)

    # Plot the group with seaborn: lineplot draws one line per column,
    # so pass the transposed frame (one series per column)
    ax = sns.lineplot(ax=ax, data=df.T, dashes=False, legend=False)

# Stack all groups into a single frame with one series per row
timeSeries = pd.concat(frames)

plt.show()

Now we do the clustering and plot the dendrogram:

# Do the clustering
Z = hac.linkage(timeSeries, method='single', metric='correlation')

# Plot dendrogram
plt.figure(figsize=(25, 10))
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('sample index')
plt.ylabel('distance')
hac.dendrogram(
    Z,
    leaf_rotation=90.,  # rotates the x axis labels
    leaf_font_size=8.,  # font size for the x axis labels
)
plt.show()
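
The question asks about clustering from an already computed correlation matrix, whereas the code above clusters the raw series directly. The following is a minimal sketch of the matrix-based route, assuming that 1 - correlation is an acceptable distance. The stand-in data and the names `series`, `corr`, `dist` and `condensed` are illustrative and do not come from the original answer.

```python
import numpy as np
import pandas as pd
import scipy.cluster.hierarchy as hac
from scipy.spatial.distance import squareform

# Hypothetical stand-in data: 3 groups of 5 noisy series, one series per row
rng = np.random.default_rng(0)
x = np.linspace(0, 5, 61)
bases = [np.sin(x), np.cos(3 * x), x]
rows = [base + 0.1 * rng.standard_normal(x.size) for base in bases for _ in range(5)]
series = pd.DataFrame(rows)

# DataFrame.corr correlates columns, so transpose to correlate the series (rows)
corr = series.T.corr(method='spearman')

# Turn correlation into a distance in [0, 2]
dist = np.clip(1.0 - corr.to_numpy(), 0.0, 2.0)
dist = (dist + dist.T) / 2.0   # remove tiny floating-point asymmetries
np.fill_diagonal(dist, 0.0)    # squareform checks for an exactly zero diagonal

# linkage expects the condensed (upper-triangle) form of a distance matrix
condensed = squareform(dist, checks=True)
Z = hac.linkage(condensed, method='single')

# Ask for as many flat clusters as there are groups in the stand-in data
labels = hac.fcluster(Z, 3, criterion='maxclust')
print(labels)
```

Passing the square matrix straight to `linkage` would be interpreted as a table of observations rather than distances, which is why the sketch converts it with `squareform` first.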

If we want to choose which kind of correlation to apply, or to use another distance metric, we can provide a custom metric function:

# Here we use Spearman correlation
def my_metric(x, y):
    r = stats.spearmanr(x, y)[0]
    return 1 - r # correlation to distance: range 0 to 2

# Do the clustering
Z = hac.linkage(timeSeries, method='single', metric=my_metric)

# Plot dendrogram
plt.figure(figsize=(25, 10))
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('sample index')
plt.ylabel('distance')
hac.dendrogram(
    Z,
    leaf_rotation=90.,  # rotates the x axis labels
    leaf_font_size=8.,  # font size for the x axis labels
)
plt.show()
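
A Python callback as the metric is invoked once per pair of series, which can become slow for many series. As a possible alternative for the Spearman case, one can rank each series once and then use the built-in 'correlation' metric, since Spearman correlation is Pearson correlation computed on ranks. The sketch below checks that both routes give the same distances; the random data and the names `spearman_distance`, `d_callback` and `d_ranked` are assumptions made for illustration.

```python
import numpy as np
import scipy.cluster.hierarchy as hac
from scipy import stats
from scipy.spatial.distance import pdist

# Hypothetical data: 8 series with 40 samples each, one series per row
rng = np.random.default_rng(1)
data = rng.standard_normal((8, 40))

def spearman_distance(u, v):
    # 1 - Spearman rho, so the distance ranges from 0 to 2
    return 1.0 - stats.spearmanr(u, v)[0]

# Callback route: the Python function runs for every pair of rows
d_callback = pdist(data, metric=spearman_distance)

# Rank route: rank each row once, then use the built-in Pearson-based metric
ranks = stats.rankdata(data, axis=1)
d_ranked = pdist(ranks, metric='correlation')

same = np.allclose(d_callback, d_ranked)
print(same)

# The ranked data can be handed to linkage like any observation matrix
Z = hac.linkage(ranks, method='single', metric='correlation')
```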

To retrieve the clusters we can use the fcluster function. It can be run in multiple ways (check the documentation), but in this example we'll give it the number of clusters we want as the target:

from scipy.cluster.hierarchy import fcluster

def print_clusters(timeSeries, Z, k, plot=False):
    # k: number of clusters to extract
    results = fcluster(Z, k, criterion='maxclust')

    # Group the row positions by cluster label
    s = pd.Series(results)
    clusters = s.unique()

    for c in clusters:
        cluster_indices = s[s == c].index
        print("Cluster %d number of entries %d" % (c, len(cluster_indices)))
        if plot:
            # One line per series in this cluster
            timeSeries.T.iloc[:, cluster_indices].plot()
            plt.show()

print_clusters(timeSeries, Z, 6, plot=False)

Output:

Cluster 2 number of entries 10
Cluster 5 number of entries 10
Cluster 3 number of entries 10
Cluster 6 number of entries 10
Cluster 1 number of entries 10
Cluster 4 number of entries 10
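
Besides a target number of clusters, fcluster can also cut the tree at a distance threshold via criterion='distance'. The sketch below illustrates this on small synthetic data and cross-tabulates the resulting labels against the known groups. The data, the threshold of 0.5 and the names `true_group` and `table` are assumptions chosen for illustration, not values from the original answer.

```python
import numpy as np
import pandas as pd
import scipy.cluster.hierarchy as hac

# Hypothetical data: 3 known groups of 5 noisy series, one series per row
rng = np.random.default_rng(2)
x = np.linspace(0, 5, 61)
bases = {'a': np.sin(x), 'b': np.cos(3 * x), 'c': x}
true_group = [name for name in bases for _ in range(5)]
data = np.array([bases[g] + 0.1 * rng.standard_normal(x.size) for g in true_group])

Z = hac.linkage(data, method='single', metric='correlation')

# Cut the tree wherever the correlation distance exceeds 0.5
# (an arbitrary threshold picked for this example)
labels = hac.fcluster(Z, t=0.5, criterion='distance')

# Rows are the known groups, columns are the cluster labels found
table = pd.crosstab(pd.Series(true_group, name='true group'),
                    pd.Series(labels, name='cluster'))
print(table)
```

With real data the true groups are usually unknown, so a threshold is typically read off the dendrogram instead.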

Answer 2

Answered by jarot

In recent seaborn releases, where sns.tsplot is no longer available, use sns.lineplot instead:

ax = sns.lineplot(ax=ax, data=df.values, dashes=False)

Remember to pass the transposed data, so that each series is a column, i.e. df = pd.DataFrame(np.transpose(arr)).
