Note: this page reproduces a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow, original URL: http://stackoverflow.com/questions/19064772/

Time: 2020-08-19 12:41:34  Source: igfitidea

Visualization of scatter plots with overlapping points in matplotlib

Tags: python, matplotlib, plot, visualization, scatter-plot

Asked by papafe

I have to represent about 30,000 points in a scatter plot in matplotlib. These points belong to two different classes, so I want to depict them with different colors.

I succeeded in doing so, but there is an issue. The points overlap in many regions, and the class that I plot last is drawn on top of the other one, hiding it. Furthermore, with a scatter plot it is not possible to show how many points lie in each region. I have also tried to make a 2D histogram with histogram2d and imshow, but it is difficult to show the points belonging to both classes in a clear way.

Can you suggest a way to make clear both the distribution of the classes and the concentration of the points?

EDIT: To be more clear, this is the link to my data file in the format "x,y,class"

Accepted answer by tom10

One approach is to plot the data as a scatter plot with a low alpha, so you can see the individual points as well as a rough measure of density. (The downside to this is that the approach has a limited range of overlap it can show -- i.e., a maximum density of about 1/alpha.)

Here's an example:

As you can imagine, because of the limited range of overlaps that can be expressed, there's a tradeoff between visibility of the individual points and the expression of amount of overlap (and the size of the marker, plot, etc).

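import numpy as np
import matplotlib.pyplot as plt

N = 10000
mean = [0, 0]
cov = [[2, 1], [1, 2]]  # covariance matrix; must be symmetric
x, y = np.random.multivariate_normal(mean, cov, N).T

# Large, nearly transparent markers: overlapping points accumulate
# opacity, so darker regions indicate higher density.
plt.scatter(x, y, s=70, alpha=0.03)
plt.ylim((-5, 5))
plt.xlim((-5, 5))
plt.show()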
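
To get a feel for the "about 1/alpha" limit mentioned above, here is a small sketch of how opacity builds up as identical markers stack. It assumes standard "over" alpha compositing, where each layer lets a fraction (1 - alpha) of what lies beneath show through; the helper name stacked_opacity is made up for this illustration and is not part of matplotlib.

```python
def stacked_opacity(alpha, n):
    """Opacity of n identical markers drawn on top of each other.

    Assumption: standard "over" compositing, i.e. each layer lets a
    fraction (1 - alpha) of what is beneath it show through.
    """
    return 1.0 - (1.0 - alpha) ** n


alpha = 0.03  # same value as in the scatter call above
for n in (1, 10, 33, 100, 300):
    print(f"{n:4d} overlapping points -> opacity {stacked_opacity(alpha, n):.2f}")
```

Under this assumption the stack is roughly 63% opaque once about 1/alpha points overlap and about 95% opaque at 3/alpha, after which additional points are practically indistinguishable.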

(I'm assuming here you meant 30e3 points, not 30e6. For 30e6, I think some type of averaged density plot would be necessary.)
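
One possible form of such a binned density plot, as a rough sketch rather than a recommendation from this answer: count the points of each class on a shared grid with np.histogram2d and show one panel per class on a common colour scale. The two synthetic classes, the grid of 50 x 50 bins, and the names xy_a, xy_b, counts_a, counts_b are all assumptions standing in for the real "x,y,class" data.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Synthetic stand-in for the "x,y,class" data: two overlapping classes.
n = 15000
xy_a = rng.normal(loc=(-1.0, 0.0), scale=1.0, size=(n, 2))
xy_b = rng.normal(loc=(1.0, 0.0), scale=1.0, size=(n, 2))

# Shared bin edges so the two histograms are directly comparable.
edges = np.linspace(-5, 5, 51)
counts_a, _, _ = np.histogram2d(xy_a[:, 0], xy_a[:, 1], bins=[edges, edges])
counts_b, _, _ = np.histogram2d(xy_b[:, 0], xy_b[:, 1], bins=[edges, edges])

fig, axes = plt.subplots(1, 2, sharex=True, sharey=True, figsize=(9, 4))
vmax = max(counts_a.max(), counts_b.max())  # common colour scale
for ax, counts, title in zip(axes, (counts_a, counts_b), ("class A", "class B")):
    # histogram2d puts x along the first axis, so transpose for imshow.
    im = ax.imshow(counts.T, origin="lower", extent=(-5, 5, -5, 5),
                   vmin=0, vmax=vmax, cmap="viridis")
    ax.set_title(title)
fig.colorbar(im, ax=axes, label="points per bin")
# plt.show()  # uncomment to display the figure
```

Because the counts are aggregated before drawing, the cost of rendering depends on the number of bins rather than the number of points, which is why this kind of plot scales to far larger data sets than a scatter plot.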

Answer by vishakad

You could also colour the points by first computing a kernel density estimate of the distribution of the scatter, and then using the density values to specify a colour for each point. To modify the code in the earlier example:

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde as kde
from matplotlib.colors import Normalize
from matplotlib import cm

N = 10000
mean = [0, 0]
cov = [[2, 1], [1, 2]]  # covariance matrix; must be symmetric

samples = np.random.multivariate_normal(mean, cov, N).T
densObj = kde(samples)

def makeColours(vals):
    # Scale the density values linearly onto [0, 1].
    norm = Normalize(vmin=vals.min(), vmax=vals.max())

    # Can put any colormap you like here.
    return cm.ScalarMappable(norm=norm, cmap='jet').to_rgba(vals)

# Evaluate the estimated density at every sample and map it to a colour.
colours = makeColours(densObj.evaluate(samples))

plt.scatter(samples[0], samples[1], color=colours)
plt.show()

Scatter plot with density information

I learnt this trick a while ago when I noticed the documentation of the scatter function --

c : color or sequence of color, optional, default : 'b'

c can be a single color format string, or a sequence of color specifications of length N, or a sequence of N numbers to be mapped to colors using the cmap and norm specified via kwargs (see below). Note that c should not be a single numeric RGB or RGBA sequence because that is indistinguishable from an array of values to be colormapped. c can be a 2-D array in which the rows are RGB or RGBA, however, including the case of a single row to specify the same color for all points.
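
The "sequence of N numbers" case in that passage suggests a shorter variant, sketched here under a few assumptions: the density values are handed straight to c and scatter does the colour mapping through cmap. The sample size of 2000, the symmetric covariance, the viridis colormap, and the sorting step (so the densest points are drawn last) are choices made for this illustration, not part of the quoted documentation.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
# Array of shape (2, N): row 0 holds x, row 1 holds y.
samples = rng.multivariate_normal([0, 0], [[2, 1], [1, 2]], size=2000).T

# Density estimate evaluated at each sample point.
density = gaussian_kde(samples)(samples)

# Draw the densest points last so they are not hidden by sparse ones.
order = np.argsort(density)
x, y, density = samples[0, order], samples[1, order], density[order]

fig, ax = plt.subplots()
# c receives N plain numbers; scatter maps them through cmap itself.
sc = ax.scatter(x, y, c=density, cmap="viridis", s=10)
fig.colorbar(sc, ax=ax, label="estimated density")
# plt.show()  # uncomment to display the figure
```

Passing numbers rather than precomputed RGBA values also keeps the mapping attached to the plot, so a colorbar can be added directly from the returned collection.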
