Python: Scatter plot with a huge amount of data

Note: this page reproduces a popular Stack Overflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original Stack Overflow question: http://stackoverflow.com/questions/4082298/

Date: 2020-08-18 14:09:50  Source: igfitidea

Scatter plot with a huge amount of data

Tags: python, numpy, matplotlib

Asked by Nicola Vianello

I would like to use Matplotlib to generate a scatter plot with a huge amount of data (about 3 million points). I have 3 vectors of the same length and I currently plot them in the following way.

import matplotlib.pyplot as plt
import matplotlib.cm as cm

# delta, vf and dS are the three equal-length vectors (about 3 million values each)
fig = plt.figure()
fig.subplots_adjust(bottom=0.2)
ax = fig.add_subplot(111)
ax.scatter(delta, vf, c=dS, alpha=0.7, cmap=cm.Paired)

Nothing special. But it takes too long to generate the plot (I'm working on a MacBook Pro with 4 GB of RAM, Python 2.7 and Matplotlib 1.0). Is there any way to improve the speed?

Accepted answer by Paul

You could take the heatmap approach shown here. In this example the color represents the quantity of data in the bin, not the median value of the dS array, but that should be easy to change. More later if you are interested.
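As a rough illustration of colouring each bin by a statistic of dS instead of by the point count, here is a minimal sketch using only NumPy. It computes the mean per bin rather than the median mentioned above, because the mean can be obtained from two calls to np.histogram2d. The function name binned_mean, the bin count of 100 and the synthetic data are assumptions made for this example, not part of the original answer.

```python
import numpy as np


def binned_mean(x, y, values, bins=100):
    """Average `values` inside each cell of a regular 2-D grid over (x, y)."""
    # Number of points that fall into each cell.
    counts, xedges, yedges = np.histogram2d(x, y, bins=bins)
    # Sum of `values` in each cell: the same histogram, weighted by `values`.
    sums, _, _ = np.histogram2d(x, y, bins=[xedges, yedges], weights=values)
    # Empty cells stay NaN so that a plot can leave them blank.
    mean = np.full(counts.shape, np.nan)
    np.divide(sums, counts, out=mean, where=counts > 0)
    return mean, xedges, yedges


# Synthetic stand-ins for the three vectors in the question (an assumption).
rng = np.random.default_rng(0)
N = 3 * 10**6
delta = rng.normal(size=N)
vf = rng.normal(size=N)
dS = rng.normal(size=N)

mean_dS, xedges, yedges = binned_mean(delta, vf, dS, bins=100)
```

The result can then be drawn with something like plt.pcolormesh(xedges, yedges, mean_dS.T, cmap=cm.Paired). The transpose is needed because np.histogram2d puts x along the first axis, while pcolormesh expects rows to run along y.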

Answer by unutbu

Unless your graphic is huge, many of those 3 million points are going to overlap. (A 400x600 image only has 240K dots...)

So the easiest thing to do would be to take a sample of say, 1000 points, from your data:

import random
# random.sample needs a sequence in Python 3, so convert the NumPy array first
delta_sample = random.sample(list(delta), 1000)

and just plot that.

For example:

import matplotlib.pyplot as plt
import matplotlib.cm as cm
import numpy as np
import random

fig = plt.figure()
fig.subplots_adjust(bottom=0.2)
ax = fig.add_subplot(111)

N=3*10**6
delta=np.random.normal(size=N)
vf=np.random.normal(size=N)
dS=np.random.normal(size=N)

idx=random.sample(range(N),1000)

plt.scatter(delta[idx],vf[idx],c=dS[idx],alpha=0.7,cmap=cm.Paired)
plt.show()


Or, if you need to pay more attention to outliers, then perhaps you could bin your data using np.histogram, and then compose a delta_sample which has representatives from each bin.

Unfortunately, when using np.histogram I don't think there is any easy way to associate bins with individual data points. A simple but approximate solution is to use the location of a point in or on the bin edge itself as a proxy for the points in it:

xedges=np.linspace(-10,10,100)
yedges=np.linspace(-10,10,100)
zedges=np.linspace(-10,10,10)
hist,edges=np.histogramdd((delta,vf,dS), (xedges,yedges,zedges))
xidx,yidx,zidx=np.where(hist>0)
plt.scatter(xedges[xidx],yedges[yidx],c=zedges[zidx],alpha=0.7,cmap=cm.Paired)
plt.show()
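If real data points are wanted as representatives instead of bin edges, one possible sketch is to look up each point's bin with np.digitize and keep the first point found in every occupied cell. This is an alternative to the bin-edge proxy above, not the method from the answer. The function name one_point_per_bin and the choice of "first point seen" as the representative are assumptions made for illustration.

```python
import numpy as np


def one_point_per_bin(x, y, xedges, yedges):
    """Return the index of one data point for every occupied (x, y) bin."""
    # np.digitize reports, for each point, which bin it falls into.
    # Subtracting 1 turns that into a 0-based bin number.
    xbin = np.digitize(x, xedges) - 1
    ybin = np.digitize(y, yedges) - 1
    # Ignore points outside the grid (this also drops points that lie
    # exactly on the last edge).
    nx, ny = len(xedges) - 1, len(yedges) - 1
    inside = (xbin >= 0) & (xbin < nx) & (ybin >= 0) & (ybin < ny)
    candidates = np.flatnonzero(inside)
    # Combine the two bin numbers into a single id per cell.
    cell = xbin[candidates] * ny + ybin[candidates]
    # return_index=True gives the first point seen in each distinct cell.
    _, first = np.unique(cell, return_index=True)
    return candidates[first]


# Synthetic stand-ins for the vectors in the question (an assumption).
rng = np.random.default_rng(0)
N = 3 * 10**6
delta = rng.normal(size=N)
vf = rng.normal(size=N)
dS = rng.normal(size=N)

xedges = np.linspace(-10, 10, 100)
yedges = np.linspace(-10, 10, 100)
idx = one_point_per_bin(delta, vf, xedges, yedges)
```

The selected points can be plotted as in the sampling example, for instance plt.scatter(delta[idx], vf[idx], c=dS[idx], alpha=0.7, cmap=cm.Paired). Every occupied cell contributes exactly one point, so isolated outliers survive while dense regions are thinned out.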


Answer by conjectures

What about trying pyplot.hexbin? It generates a sort of heatmap based on point density in a set number of bins.
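A minimal sketch of how that might look, assuming the data are the delta, vf and dS vectors from the question. The two helper names are hypothetical. The keyword arguments gridsize, mincnt, C and reduce_C_function are parameters of Matplotlib's hexbin. The second helper colours each hexagon by the median of dS instead of by the point count.

```python
import numpy as np


def hexbin_density(ax, x, y, gridsize=100):
    """Colour each hexagon by how many points fall inside it."""
    # mincnt=1 leaves empty hexagons undrawn.
    return ax.hexbin(x, y, gridsize=gridsize, mincnt=1)


def hexbin_median(ax, x, y, values, gridsize=100):
    """Colour each hexagon by the median of `values` inside it."""
    # When C is given, hexbin applies reduce_C_function to the values
    # of the points in each hexagon instead of counting them.
    return ax.hexbin(x, y, C=values, gridsize=gridsize,
                     reduce_C_function=np.median)
```

With a figure set up as in the question, a call such as hexbin_median(ax, delta, vf, dS) followed by plt.show() would draw the plot. The drawing cost then depends on the number of hexagons rather than on the 3 million points.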
