Scatter plot with a huge amount of data

Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverflow

Original question: http://stackoverflow.com/questions/4082298/

Asked by Nicola Vianello
I would like to use Matplotlib to generate a scatter plot with a huge amount of data (about 3 million points). I have 3 vectors of the same dimension, and I plot them in the following way.
import matplotlib.pyplot as plt
import matplotlib.cm as cm

# delta, vf, dS are the three equal-length data vectors
fig = plt.figure()
fig.subplots_adjust(bottom=0.2)
ax = fig.add_subplot(111)
plt.scatter(delta, vf, c=dS, alpha=0.7, cmap=cm.Paired)
Nothing special, really, but it takes too long to generate (I'm working on a MacBook Pro with 4 GB RAM, Python 2.7, and Matplotlib 1.0). Is there any way to improve the speed?
Accepted answer by Paul

Answer by unutbu
Unless your graphic is huge, many of those 3 million points are going to overlap. (A 400x600 image only has 240K pixels...)
So the easiest thing to do would be to take a sample of, say, 1000 points from your data:
import random

delta_sample = random.sample(delta, 1000)
and just plot that.
For example:
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import numpy as np
import random

fig = plt.figure()
fig.subplots_adjust(bottom=0.2)
ax = fig.add_subplot(111)

N = 3 * 10**6
delta = np.random.normal(size=N)
vf = np.random.normal(size=N)
dS = np.random.normal(size=N)

# Draw 1000 random indices and plot only those points
idx = random.sample(range(N), 1000)
plt.scatter(delta[idx], vf[idx], c=dS[idx], alpha=0.7, cmap=cm.Paired)
plt.show()


Or, if you need to pay more attention to outliers, you could bin your data using np.histogram and then compose a delta_sample which has representatives from each bin.
Unfortunately, when using np.histogram I don't think there is any easy way to associate bins with individual data points. A simple but approximate solution is to use the location of a point on the bin edge itself as a proxy for the points in it:
xedges = np.linspace(-10, 10, 100)
yedges = np.linspace(-10, 10, 100)
zedges = np.linspace(-10, 10, 10)
hist, edges = np.histogramdd((delta, vf, dS), (xedges, yedges, zedges))

# Plot one marker per non-empty bin, placed at the bin's lower edge
xidx, yidx, zidx = np.where(hist > 0)
plt.scatter(xedges[xidx], yedges[yidx], c=zedges[zidx], alpha=0.7, cmap=cm.Paired)
plt.show()
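If you want real data points rather than bin-edge proxies, one way to sketch the "representatives from each bin" idea (my own illustration, not from the original answer) is to label each point with np.digitize and keep the first point seen in each occupied 2-D bin:

```python
import numpy as np

rng = np.random.default_rng(0)
delta = rng.normal(size=3 * 10**6)
vf = rng.normal(size=delta.size)

xedges = np.linspace(-10, 10, 100)
yedges = np.linspace(-10, 10, 100)

# Assign every point to a bin along each axis, then flatten the
# (xbin, ybin) pair into a single integer bin id per point.
xbin = np.digitize(delta, xedges)
ybin = np.digitize(vf, yedges)
flat = xbin * (yedges.size + 1) + ybin

# np.unique returns the index of the first point in each distinct bin,
# giving one actual representative data point per occupied bin.
_, first_idx = np.unique(flat, return_index=True)
delta_sample = delta[first_idx]
vf_sample = vf[first_idx]
```

This keeps outliers, since any bin containing even a single point contributes one marker to the plot.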


Answer by conjectures

What about trying pyplot.hexbin? It generates a sort of heatmap based on point density in a set number of bins.
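A minimal sketch of this approach (with randomly generated data standing in for delta and vf), where each hexagonal cell is colored by how many points fall in it, so nothing is drawn per-point:

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np

N = 3 * 10**6
x = np.random.normal(size=N)
y = np.random.normal(size=N)

fig, ax = plt.subplots()
# gridsize controls the number of hexagonal bins across the x-axis;
# the color of each cell encodes the count of points inside it.
hb = ax.hexbin(x, y, gridsize=50, cmap='Paired')
fig.colorbar(hb, ax=ax, label='counts')
fig.savefig('hexbin.png')
```

Because the rendering cost depends on the number of bins rather than the number of points, this stays fast even for millions of points.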

