Notice: this page reproduces a popular StackOverflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/14300137/

Time: 2020-08-18 11:01:13  Source: igfitidea

making matplotlib scatter plots from dataframes in Python's pandas

Tags: python, matplotlib, plot, dataframe, pandas

Question

What is the best way to make a series of scatter plots using matplotlib from a pandas dataframe in Python?

For example, if I have a dataframe df that has some columns of interest, I typically find myself converting everything to arrays:

import matplotlib.pyplot as plt
# df is a DataFrame: fetch col1 and col2
# and drop na rows if any of the columns are NA
mydata = df[["col1", "col2"]].dropna(how="any")
# Now plot with matplotlib
vals = mydata.values
plt.scatter(vals[:, 0], vals[:, 1])

The problem with converting everything to array before plotting is that it forces you to break out of dataframes.

Consider these two use cases where having the full dataframe is essential to plotting:

  1. For example, what if you now wanted to look at the values of col3 that correspond to the points you plotted in the call to scatter, and color (or size) each point by that value? You'd have to go back, pull out the non-NA values of col1, col2 and check what their corresponding values are.

    Is there a way to plot while preserving the dataframe? For example:

    mydata = df.dropna(how="any", subset=["col1", "col2"])
    # plot a scatter of col1 by col2, with sizes according to col3
    scatter(mydata[["col1", "col2"]], s=mydata["col3"])
    
  2. Similarly, imagine that you wanted to filter or color each point differently depending on the values of some of its columns. E.g. what if you wanted to automatically plot the labels of the points that meet a certain cutoff on col1, col2 alongside them (where the labels are stored in another column of the df), or color these points differently, like people do with dataframes in R? For example:

    mydata = df.dropna(how="any", subset=["col1", "col2"]) 
    myscatter = scatter(mydata[["col1", "col2"]], s=1)
    # Plot in red, with smaller size, all the points that 
    # have a col2 value greater than 0.5
    myscatter.replot(mydata["col2"] > 0.5, color="red", s=0.5)
    
How can this be done?

EDIT: Reply to crewbum:

You say that the best way is to plot each condition (like subset_a, subset_b) separately. What if you have many conditions, e.g. you want to split up the scatters into 4 types of points or even more, plotting each in a different shape/color? How can you elegantly apply conditions a, b, c, etc. and make sure you then plot "the rest" (things not in any of these conditions) as the last step?

Similarly, in your example where you plot col1, col2 differently based on col3, what if there are NA values that break the association between col1, col2, col3? For example, suppose you want to plot all col2 values based on their col3 values, but some rows have an NA value in either col1 or col3, forcing you to use dropna first. So you would do:

mydata = df.dropna(how="any", subset=["col1", "col2", "col3"])

Then you can plot using mydata as you show, plotting the scatter of col1, col2 using the values of col3. But mydata will be missing some points that have values for col1, col2 but are NA for col3, and those still have to be plotted. So how would you plot "the rest" of the data, i.e. the points that are not in the filtered set mydata?

Accepted answer by Garrett

Try passing columns of the DataFrame directly to matplotlib, as in the examples below, instead of extracting them as numpy arrays.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame(np.random.randn(10,2), columns=['col1','col2'])
df['col3'] = np.arange(len(df))**2 * 100 + 100

In [5]: df
Out[5]: 
       col1      col2  col3
0 -1.000075 -0.759910   100
1  0.510382  0.972615   200
2  1.872067 -0.731010   500
3  0.131612  1.075142  1000
4  1.497820  0.237024  1700

Vary scatter point size based on another column

plt.scatter(df.col1, df.col2, s=df.col3)
# OR (with pandas 0.13 and up)
df.plot(kind='scatter', x='col1', y='col2', s=df.col3)

Vary scatter point color based on another column

colors = np.where(df.col3 > 300, 'r', 'k')
plt.scatter(df.col1, df.col2, s=120, c=colors)
# OR (with pandas 0.13 and up)
df.plot(kind='scatter', x='col1', y='col2', s=120, c=colors)
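
The two-color example above thresholds col3. As a minimal sketch of a related option, a numeric column can also be passed straight to c so that matplotlib maps it through a colormap. The seed, the "viridis" colormap, the output filename, and the variable names below are assumptions made for illustration, not part of the original answer.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for an interactive window
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Same shape of sample data as in the answer, with a fixed seed (assumption).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((10, 2)), columns=["col1", "col2"])
df["col3"] = np.arange(len(df)) ** 2 * 100 + 100

fig, ax = plt.subplots()
# A numeric c is mapped through the colormap, one color per row of col3.
points = ax.scatter(df["col1"], df["col2"], s=120, c=df["col3"], cmap="viridis")
# The colorbar shows which color corresponds to which col3 value.
fig.colorbar(points, ax=ax, label="col3")
fig.savefig("scatter_color_by_col3.png")  # hypothetical output filename
```

This keeps everything in the DataFrame: no intermediate array of color names is needed when the coloring variable is continuous.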

Scatter plot with legend

However, the easiest way I've found to create a scatter plot with a legend is to call plt.scatter once for each point type.

cond = df.col3 > 300
subset_a = df[cond].dropna()
subset_b = df[~cond].dropna()
plt.scatter(subset_a.col1, subset_a.col2, s=120, c='b', label='col3 > 300')
plt.scatter(subset_b.col1, subset_b.col2, s=60, c='r', label='col3 <= 300') 
plt.legend()

Update

From what I can tell, matplotlib simply skips points with NA x/y coordinates or NA style settings (e.g., color/size). To find points skipped due to NA, try the isnull method: df[df.col3.isnull()]
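
One way to plot "the rest" from the question, sketched under assumptions: split the rows that have usable coordinates into those with a known col3 and those without, then draw each group in its own scatter call. The two NA rows, the size scaling, and the names has_xy, sized, and rest are made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for an interactive window
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((10, 2)), columns=["col1", "col2"])
df["col3"] = np.arange(len(df), dtype=float) ** 2 * 100 + 100
df.loc[[2, 7], "col3"] = np.nan  # pretend two rows have no col3 value

# Rows that can be placed on the axes at all.
has_xy = df.dropna(subset=["col1", "col2"])
# Split on col3 only, so no plottable point is silently lost.
sized = has_xy[has_xy["col3"].notna()]
rest = has_xy[has_xy["col3"].isna()]

fig, ax = plt.subplots()
ax.scatter(sized["col1"], sized["col2"], s=sized["col3"] / 10,
           c="b", label="col3 known")
ax.scatter(rest["col1"], rest["col2"], s=60,
           c="gray", marker="x", label="col3 missing")
ax.legend()
```

Because the split is done with boolean masks on the same DataFrame, the two groups are complementary by construction.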

To split a list of points into many types, take a look at numpy's select function, which is a vectorized if-then-else implementation and accepts an optional default value. For example:

df['subset'] = np.select([df.col3 < 150, df.col3 < 400, df.col3 < 600],
                         [0, 1, 2], -1)
for color, label in zip('bgrm', [0, 1, 2, -1]):
    subset = df[df.subset == label]
    plt.scatter(subset.col1, subset.col2, s=120, c=color, label=str(label))
plt.legend()
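
If each type should also get its own marker shape, the loop above can be driven by a small style table and a groupby. This is only a sketch: the label names, the styles dict, and the chosen colors and markers are assumptions for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for an interactive window
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((10, 2)), columns=["col1", "col2"])
df["col3"] = np.arange(len(df)) ** 2 * 100 + 100

# The first matching condition wins; rows matching none get the default.
conditions = [df["col3"] < 150, df["col3"] < 400, df["col3"] < 600]
labels = ["low", "mid", "high"]
df["subset"] = np.select(conditions, labels, default="rest")

# Hypothetical style table: one (color, marker) pair per label.
styles = {"low": ("b", "o"), "mid": ("g", "s"),
          "high": ("r", "^"), "rest": ("m", "x")}

fig, ax = plt.subplots()
for name, group in df.groupby("subset"):
    color, marker = styles[name]
    ax.scatter(group["col1"], group["col2"], s=120,
               c=color, marker=marker, label=name)
ax.legend()
```

Adding another point type then means adding one condition, one label, and one entry in the style table, while "rest" keeps catching everything else.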

Answer by serv-inc

There is little to add to Garrett's great answer, but pandas also has a scatter method. Using that, it's as easy as

df = pd.DataFrame(np.random.randn(10,2), columns=['col1','col2'])
df['col3'] = np.arange(len(df))**2 * 100 + 100
df.plot.scatter(x='col1', y='col2', s=df['col3'])

[Figure: scatter plot of col1 vs. col2 with point sizes taken from col3]

Answer by Dr. Arslan

I recommend an alternative method using seaborn, which is a more powerful tool for data plotting. You can use seaborn's scatterplot and map column 3 to both hue and size.

Working code:

import pandas as pd
import seaborn as sns
import numpy as np

# create sample data
sample_data = {'col_name_1': np.random.rand(20),
               'col_name_2': np.random.rand(20),
               'col_name_3': np.arange(20) * 100}
df = pd.DataFrame(sample_data)
sns.scatterplot(x="col_name_1", y="col_name_2", data=df,
                hue="col_name_3", size="col_name_3")
