Disclaimer: this page reproduces a popular StackOverflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/21109521/

Date: 2020-09-13 21:33:46  Source: igfitidea

Pandas: plot multiple columns to same x value

python matplotlib plot pandas

Asked by erikfas

This is a follow-up to a previous question regarding data analysis with pandas. I now want to plot my data, which looks like this:

PrEST ID  Gene   Sequence            Ratio1   Ratio2   Ratio3
HPRR12    ATF1   TTPSAXXXXXXXXXTTTK  6.3222   4.0558   4.958
HPRR23    CREB1  KIXXXXXXXXPGVPR     NaN      NaN      NaN
HPRR23    CREB1  ILNXXXXXXXXGVPR     0.22691  2.077    NaN
HPRR15    ELK4   IEGDCEXXXXXXXGGK    1.177    NaN      12.073
HPRR15    ELK4   SPXXXXXXXXXXXSVIK   8.66     14.755   NaN
HPRR15    ELK4   IEGDCXXXXXXXVSSSSK  15.745   7.9122   9.5966

... except there are a bunch more rows, and I don't actually want to plot the ratios but some other calculated values derived from them, but it doesn't matter for my plotting problem. I have a dataframe that looks more or less like that data above, and what I want is this:

  • Each row (3 ratios) should be plotted against the row's ID, as points
  • All rows with the same ID should be plotted to the same x value / ID, but with another colour
  • The x ticks should be the IDs, and (if possible) the corresponding gene as well (so some genes will appear on several x ticks, as they have multiple IDs mapping to them)

Below is an image that my previous, non-pandas version of this script produces:

[Image: plot produced by the previous, non-pandas version of the script]

... where the red triangles indicate values outside of a cutoff value used for setting the y-axis maximum value. The IDs are blacked-out, but you should be able to see what I'm after. Copy number is essentially the ratios with a calculation on top of them, so they're just another number rather than the ones I show in the data above.

I have tried to find similar questions and solutions in the documentation, but found none. Most people seem to need to do this with dates, for which there seem to be ready-made plotting functions, which doesn't help me (I think). Any help greatly appreciated!

Accepted answer by Noah Hafner

Skipping some of the finer points of plotting, to get:

  • Each row (3 ratios) should be plotted against the row's ID, as points
  • All rows with the same ID should be plotted to the same x value / ID, but with another colour
  • The x ticks should be the IDs, and (if possible) the corresponding gene as well (so some genes will appear on several x ticks, as they have multiple IDs mapping to them)

I suggest you try using matplotlib to handle the plotting, and manually cycle the colors. You can use something like:

import itertools

import matplotlib.pyplot as plt
import pandas as pd

# data: the last two rows share an id, so they land on the same x value
df = pd.DataFrame(
    {'id': [1, 2, 3, 3],
     'labels': ['HPRR1234', 'HPRR4321', 'HPRR2345', 'HPRR2345'],
     'g': ['KRAS', 'KRAS', 'ELK4', 'ELK4'],
     'r1': [15, 9, 15, 1],
     'r2': [14, 8, 7, 0],
     'r3': [14, 16, 9, 12]})
# extra setup: padding between the tick marks and the tick labels
plt.rcParams['xtick.major.pad'] = 8
# plotting style(s)
marker = itertools.cycle((',', '+', '.', 'o', '*'))
color = itertools.cycle(('b', 'g', 'r', 'c', 'm', 'y', 'k'))
# plot each ratio column against the numeric id, markers only (ls='')
fig = plt.figure()
ax = fig.add_subplot(111)
# next(marker) replaces the Python 2 only marker.next()
ax.plot(df['id'], df['r1'], ls='', ms=10, mew=2,
        marker=next(marker), color=next(color))
ax.plot(df['id'], df['r2'], ls='', ms=10, mew=2,
        marker=next(marker), color=next(color))
ax.plot(df['id'], df['r3'], ls='', ms=10, mew=2,
        marker=next(marker), color=next(color))
# set the tick labels
ax.xaxis.set_ticks(df['id'])
ax.xaxis.set_ticklabels(df['labels'])
plt.setp(ax.get_xticklabels(), rotation='vertical', fontsize=12)
plt.tight_layout()
fig.savefig("example.pdf")

If you have many rows, you will probably want more colors, but this shows at least the concept.
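
If the data only has string IDs and no numeric `id` column, one possible way to get the x positions is `pd.factorize`, which gives every distinct ID one integer code. The sketch below is not part of the original answer: it reuses the answer's sample values, assumes the column names `labels`, `g`, `r1`, `r2` and `r3`, and puts the gene on a second line of each tick label.

```python
import itertools

import matplotlib.pyplot as plt
import pandas as pd

# Sample values taken from the answer above; the column names are assumptions.
df = pd.DataFrame({
    "labels": ["HPRR1234", "HPRR4321", "HPRR2345", "HPRR2345"],
    "g": ["KRAS", "KRAS", "ELK4", "ELK4"],
    "r1": [15, 9, 15, 1],
    "r2": [14, 8, 7, 0],
    "r3": [14, 16, 9, 12],
})

# factorize assigns one integer code per distinct ID, so rows that share
# an ID also share an x position.
codes, uniques = pd.factorize(df["labels"])

# One tick label per distinct ID: the ID on the first line, the gene below it.
genes = df.drop_duplicates("labels").set_index("labels")["g"]
tick_labels = [f"{label}\n{genes[label]}" for label in uniques]

markers = itertools.cycle(("+", ".", "o", "*"))
colors = itertools.cycle(("b", "g", "r", "c", "m", "y", "k"))

fig, ax = plt.subplots()
for column in ("r1", "r2", "r3"):
    # Markers only (ls=""), one marker/colour pair per ratio column.
    ax.plot(codes, df[column], ls="", ms=10, mew=2,
            marker=next(markers), color=next(colors), label=column)

# Ticks sit at 0..n-1, one per distinct ID.
ax.set_xticks(range(len(uniques)))
ax.set_xticklabels(tick_labels, rotation="vertical")
ax.margins(x=0.1)
ax.legend()
fig.tight_layout()
```

Because the ticks are built from the distinct IDs rather than from every row, there is exactly one tick per ID no matter how many rows map to it.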

Answer by erikfas

I managed to find a way to keep the string names! I thought about what you said about finding numbers for the IDs and figured I could use the index, which worked just fine.

fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(df.index, df['r1'], ls='', marker=next(marker), color=next(color))
ax.plot(df.index, df['r2'], ls='', marker=next(marker), color=next(color))
ax.plot(df.index, df['r3'], ls='', marker=next(marker), color=next(color))

ax.xaxis.set_ticks(df.index)
ax.xaxis.set_ticklabels(df['g'])

Now I've got some other problems, though. I did not realise it until now, but while plotting as above DOES work, it's not exactly in the way I wanted it. Doing it like this gives me three values per ID x tick, and then the plotting continues beyond the x-axis limits, with three more values per tick (although there are no more ticks). It looks like this:

[Image: weird plot beyond x ticks]

What is wrong here, and why won't all the values map to the correct ID?
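
Without claiming to know the exact cause of the plot above, one way to guarantee that every row with the same ID lands on the same x position, each in its own colour, is to derive x from the ID itself rather than from the row index. The following is only a sketch under assumed column names (`labels`, `r1`, `r2`, `r3`); the helper columns `x` and `row_in_id` are made up for the illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Sample values from the accepted answer; the column names are assumptions.
df = pd.DataFrame({
    "labels": ["HPRR1234", "HPRR4321", "HPRR2345", "HPRR2345"],
    "r1": [15, 9, 15, 1],
    "r2": [14, 8, 7, 0],
    "r3": [14, 16, 9, 12],
})
ratio_columns = ["r1", "r2", "r3"]

# x position: one integer per distinct ID (hypothetical helper column).
df["x"], unique_ids = pd.factorize(df["labels"])
# Position of each row within its ID group: 0 for the first row, 1 for the next.
df["row_in_id"] = df.groupby("labels").cumcount()

palette = ["b", "g", "r", "c", "m", "y", "k"]

fig, ax = plt.subplots()
for row in df.itertuples(index=False):
    values = [getattr(row, column) for column in ratio_columns]
    # All ratios of one row share an x position and a colour.
    ax.plot([row.x] * len(values), values, ls="", marker="o",
            color=palette[row.row_in_id % len(palette)])

ax.set_xticks(range(len(unique_ids)))
ax.set_xticklabels(list(unique_ids), rotation="vertical")
fig.tight_layout()
```

Here the two `HPRR2345` rows both get x = 2, the first drawn in blue and the second in green, and no points fall outside the range covered by the ticks.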

Answer by szeitlin

I have had similar problems. I think the issue you're having with mismatched labels & markers is because of how you're iterating through the data.

Suggestions for getting pandas to work:

As other people mentioned, I always start by double-checking data types. Make sure you don't have any rows with strange things in them (NaNs, symbols, or other missing values will often cause this type of error with plotting packages).

Drop NAs if you haven't already, then explicitly convert whole columns to the appropriate dtype as needed.
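
As a rough illustration of that check-then-convert step, the sketch below uses a made-up frame (`raw`, with hypothetical columns `labels`, `r1`, `r2`) in which the ratios were read in as text. It converts them with `pd.to_numeric` and then drops the rows that are still missing a value.

```python
import pandas as pd

# Hypothetical frame where the ratios arrived as text instead of numbers.
raw = pd.DataFrame({
    "labels": ["HPRR12", "HPRR23", "HPRR15"],
    "r1": ["6.3222", "n/a", "1.177"],
    "r2": ["4.0558", "2.077", None],
})

# Inspect the dtypes first: the ratio columns are not numeric yet.
print(raw.dtypes)

ratio_columns = ["r1", "r2"]
clean = raw.copy()
# errors="coerce" turns unparseable entries such as "n/a" into NaN
# instead of raising.
clean[ratio_columns] = clean[ratio_columns].apply(pd.to_numeric, errors="coerce")
# Keep only the rows where every ratio is present.
clean = clean.dropna(subset=ratio_columns)

print(clean.dtypes)
```

Converting before dropping matters here: the `"n/a"` string only becomes a NaN that `dropna` can see once `pd.to_numeric` has coerced it.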

In pandas, the 'object' dtype is not the same as a 'string', and some of the plotting packages don't like 'objects' (see below).

I have also run into strange problems sometimes if my index wasn't continuous (if you drop NAs, you may have to reindex), or if my x-axis values weren't pre-sorted.
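
A small sketch of that clean-up, on made-up data with hypothetical `id` and `r1` columns: drop the missing rows, sort by the x-axis column, then rebuild a continuous index.

```python
import pandas as pd

# Made-up data: unsorted ids and one missing ratio.
df = pd.DataFrame({"id": [3, 1, 2, 3], "r1": [15.0, None, 9.0, 1.0]})

# Dropping the NaN row leaves a gap in the index (0, 2, 3).
df = df.dropna(subset=["r1"])
# Sort by the x-axis column; a stable sort keeps rows with equal ids
# in their original relative order.
df = df.sort_values("id", kind="stable")
# Rebuild a continuous 0..n-1 index and discard the old one.
df = df.reset_index(drop=True)

print(df)
```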

(Note that matplotlib prefers numbers, but other plotting packages can handle categorical data in ways that will make your life a lot easier.)

Lately I am using seaborn, which doesn't seem to have the same kinds of problems with 'objects'. Specifically, you might want to take a look at seaborn's factorplot (renamed catplot in seaborn 0.9). Seaborn also has easy options for color palettes, so that might solve more than one of these issues for you.
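
Seaborn's categorical plots generally expect long-form data, with one row per measurement. The sketch below shows one possible reshape with `DataFrame.melt` followed by a `stripplot` call; it is an illustration rather than the answerer's code, and the helper names `to_long` and `plot_long` as well as the column names are assumptions.

```python
import pandas as pd

# Sample values from the accepted answer; the column names are assumptions.
df = pd.DataFrame({
    "labels": ["HPRR1234", "HPRR4321", "HPRR2345", "HPRR2345"],
    "g": ["KRAS", "KRAS", "ELK4", "ELK4"],
    "r1": [15, 9, 15, 1],
    "r2": [14, 8, 7, 0],
    "r3": [14, 16, 9, 12],
})


def to_long(frame):
    """Reshape one-column-per-ratio into one-row-per-measurement."""
    return frame.melt(id_vars=["labels", "g"], value_vars=["r1", "r2", "r3"],
                      var_name="ratio", value_name="value")


def plot_long(long_frame):
    """Draw the points with the string IDs used directly as x categories."""
    # Imported here so the reshaping step also works without seaborn installed.
    import seaborn as sns
    return sns.stripplot(data=long_frame, x="labels", y="value", hue="ratio")


long_df = to_long(df)
```

Calling `plot_long(long_df)` then draws every ratio at its ID's category on the x axis, coloured by ratio column, without any manual mapping from IDs to numbers.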

Some pandas tricks you might want to try, if you haven't already:

Converting your code objects explicitly to strings:

df['code_as_word'] = df['secretcodenumber'].astype(str)

Or drop the letters, as you suggested, and convert objects to numeric instead:

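# convert_objects was deprecated and then removed in pandas 1.0;
# on current versions, pd.to_numeric does the conversion instead.
# The column names below are assumed from the example data above.
df[['r1', 'r2', 'r3']] = df[['r1', 'r2', 'r3']].apply(pd.to_numeric, errors='coerce')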