How to plot parallel coordinates on a pandas DataFrame with some columns containing strings?

Question

Asked by Cedric Zoppolo

I would like to plot parallel coordinates for a pandas DataFrame in which some columns contain numbers and other columns contain strings.

Problem description

I have the following test code, which works for plotting parallel coordinates with numbers:

import pandas as pd
import matplotlib.pyplot as plt
# pandas 0.9.0 import path; recent pandas versions expose this as pandas.plotting
from pandas.tools.plotting import parallel_coordinates

df = pd.DataFrame([["line 1",20,30,100],\
    ["line 2",10,40,90],["line 3",10,35,120]],\
    columns=["element","var 1","var 2","var 3"])
parallel_coordinates(df,"element")
plt.show()

This ends up showing the following graphic:

However, what I would like to do is add some variables that contain strings to my plot. But when I run the following code:

df2 = pd.DataFrame([["line 1",20,30,100,"N"],\
    ["line 2",10,40,90,"N"],["line 3",10,35,120,"N-1"]],\
    columns=["element","var 1","var 2","var 3","regime"])
parallel_coordinates(df2,"element")
plt.show()

I get this error:

ValueError: invalid literal for float(): N

I suppose this means the parallel_coordinates function does not accept strings.
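
The error message suggests that the values are passed through float() at some point. As a minimal sketch of that failure mode (the helper name can_plot_as_number is made up for illustration and is not part of pandas), a number converts cleanly while a label such as "N" raises ValueError:

```python
def can_plot_as_number(value):
    # Hypothetical helper: reports whether float() accepts the value.
    try:
        float(value)
        return True
    except ValueError:
        return False

print(can_plot_as_number(20))     # a numeric cell converts fine
print(can_plot_as_number("N"))    # a string label does not
```

So any column that holds labels has to be left out or converted to numbers before plotting.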

Example of what I am trying to do

I am attempting to do something like this example, where Race and Sex are strings and not numbers:

Question

Is there any way to produce such a graphic using pandas parallel_coordinates? If not, how could I attempt such a graphic? Maybe with matplotlib?

I must mention that I am particularly looking for a solution under Python 2.5 with pandas version 0.9.0.

Answer 1

Answered by Diziet Asahi

It wasn't entirely clear to me what you wanted to do with the regime column.

If the problem was just that its presence prevented the plot from showing, then you could simply omit the offending columns from the plot:

parallel_coordinates(df2, class_column='element', cols=['var 1', 'var 2', 'var 3'])

Looking at the example you provided, I then understood that you want each categorical variable to be placed on a vertical line, with each value of the category represented by a different y-value. Am I getting this right?

If I am, then you need to encode your categorical variables (here, regime) as numerical values. To do this, I used this tip I found on this website.

df2.regime = df2.regime.astype('category')
df2['regime_encoded'] = df2.regime.cat.codes


print(df2)
    element var 1   var 2   var 3   regime  regime_encoded
0   line 1  20      30      100     N       0
1   line 2  10      40      90      N       0
2   line 3  10      35      120     N-1     1

This code creates a new column (regime_encoded) in which each value of the category regime is coded as an integer. You can then plot your new dataframe, including the newly created column:

parallel_coordinates(df2[['element', 'var 1', 'var 2', 'var 3', 'regime_encoded']],"element")

The problem is that the encoding values for the categorical variable (0, 1) bear no relation to the range of your other variables, so all the lines seem to converge on the same point. The fix is to scale the encoding to the range of your data (here I did it very simply because your data is bounded between 0 and 120; if that is not the case in your real dataframe, you probably need to scale from the minimum value).

df2['regime_encoded'] = df2.regime.cat.codes * max(df2.max(axis=1, numeric_only=True))
parallel_coordinates(df2[['element', 'var 1', 'var 2', 'var 3', 'regime_encoded']],"element")
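
If your real data does not start near 0, one way to scale from the minimum value is sketched below. This is only an illustration: the names numeric_cols, lowest, highest and n_steps are mine, and it assumes a pandas version that has the .cat accessor. The codes are spread evenly between the smallest and largest numeric values, so the categorical axis covers the same span as the other axes.

```python
import pandas as pd

df2 = pd.DataFrame(
    [["line 1", 20, 30, 100, "N"],
     ["line 2", 10, 40, 90, "N"],
     ["line 3", 10, 35, 120, "N-1"]],
    columns=["element", "var 1", "var 2", "var 3", "regime"])

numeric_cols = ["var 1", "var 2", "var 3"]
lowest = df2[numeric_cols].min().min()     # smallest numeric value in the plot
highest = df2[numeric_cols].max().max()    # largest numeric value in the plot

# Integer code per category, as floats so the scaled result is not truncated.
codes = df2["regime"].astype("category").cat.codes.astype(float)

# Guard against division by zero when there is only one category.
n_steps = max(codes.max(), 1)

# Map code 0 to the lowest value and the last code to the highest value.
df2["regime_encoded"] = lowest + codes / n_steps * (highest - lowest)
print(df2[["regime", "regime_encoded"]])
```

The resulting column can be passed to parallel_coordinates in the same way as the regime_encoded column above.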

To fit with your example better, you can add annotations:

df2['regime_encoded'] = df2.regime.cat.codes * max(df2.max(axis=1, numeric_only=True))
parallel_coordinates(df2[['element', 'var 1', 'var 2', 'var 3', 'regime_encoded']],"element")
ax = plt.gca()
for i,(label,val) in df2.loc[:,['regime','regime_encoded']].drop_duplicates().iterrows():
    ax.annotate(label, xy=(3,val), ha='left', va='center')
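
As a side note, the integer codes can be mapped back to their labels through cat.categories, which helps when checking which y-value belongs to which label. A minimal sketch, assuming a pandas version with the .cat accessor (the variable name code_to_label is mine):

```python
import pandas as pd

regime = pd.Series(["N", "N", "N-1"]).astype("category")

# cat.categories lists the labels in code order: position 0 is code 0, etc.
code_to_label = dict(enumerate(regime.cat.categories))

print(regime.cat.codes.tolist())
print(code_to_label)
```

Multiplying a code by the same scaling factor used for regime_encoded gives the y-position of its label.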

Answer 2

Answered by Cedric Zoppolo

Based on @Diziet's answer, to get the desired graph under Python 2.5 we can use the following code:

import pandas as pd
import matplotlib.pyplot as plt
from pandas.tools.plotting import parallel_coordinates

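def encode_regime(value):
    # Map each regime label to an integer code; unknown labels become None.
    # (Named to avoid shadowing the built-ins format and input.)
    if value == "N":
        return 0
    elif value == "N-1":
        return 1
    return None

df2 = pd.DataFrame([["line 1",20,30,100,"N"],\
    ["line 2",10,40,90,"N"],["line 3",10,35,120,"N-1"]],\
    columns=["element","var 1","var 2","var 3","regime"])
# Scale the codes by the largest numeric value so the new axis spans the plot.
df2["regime_encoded"] = df2["regime"].apply(encode_regime) * max(df2[["var 1","var 2","var 3"]].max(axis=1))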

parallel_coordinates(df2[['element', 'var 1', 'var 2', 'var 3', 'regime_encoded']],"element")
ax = plt.gca()
for i,(label,val) in df2.ix[:,['regime','regime_encoded']].drop_duplicates().iterrows():
    ax.annotate(label, xy=(3,val), ha='left', va='center')

plt.show()

This ends up showing the following graph:
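
The mapping in the code above is hard-coded for the two labels "N" and "N-1". If more labels are expected, a small helper can assign the codes automatically without relying on the category dtype. This is only a sketch: build_label_codes is a made-up name, and codes are assigned in order of first appearance rather than alphabetically.

```python
def build_label_codes(labels):
    # Assign 0, 1, 2, ... to each distinct label in order of first appearance.
    codes = {}
    for label in labels:
        if label not in codes:
            codes[label] = len(codes)
    return codes

regimes = ["N", "N", "N-1"]
codes = build_label_codes(regimes)
encoded = [codes[r] for r in regimes]
print(codes)
print(encoded)
```

The resulting dictionary can then be used with apply, for example df2["regime"].apply(lambda r: codes[r]), in place of the hand-written function.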
