基于列标签在 Pandas 中重塑数据框
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/14916358/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
Reshaping dataframes in pandas based on column labels
提问by
What is the best way to reshape the following dataframe in pandas? This DataFrame dfhas x,yvalues for each sample (s1and s2in this case) and looks like this:
在Pandas中重塑以下数据框的最佳方法是什么?这个数据帧df具有x,y值对于每个样品(s1和s2在这种情况下),看起来像这样:
In [23]: df = pandas.DataFrame({"s1_x": scipy.randn(10), "s1_y": scipy.randn(10), "s2_x": scipy.randn(10), "s2_y": scipy.randn(10)})
In [24]: df
Out[24]:
s1_x s1_y s2_x s2_y
0 0.913462 0.525590 -0.377640 0.700720
1 0.723288 -0.691715 0.127153 0.180836
2 0.181631 -1.090529 -1.392552 1.530669
3 0.997414 -1.486094 1.207012 0.376120
4 -0.319841 0.195289 -1.034683 0.286073
5 1.085154 -0.619635 0.396867 0.623482
6 1.867816 -0.928101 -0.491929 -0.955295
7 0.920658 -1.132057 1.701582 -0.110299
8 -0.241853 -0.129702 -0.809852 0.014802
9 -0.019523 -0.578930 0.803688 -0.881875
s1_xand s1_yare the x/y values for sample 1, s2_x, s2_yare the sample values for sample 2, etc. How can this be reshaped into a DataFrame containing only x, ycolumns but that contains an additional column samplethat says for each row in the DataFrame whether it's from s1or s2? E.g.
s1_x并且s1_y是样本 1 的 x/y 值,s2_x, s2_y是样本 2 的样本值等。如何将其重塑为仅包含x,y列但包含一个附加列sample的数据帧,该列表示数据帧中的每一行是否来自s1或者s2?例如
x y sample
0 0.913462 0.525590 s1
1 0.723288 -0.691715 s1
2 0.181631 -1.090529 s1
3 0.997414 -1.486094 s1
...
5 0.396867 0.623482 s2
...
This is useful for plotting things with Rpy2 later on, since many R plotting features can make use of this grouping variable, so that's my motivation for reshaping the dataframe.
这对于稍后使用 Rpy2 绘图很有用,因为许多 R 绘图功能可以利用这个分组变量,所以这就是我重塑数据框的动机。
I think the answer given by Chang She doesn't translate to dataframes that have a unique index, like this one:
我认为 Chang She 给出的答案并没有转化为具有唯一索引的数据帧,如下所示:
In [636]: df = pandas.DataFrame({"s1_x": scipy.randn(10), "s1_y": scipy.randn(10), "s2_x": scipy.randn(10), "s2_y": scipy.randn(10), "names": range(10)})
In [637]: df
Out[637]:
names s1_x s1_y s2_x s2_y
0 0 0.672298 0.415366 1.034770 0.556209
1 1 0.067087 -0.851028 0.053608 -0.276461
2 2 -0.674174 -0.099015 0.864148 -0.067240
3 3 0.542996 -0.813018 2.283530 2.793727
4 4 0.216633 -0.091870 -0.746411 -0.421852
5 5 0.141301 -1.537721 -0.371601 -1.594634
6 6 1.267148 -0.833120 0.369516 -0.671627
7 7 -0.231163 -0.557398 1.123155 0.865140
8 8 1.790570 -0.428563 0.668987 0.632409
9 9 -0.820315 -0.894855 0.673247 -1.195831
In [638]: df.columns = pandas.MultiIndex.from_tuples([tuple(c.split('_')) for c in df.columns])
In [639]: df.stack(0).reset_index(1)
Out[639]:
level_1 x y
0 s1 0.672298 0.415366
0 s2 1.034770 0.556209
1 s1 0.067087 -0.851028
1 s2 0.053608 -0.276461
2 s1 -0.674174 -0.099015
2 s2 0.864148 -0.067240
3 s1 0.542996 -0.813018
3 s2 2.283530 2.793727
4 s1 0.216633 -0.091870
4 s2 -0.746411 -0.421852
5 s1 0.141301 -1.537721
5 s2 -0.371601 -1.594634
6 s1 1.267148 -0.833120
6 s2 0.369516 -0.671627
7 s1 -0.231163 -0.557398
7 s2 1.123155 0.865140
8 s1 1.790570 -0.428563
8 s2 0.668987 0.632409
9 s1 -0.820315 -0.894855
9 s2 0.673247 -1.195831
The transformation worked but in the process the column "names"was lost. How can I keep the "names"column in the df while still doing the melting transformation on the columns that have _in their names? The "names"column just assigns a unique name to each row in the dataframe. It's numeric here for example but in my data they are string identifiers.
转换成功了,但在此过程中色谱柱"names"丢失了。如何将"names"列保留在 df 中,同时仍_对其名称中的列进行熔化转换?该"names"列只是为数据框中的每一行分配一个唯一的名称。例如,这里是数字,但在我的数据中,它们是字符串标识符。
thanks.
谢谢。
采纳答案by Chang She
I'm assuming you already have the DataFrame. In which case you can just turn the columns into a MultiIndex and use stack then reset_index. Note that you'll then have to rename and reorder the columns and sort by sample to get exactlywhat you posted in the question:
我假设您已经拥有 DataFrame。在这种情况下,您可以将列转换为 MultiIndex 并使用堆栈然后使用 reset_index。请注意,然后您必须重命名和重新排序列并按样本排序以准确获取您在问题中发布的内容:
In [4]: df = pandas.DataFrame({"s1_x": scipy.randn(10), "s1_y": scipy.randn(10), "s2_x": scipy.randn(10), "s2_y": scipy.randn(10)})
In [5]: df.columns = pandas.MultiIndex.from_tuples([tuple(c.split('_')) for c in df.columns])
In [6]: df.stack(0).reset_index(1)
Out[6]:
level_1 x y
0 s1 0.897994 -0.278357
0 s2 -0.008126 -1.701865
1 s1 -1.354633 -0.890960
1 s2 -0.773428 0.003501
2 s1 -1.499422 -1.518993
2 s2 0.240226 1.773427
3 s1 -1.090921 0.847064
3 s2 -1.061303 1.557871
4 s1 -1.697340 -0.160952
4 s2 -0.930642 0.182060
5 s1 -0.356076 -0.661811
5 s2 0.539875 -1.033523
6 s1 -0.687861 -1.450762
6 s2 0.700193 0.658959
7 s1 -0.130422 -0.826465
7 s2 -0.423473 -1.281856
8 s1 0.306983 0.433856
8 s2 0.097279 -0.256159
9 s1 0.498057 0.147243
9 s2 1.312578 0.111837
You can save the MultiIndex conversion if you can just create the DataFrame with a MultiIndex instead.
如果您可以使用 MultiIndex 创建 DataFrame,则可以保存 MultiIndex 转换。
Edit: use merge to join original ids back in
编辑:使用合并将原始 ID 重新加入
In [59]: df
Out[59]:
names s1_x s1_y s2_x s2_y
0 0 0.732099 0.018387 0.299856 0.737142
1 1 0.914755 -0.798159 -0.732868 -1.279311
2 2 -1.063558 0.161779 -0.115751 -0.251157
3 3 -1.185501 0.095147 -1.343139 -0.003084
4 4 0.622400 -0.299726 0.198710 -0.383060
5 5 0.179318 0.066029 -0.635507 1.366786
6 6 -0.820099 0.066067 1.113402 0.002872
7 7 0.711627 -0.182925 1.391194 -2.788434
8 8 -1.124092 1.303375 0.202691 -0.225993
9 9 -0.179026 0.847466 -1.480708 -0.497067
In [60]: id = df.ix[:, ['names']]
In [61]: df.columns = pandas.MultiIndex.from_tuples([tuple(c.split('_')) for c in df.columns])
In [62]: pandas.merge(df.stack(0).reset_index(1), id, left_index=True, right_index=True)
Out[62]:
level_1 x y names
0 s1 0.732099 0.018387 0
0 s2 0.299856 0.737142 0
1 s1 0.914755 -0.798159 1
1 s2 -0.732868 -1.279311 1
2 s1 -1.063558 0.161779 2
2 s2 -0.115751 -0.251157 2
3 s1 -1.185501 0.095147 3
3 s2 -1.343139 -0.003084 3
4 s1 0.622400 -0.299726 4
4 s2 0.198710 -0.383060 4
5 s1 0.179318 0.066029 5
5 s2 -0.635507 1.366786 5
6 s1 -0.820099 0.066067 6
6 s2 1.113402 0.002872 6
7 s1 0.711627 -0.182925 7
7 s2 1.391194 -2.788434 7
8 s1 -1.124092 1.303375 8
8 s2 0.202691 -0.225993 8
9 s1 -0.179026 0.847466 9
9 s2 -1.480708 -0.497067 9
Alternatively:
或者:
In [64]: df
Out[64]:
names s1_x s1_y s2_x s2_y
0 0 0.744742 -1.123403 0.212736 0.005440
1 1 0.465075 -0.673491 1.467156 -0.176298
2 2 -1.111566 0.168043 -0.102142 -1.072461
3 3 1.226537 -1.147357 -1.583762 -1.236582
4 4 1.137675 0.224422 0.738988 1.528416
5 5 -0.237014 -1.110303 -0.770221 1.389714
6 6 -0.659213 2.305374 -0.326253 1.416778
7 7 1.524214 -0.395451 -1.884197 0.524606
8 8 0.375112 -0.622555 0.295336 0.927208
9 9 1.168386 -0.291899 -1.462098 0.250889
In [65]: df = df.set_index('names')
In [66]: df.columns = pandas.MultiIndex.from_tuples([tuple(c.split('_')) for c in df.columns])
In [67]: df.stack(0).reset_index(1)
Out[67]:
level_1 x y
names
0 s1 0.744742 -1.123403
0 s2 0.212736 0.005440
1 s1 0.465075 -0.673491
1 s2 1.467156 -0.176298
2 s1 -1.111566 0.168043
2 s2 -0.102142 -1.072461
3 s1 1.226537 -1.147357
3 s2 -1.583762 -1.236582
4 s1 1.137675 0.224422
4 s2 0.738988 1.528416
5 s1 -0.237014 -1.110303
5 s2 -0.770221 1.389714
6 s1 -0.659213 2.305374
6 s2 -0.326253 1.416778
7 s1 1.524214 -0.395451
7 s2 -1.884197 0.524606
8 s1 0.375112 -0.622555
8 s2 0.295336 0.927208
9 s1 1.168386 -0.291899
9 s2 -1.462098 0.250889

