Pandas stack/groupby to make a new dataframe
Original source: http://stackoverflow.com/questions/18743340/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverFlow
Asked by jonas
I have a problem creating and rearranging a dataset. I looked at the pandas groupby function and think it might help me do it, but I am too inexperienced to make it happen. I have created an example of my problem below. My df:
     vehicle  color  a  b  c  d  A1  A2  A3  B1  B2  B3  C1  C2  C3  D1  D2  D3
resp                                                                           
1       bike  green  5  4  1  3   3   4   5   3   5   3 NaN NaN NaN NaN NaN NaN
2       walk    red  5  3  3  3   4   5   3   3   5   4 NaN NaN NaN NaN NaN NaN
3        car  green  4  2  3  3   4   3   5   4   5   5 NaN NaN NaN NaN NaN NaN
4        car   blue  4  5  4  4 NaN NaN NaN NaN NaN NaN   5   5   5   3   3   4
5        bus  black  2  4  4  3 NaN NaN NaN   2   3   3   2   2   1 NaN NaN NaN
6        car    red  4  2  3  3   3   4   4 NaN NaN NaN   4   4   4 NaN NaN NaN
7        bus   blue  5  5  2  3   3   3   5   4   3   2 NaN NaN NaN NaN NaN NaN
8       walk    red  3  3  4  3 NaN NaN NaN   5   5   5   5   3   3 NaN NaN NaN
9        car   blue  5  3  4  3   3   3   3 NaN NaN NaN   4   3   4 NaN NaN NaN
The dataset contains respondents and their answers to a questionnaire. What I would like to do is make a new dataframe with resp as index and the respondents' answers rearranged. The data in columns a, b, c, d, vehicle and color are stacked for each respondent (I hope that's the right way to express it) in the new dataframe. The values from the capital-letter columns (A1 to D3) go into the new frame under the columns BL_val1 to BL_val3. Only the values whose capital letter (A1-D3) corresponds to the small letter (a, b, c, d) are filled in; the rest are NaN.
I would like to create a new dataframe from this, and it should look like:
ds:
     vehicle  color sl  sl_val  BL_val1  BL_val2  BL_val3
resp                                                     
1       bike  green  a       5        3        4        5
1       bike  green  b       4        3        5        3
1       bike  green  c       1      NaN      NaN      NaN
1       bike  green  d       3      NaN      NaN      NaN
2       walk    red  a       5        4        5        3
2       walk    red  b       3        3        5        4
2       walk    red  c       3      NaN      NaN      NaN
2       walk    red  d       3      NaN      NaN      NaN
3        car  green  a       4        4        3        5
3        car  green  b       2        4        5        5
3        car  green  c       3      NaN      NaN      NaN
3        car  green  d       3      NaN      NaN      NaN
4        car   blue  a       4      NaN      NaN      NaN
4        car   blue  b       5      NaN      NaN      NaN
4        car   blue  c       4        5        5        5
4        car   blue  d       4        3        3        4
5        bus  black  a       2      NaN      NaN      NaN
5        bus  black  b       4        2        3        3
5        bus  black  c       4        2        2        1
5        bus  black  d       3      NaN      NaN      NaN
6        car    red  a       4        3        4        4
6        car    red  b       2      NaN      NaN      NaN
6        car    red  c       3        4        4        4
6        car    red  d       3      NaN      NaN      NaN
7        bus   blue  a       5        3        3        5
7        bus   blue  b       5        4        3        2
7        bus   blue  c       2      NaN      NaN      NaN
7        bus   blue  d       3      NaN      NaN      NaN
8       walk    red  a       3      NaN      NaN      NaN
8       walk    red  b       3        5        5        5
8       walk    red  c       4        5        3        3
8       walk    red  d       3      NaN      NaN      NaN
9        car   blue  a       5        3        3        3
9        car   blue  b       3      NaN      NaN      NaN
9        car   blue  c       4        4        3        4
9        car   blue  d       3      NaN      NaN      NaN
I really need some help with this; I can't figure it out!
Accepted answer by DSM
There might be a more pandorable way to do this, but I find that the pattern of using groupby to make the groups, performing explicit operations on them, and then recombining is often a simple way to get what I want. Sure, I could spend half an hour coming up with something more elegant, but then I wouldn't have any time to hang out on SO.
Anyway, how about something like this?
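import pandas as pd

# Move the identifying columns into the index so only the answer columns remain.
df = df.set_index(["resp", "vehicle", "color"])
# Group the columns by lowercase first letter: "a", "A1", "A2", "A3" -> "a", and so on.
# Note: groupby(axis=1) is deprecated in recent pandas (2.1+).
grouped = df.groupby(lambda x: x[0].lower(), axis=1)
new_grouped = []
for key, group in grouped:
    group = group.copy()  # work on a copy of the group before modifying it
    group.columns = ["sl_val"] + ["BL_val{}".format(i) for i in range(1, 4)]
    group["sl"] = key
    new_grouped.append(group)
df2 = pd.concat(new_grouped).reset_index()
# Sort so that each respondent's rows sit together, in sl order.
df2 = df2.sort_values(["resp", "vehicle", "color", "sl"]).set_index("resp")
df2 = df2[["vehicle", "color", "sl"] + [k for k in df2.columns if "_" in k]]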
Starting from
>>> df = df.set_index(["resp", "vehicle", "color"])
>>> df
                    a  b  c  d  A1  A2  A3  B1  B2  B3  C1  C2  C3  D1  D2  D3
resp vehicle color                                                            
1    bike    green  5  4  1  3   3   4   5   3   5   3 NaN NaN NaN NaN NaN NaN
2    walk    red    5  3  3  3   4   5   3   3   5   4 NaN NaN NaN NaN NaN NaN
3    car     green  4  2  3  3   4   3   5   4   5   5 NaN NaN NaN NaN NaN NaN
4    car     blue   4  5  4  4 NaN NaN NaN NaN NaN NaN   5   5   5   3   3   4
5    bus     black  2  4  4  3 NaN NaN NaN   2   3   3   2   2   1 NaN NaN NaN
6    car     red    4  2  3  3   3   4   4 NaN NaN NaN   4   4   4 NaN NaN NaN
7    bus     blue   5  5  2  3   3   3   5   4   3   2 NaN NaN NaN NaN NaN NaN
8    walk    red    3  3  4  3 NaN NaN NaN   5   5   5   5   3   3 NaN NaN NaN
9    car     blue   5  3  4  3   3   3   3 NaN NaN NaN   4   3   4 NaN NaN NaN
We can group by the starting letter in lowercase:
>>> grouped = df.groupby(lambda x: x[0].lower(), axis=1)
Producing a bunch of groups which look like:
>>> next(iter(grouped))
('a',                     a  A1  A2  A3
resp vehicle color               
1    bike    green  5   3   4   5
2    walk    red    5   4   5   3
3    car     green  4   4   3   5
4    car     blue   4 NaN NaN NaN
5    bus     black  2 NaN NaN NaN
6    car     red    4   3   4   4
7    bus     blue   5   3   3   5
8    walk    red    3 NaN NaN NaN
9    car     blue   5   3   3   3)
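To see why the groups come out this way, it can help to apply the same key function to the column labels by hand. The sketch below is only an illustration: the names `columns`, `key` and `groups` are mine, not part of the answer, and it uses plain Python rather than pandas.

```python
# Column labels of the example frame once resp/vehicle/color are in the index.
columns = ["a", "b", "c", "d",
           "A1", "A2", "A3", "B1", "B2", "B3",
           "C1", "C2", "C3", "D1", "D2", "D3"]

# The same key function that is passed to groupby: lowercase first character.
key = lambda label: label[0].lower()

# Collect the labels that share a key, preserving their original order.
groups = {}
for label in columns:
    groups.setdefault(key(label), []).append(label)

for name, members in groups.items():
    print(name, members)
```

Each key ends up with one lowercase column followed by its three uppercase companions, which is why every group can be renamed to the same four column names.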
Then we simply change the names, add the "sl" column, and recombine them using pd.concat. The last two lines simply put the rows and columns in your desired order.
Final result:
>>> df2
     vehicle  color sl  sl_val  BL_val1  BL_val2  BL_val3
resp                                                     
1       bike  green  a       5        3        4        5
1       bike  green  b       4        3        5        3
1       bike  green  c       1      NaN      NaN      NaN
1       bike  green  d       3      NaN      NaN      NaN
2       walk    red  a       5        4        5        3
2       walk    red  b       3        3        5        4
2       walk    red  c       3      NaN      NaN      NaN
2       walk    red  d       3      NaN      NaN      NaN
3        car  green  a       4        4        3        5
3        car  green  b       2        4        5        5
3        car  green  c       3      NaN      NaN      NaN
3        car  green  d       3      NaN      NaN      NaN
4        car   blue  a       4      NaN      NaN      NaN
4        car   blue  b       5      NaN      NaN      NaN
4        car   blue  c       4        5        5        5
4        car   blue  d       4        3        3        4
5        bus  black  a       2      NaN      NaN      NaN
5        bus  black  b       4        2        3        3
5        bus  black  c       4        2        2        1
5        bus  black  d       3      NaN      NaN      NaN
6        car    red  a       4        3        4        4
6        car    red  b       2      NaN      NaN      NaN
6        car    red  c       3        4        4        4
6        car    red  d       3      NaN      NaN      NaN
7        bus   blue  a       5        3        3        5
7        bus   blue  b       5        4        3        2
7        bus   blue  c       2      NaN      NaN      NaN
7        bus   blue  d       3      NaN      NaN      NaN
8       walk    red  a       3      NaN      NaN      NaN
8       walk    red  b       3        5        5        5
8       walk    red  c       4        5        3        3
8       walk    red  d       3      NaN      NaN      NaN
9        car   blue  a       5        3        3        3
9        car   blue  b       3      NaN      NaN      NaN
9        car   blue  c       4        4        3        4
9        car   blue  d       3      NaN      NaN      NaN
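For readers on a recent pandas, where grouping along columns with `groupby(axis=1)` is deprecated, here is a minimal sketch of the same split/rename/recombine idea. It is not the answer's code: it swaps the column-wise groupby for a boolean column selection per key, and it uses only a three-respondent excerpt (resp 1, 4 and 5) of the example data. The names `wide`, `keys` and `pieces` are my own.

```python
import numpy as np
import pandas as pd

nan = np.nan
# Excerpt of the example data: respondents 1, 4 and 5.
df = pd.DataFrame({
    "resp": [1, 4, 5],
    "vehicle": ["bike", "car", "bus"],
    "color": ["green", "blue", "black"],
    "a": [5, 4, 2], "b": [4, 5, 4], "c": [1, 4, 4], "d": [3, 4, 3],
    "A1": [3, nan, nan], "A2": [4, nan, nan], "A3": [5, nan, nan],
    "B1": [3, nan, 2], "B2": [5, nan, 3], "B3": [3, nan, 3],
    "C1": [nan, 5, 2], "C2": [nan, 5, 2], "C3": [nan, 5, 1],
    "D1": [nan, 3, nan], "D2": [nan, 3, nan], "D3": [nan, 4, nan],
})

wide = df.set_index(["resp", "vehicle", "color"])

# One key per column: the lowercase first letter of its label.
keys = wide.columns.str[0].str.lower()

pieces = []
for key in keys.unique():
    # Select the four columns that share this key (e.g. a, A1, A2, A3).
    piece = wide.loc[:, keys == key].copy()
    piece.columns = ["sl_val"] + ["BL_val{}".format(i) for i in range(1, 4)]
    piece["sl"] = key
    pieces.append(piece)

# Recombine, then order the rows by respondent and letter.
df2 = (
    pd.concat(pieces)
    .reset_index()
    .sort_values(["resp", "sl"])
    .set_index("resp")
)
df2 = df2[["vehicle", "color", "sl", "sl_val", "BL_val1", "BL_val2", "BL_val3"]]
print(df2)
```

Sorting on `sl` as well as `resp` makes the row order explicit instead of relying on the order in which the pieces were concatenated.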
Answer by Andy Hayden
A naive way would be to write a helper function to extract the relevant sub-DataFrames:
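In [11]: def get_letter(df, letter):
             # Keep the id columns, the lowercase column, and its three uppercase companions.
             cols = ['vehicle', 'color', letter] + [letter.upper() + str(i) for i in range(1, 4)]
             res = df.loc[:, cols].copy()  # copy so the assignments below don't touch df
             res.columns = ['vehicle', 'color', 'sl_val', 'BL_val1', 'BL_val2', 'BL_val3']
             res['sl'] = letter
             return res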
In [12]: get_letter(df, 'a')
Out[12]: 
     vehicle  color  sl_val  BL_val1  BL_val2  BL_val3 sl
resp                                                     
1       bike  green       5        3        4        5  a
2       walk    red       5        4        5        3  a
3        car  green       4        4        3        5  a
4        car   blue       4      NaN      NaN      NaN  a
5        bus  black       2      NaN      NaN      NaN  a
6        car    red       4        3        4        4  a
7        bus   blue       5        3        3        5  a
8       walk    red       3      NaN      NaN      NaN  a
9        car   blue       5        3        3        3  a
Then concatenating these gives the result you're after (possibly with a sort):
In [13]: pd.concat([get_letter(df, letter) for letter in 'abcd'])
In [14]: pd.concat([get_letter(df, letter) for letter in 'abcd']).sort_index(kind='mergesort')
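Put together, the helper approach might look like the following self-contained sketch. It assumes `resp` is already the index, as in the question, and uses only respondents 1 and 4 from the example data. Sorting on a temporary (resp, sl) index and reordering the columns at the end are my additions to match the requested layout, not part of the answer.

```python
import numpy as np
import pandas as pd

nan = np.nan
# Excerpt of the example data (respondents 1 and 4), indexed by resp.
df = pd.DataFrame(
    {
        "vehicle": ["bike", "car"],
        "color": ["green", "blue"],
        "a": [5, 4], "b": [4, 5], "c": [1, 4], "d": [3, 4],
        "A1": [3, nan], "A2": [4, nan], "A3": [5, nan],
        "B1": [3, nan], "B2": [5, nan], "B3": [3, nan],
        "C1": [nan, 5], "C2": [nan, 5], "C3": [nan, 5],
        "D1": [nan, 3], "D2": [nan, 3], "D3": [nan, 4],
    },
    index=pd.Index([1, 4], name="resp"),
)

def get_letter(df, letter):
    """Return the sub-frame for one lowercase letter, with generic column names."""
    cols = ["vehicle", "color", letter] + [letter.upper() + str(i) for i in range(1, 4)]
    res = df.loc[:, cols].copy()
    res.columns = ["vehicle", "color", "sl_val", "BL_val1", "BL_val2", "BL_val3"]
    res["sl"] = letter
    return res

result = pd.concat([get_letter(df, letter) for letter in "abcd"])

# Sort by (resp, sl) so the row order does not depend on sort stability.
result = result.set_index("sl", append=True).sort_index().reset_index("sl")

# Put the columns in the order shown in the question.
result = result[["vehicle", "color", "sl", "sl_val", "BL_val1", "BL_val2", "BL_val3"]]
print(result)
```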

