复杂的(对我来说)在 Pandas 中从宽到长的重塑

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/17688155/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-09-13 20:59:36  来源:igfitidea点击:

Complicated (for me) reshaping from wide to long in Pandas

pythonpandasreshape

提问by cd98

Individuals (indexed from 0 to 5) choose between two locations: A and B. My data has a wide format containing characteristics that vary by individual (ind_var) and characteristics that vary only by location (location_var).

个人(索引从 0 到 5)在两个位置之间进行选择:A 和 B。我的数据具有广泛的格式,其中包含因个人而异的特征 (ind_var) 和仅因位置而异的特征 (location_var)。

For example, I have:

例如,我有:

In [281]:

df_reshape_test = pd.DataFrame( {'location' : ['A', 'A', 'A', 'B', 'B', 'B'], 'dist_to_A' : [0, 0, 0, 50, 50, 50], 'dist_to_B' : [50, 50, 50, 0, 0, 0], 'location_var': [10, 10, 10, 14, 14, 14], 'ind_var': [3, 8, 10, 1, 3, 4]})

df_reshape_test

Out[281]:
    dist_to_A   dist_to_B   ind_var location location_var
0    0            50             3   A       10
1    0            50             8   A       10
2    0            50            10   A       10
3    50           0              1   B       14
4    50           0              3   B       14
5    50           0              4   B       14

The variable 'location' is the one chosen by the individual. dist_to_A is the distance to location A from the location chosen by the individual (same thing with dist_to_B)

变量“位置”是个人选择的变量。dist_to_A 是从个人选择的位置到位置 A 的距离(与 dist_to_B 相同)

I'd like my data to have this form:

我希望我的数据具有以下形式:

    choice  dist_S  ind_var location    location_var
0    1        0       3         A           10
0    0       50       3         B           14
1    1        0       8         A           10
1    0       50       8         B           14
2    1        0      10         A           10
2    0       50      10         B           14
3    0       50       1         A           10
3    1        0       1         B           14
4    0       50       3         A           10
4    1        0       3         B           14
5    0       50       4         A           10
5    1        0       4         B           14

where choice == 1 indicates individual has chosen that location and dist_S is the distance from the location chosen.

其中 choice == 1 表示个人选择了该位置, dist_S 是与所选位置的距离。

I read about the .stackmethod but couldn't figure out how to apply it for this case. Thanks for your time!

我阅读了.stack方法,但无法弄清楚如何将它应用于这种情况。谢谢你的时间!

NOTE: this is just a simple example. The datasets I'm looking have varying numbers of location and number of individuals per location, so I'm looking for a flexible solution if possible

注意:这只是一个简单的例子。我正在查看的数据集具有不同数量的位置和每个位置的个人数量,因此如果可能,我正在寻找灵活的解决方案

采纳答案by Zhen Sun

In fact, pandas has a wide_to_longcommand that can conveniently do what you intend to do.

实际上,pandas 有一个wide_to_long命令可以方便地做你想做的事。

df = pd.DataFrame( {'location' : ['A', 'A', 'A', 'B', 'B', 'B'], 
                'dist_to_A' : [0, 0, 0, 50, 50, 50], 
                'dist_to_B' : [50, 50, 50, 0, 0, 0], 
                'location_var': [10, 10, 10, 14, 14, 14], 
                'ind_var': [3, 8, 10, 1, 3, 4]})

df['ind'] = df.index

#The `location` and `location_var` corresponds to the choices, 
#record them as dictionaries and drop them 
#(Just realized you had a cleaner way, copied from yous). 

ind_to_loc = dict(df['location'])
loc_dict = dict(df.groupby('location').agg(lambda x : int(np.mean(x)))['location_var'])
df.drop(['location_var', 'location'], axis = 1, inplace = True)
# now reshape
df_long = pd.wide_to_long(df, ['dist_to_'], i = 'ind', j = 'location') 

# use the dictionaries to get variables `choice` and `location_var` back.

df_long['choice'] = df_long.index.map(lambda x: ind_to_loc[x[0]])
df_long['location_var'] = df_long.index.map(lambda x : loc_dict[x[1]])
print df_long.sort()

This gives you the table you asked for:

这为您提供了您要求的表格:

              ind_var  dist_to_ choice  location_var
ind location                                        
0   A               3         0      A            10
    B               3        50      A            14
1   A               8         0      A            10
    B               8        50      A            14
2   A              10         0      A            10
    B              10        50      A            14
3   A               1        50      B            10
    B               1         0      B            14
4   A               3        50      B            10
    B               3         0      B            14
5   A               4        50      B            10
    B               4         0      B            14

Of course you can generate a choice variable that takes 0and 1if that's what you want.

当然,你可以生成一个选择变量,需要01如果这就是你想要的。

回答by TomAugspurger

I'm a bit curious why you'd like it in the format. There's probably a much better way to store your data. But here goes.

我有点好奇你为什么喜欢这种格式。可能有更好的方法来存储您的数据。但在这里。

In [137]: import numpy as np

In [138]: import pandas as pd

In [139]: df_reshape_test = pd.DataFrame( {'location' : ['A', 'A', 'A', 'B', 'B
', 'B'], 'dist_to_A' : [0, 0, 0, 50, 50, 50], 'dist_to_B' : [50, 50, 50, 0, 0, 
0], 'location_var': [10, 10, 10, 14, 14, 14], 'ind_var': [3, 8, 10, 1, 3, 4]})

In [140]: print(df_reshape_test)
   dist_to_A  dist_to_B  ind_var location  location_var
0          0         50        3        A            10
1          0         50        8        A            10
2          0         50       10        A            10
3         50          0        1        B            14
4         50          0        3        B            14
5         50          0        4        B            14

In [141]: # Get the new axis separately:

In [142]: idx = pd.Index(df_reshape_test.index.tolist() * 2)

In [143]: df2 = df_reshape_test[['ind_var', 'location', 'location_var']].reindex(idx)

In [144]: print(df2)
   ind_var location  location_var
0        3        A            10
1        8        A            10
2       10        A            10
3        1        B            14
4        3        B            14
5        4        B            14
0        3        A            10
1        8        A            10
2       10        A            10
3        1        B            14
4        3        B            14
5        4        B            14

In [145]: # Swap the location for the second half

In [146]: # replace any 6 with len(df) / 2 + 1 if you have more rows.d 

In [147]: df2['choice'] = [1] * 6 + [0] * 6  # may need to play with this.

In [148]: df2.iloc[6:].location.replace({'A': 'B', 'B': 'A'}, inplace=True)

In [149]: df2 = df2.sort()

In [150]: df2['dist_S'] = np.abs((df2.choice - 1) * 50)

In [151]: print(df2)
   ind_var location  location_var  choice  dist_S
0        3        A            10       1       0
0        3        B            10       0      50
1        8        A            10       1       0
1        8        B            10       0      50
2       10        A            10       1       0
2       10        B            10       0      50
3        1        B            14       1       0
3        1        A            14       0      50
4        3        B            14       1       0
4        3        A            14       0      50
5        4        B            14       1       0
5        4        A            14       0      50

It's not going to generalize well, but there are probably alternative (better) ways to get around the uglier parts like generating the choice col.

它不会很好地概括,但可能有替代(更好)的方法来绕过丑陋的部分,例如生成选择 col。

回答by cd98

Ok, this took longer that I expected, but here's a more general answer that works with an arbitrary number of choices per individual. I'm sure there are simpler ways, so it would be great if somebody can chime in with something better for some of the following code.

好的,这比我预期的要花费更长的时间,但这里有一个更通用的答案,适用于每个人的任意数量的选择。我确信有更简单的方法,所以如果有人可以为以下某些代码提供更好的方法,那就太好了。

df = pd.DataFrame( {'location' : ['A', 'A', 'A', 'B', 'B', 'B'], 'dist_to_A' : [0, 0, 0, 50, 50, 50], 'dist_to_B' : [50, 50, 50, 0, 0, 0], 'location_var': [10, 10, 10, 14, 14, 14], 'ind_var': [3, 8, 10, 1, 3, 4]})

which gives

这使

    dist_to_A   dist_to_B   ind_var location   location_var
0    0           50          3     A            10
1    0           50          8     A            10
2    0           50         10     A            10
3    50          0           1     B            14
4    50          0           3     B            14
5    50          0           4     B            14

Then we do:

然后我们这样做:

df.index.names = ['ind']

# Add choice var

df['choice'] = 1

# Create dictionaries we'll use later

ind_to_loc = dict(df['location'])
# gives ind_to_loc equal to {0 : 'A', 1 : 'A', 2 : 'A', 3 : 'B', 4 : 'B', 5: 'B'}

ind_dict = dict(df['ind_var'])
#gives  { 0: 3, 1 : 8, 2 : 10, 3: 1, 4 : 3, 5: 4}

loc_dict = dict(  df.groupby('location').agg(lambda x : int(np.mean(x)) )['location_var']  )
# gives  {'A' : 10, 'B' : 14}

Now I create a Multi-Index and do a re-index to get a long shape

现在我创建一个多索引并重新索引以获得长形状

df = df.set_index( [df.index, df['location']] )

df.index.names = ['ind', 'location']

# re-index to long shape

loc_list = ['A', 'B']
ind_list = [0, 1, 2, 3, 4, 5]
new_shape = [  (ind, loc) for ind in ind_list for loc in loc_list]
idx = pd.Index(new_shape)
df_long = df.reindex(idx, method = None)
df_long.index.names = ['ind', 'loc']

The long shape looks like this:

长形看起来像这样:

         dist_to_A  dist_to_B  ind_var location  location_var  choice
ind loc                                                              
0   A            0         50        3        A            10       1
    B          NaN        NaN      NaN      NaN           NaN     NaN
1   A            0         50        8        A            10       1
    B          NaN        NaN      NaN      NaN           NaN     NaN
2   A            0         50       10        A            10       1
    B          NaN        NaN      NaN      NaN           NaN     NaN
3   A          NaN        NaN      NaN      NaN           NaN     NaN
    B           50          0        1        B            14       1
4   A          NaN        NaN      NaN      NaN           NaN     NaN
    B           50          0        3        B            14       1
5   A          NaN        NaN      NaN      NaN           NaN     NaN
    B           50          0        4        B            14       1

So now fill the NaN values with the dictionaries:

所以现在用字典填充 NaN 值:

df_long['ind_var'] = df_long.index.map(lambda x : ind_dict[x[0]] )
df_long['location']  = df_long.index.map(lambda x : ind_to_loc[x[0]] )
df_long['location_var'] = df_long.index.map(lambda x : loc_dict[x[1]] )

# Fill in choice
df_long['choice'] = df_long['choice'].fillna(0)

Finally, all that is left is creating dist_S
I'll cheat here and assume I can create a nested dictionary like this one

最后,剩下的就是创建 dist_S
我会在这里作弊并假设我可以创建一个像这样的嵌套字典

nested_loc = {'A' : {'A' : 0, 'B' : 50}, 'B' : {'A' : 50, 'B' : 0}}

(This reads: if you're in location A, then location A is at 0 km and location B at 50 km)

(内容如下:如果您在位置 A,则位置 A 位于 0 公里,位置 B 位于 50 公里)

def nested_f(x):    
    return nested_loc[x[0]][x[1]]

df_long = df_long.reset_index()
df_long['dist_S'] = df_long[['loc', 'location']].apply(nested_f, axis=1)

df_long = df_long.drop(['dist_to_A', 'dist_to_B', 'location'], axis = 1 )

df_long

gives the desired result

给出想要的结果

    ind loc ind_var location_var    choice  dist_S
0    0   A   3         10            1      0
1    0   B   3         14            0      50
2    1   A   8         10            1      0
3    1   B   8         14            0      50
4    2   A   10        10            1      0
5    2   B   10        14            0      50
6    3   A   1         10            0      50
7    3   B   1         14            1      0
8    4   A   3         10            0      50
9    4   B   3         14            1      0
10   5   A   4         10            0      50
11   5   B   4         14            1      0