Note: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverFlow. Original URL: http://stackoverflow.com/questions/46446956/

Time: 2020-09-14 04:32:12  Source: igfitidea

"ValueError: Length of values does not match length of index" when trying to modify column values of a pandas groupby

Tags: python, pandas, dataframe, group-by, pandas-groupby

Asked by cs95

I have a dataframe:

       A         C         D
0    one  0.410599 -0.205158
1    one  0.144044  0.313068
2    one  0.333674 -0.742165
3  three  0.761038 -2.552990
4  three  1.494079  2.269755
5    two  1.454274 -0.854096
6    two  0.121675  0.653619
7    two  0.443863  0.864436

Let's assume that A is the anchor column. I now want to display each group value only once, at the top:

       A         C         D
0    one  0.410599 -0.205158
1         0.144044  0.313068
2         0.333674 -0.742165
3  three  0.761038 -2.552990
4         1.494079  2.269755
5    two  1.454274 -0.854096
6         0.121675  0.653619
7         0.443863  0.864436

This is what I've come up with:

df['A'] = df.groupby('A', as_index=False)['A']\
        .apply(lambda x: x.str.replace('.*', '').set_value(0, x.values[0])).values

My strategy was to do a groupby and then set all values to an empty string other than the first. This doesn't seem to work, because I get:

ValueError: Length of values does not match length of index

Which means that the output I get is incorrect. Any ideas/suggestions/improvements welcome.

I should add that I am looking for a general solution that can single out values at the top, bottom, or middle of each group, so I'd prefer an answer that helps me do that. (The example above singles out values only at the top of each group; I want to be able to do the same at the bottom or in the middle.)

Answered by Bharath

Your method didn't work because of an index mismatch. When you group by 'A', each group keeps its original index labels. Since set_value(0) cannot find label 0 in the groups that don't contain it, it creates a new object with that label added, so the combined result is longer than the original column. That's the reason for the length mismatch.
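
The mismatch can be reproduced without set_value, which was deprecated and later removed from pandas. The following is a minimal sketch, assuming a recent pandas where label-based assignment through .loc enlarges a Series in the same way; the column values come from the question and the variable names are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({"A": ["one", "one", "one", "three", "three", "two", "two", "two"]})

lengths = {}
for name, group in df.groupby("A")["A"]:
    # Blank copy of the group that keeps the group's original index labels
    blanked = pd.Series("", index=group.index)
    # Label 0 exists only in group "one"; in the other groups a new entry is appended
    blanked.loc[0] = name
    lengths[name] = len(blanked)

print(lengths)
print(sum(lengths.values()), "values for", len(df), "rows")
```

Two of the three groups come back one element longer than they went in, so the concatenated values no longer fit the column.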

Fix 1
reset_index(drop=True)

df['A'] = df.groupby('A')['A'].apply(lambda x: x.str.replace('.*', '')\
                      .reset_index(drop=True).set_value(0, x.values[0])).values
df

       A         C         D
0    one  0.410599 -0.205158
1         0.144044  0.313068
2         0.333674 -0.742165
3  three  0.761038 -2.552990
4         1.494079  2.269755
5    two  1.454274 -0.854096
6         0.121675  0.653619
7         0.443863  0.864436


Fix 2
set_value

set_value has a 3rd parameter called takeable, which determines how the index is treated. It is False by default (the first argument is a label), but setting it to True (the first argument is a position) worked for my case.
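
Since set_value is no longer available in later pandas releases, the same positional idea can be sketched with .iloc. This is an assumption about how one might port the fix, not code from the original answer; the sample Series is made up to mirror group 'three' from the question.

```python
import pandas as pd

# One group as groupby would hand it over: original labels 3 and 4 are kept
group = pd.Series(["three", "three"], index=[3, 4])

by_position = pd.Series("", index=group.index)
# .iloc addresses position 0 regardless of the labels, so nothing is appended
by_position.iloc[0] = group.iloc[0]

print(by_position.tolist())
print(len(by_position) == len(group))
```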

In addition to Zero's solutions, the solution for isolating values at the centre of their groups is as follows:

df.A = df.groupby('A')['A'].apply(lambda x: x.str.replace('.*', '')\
                           .set_value(len(x) // 2, x.values[0], True)).values

df

       A         C         D
0         0.410599 -0.205158
1    one  0.144044  0.313068
2         0.333674 -0.742165
3         0.761038 -2.552990
4  three  1.494079  2.269755
5         1.454274 -0.854096
6    two  0.121675  0.653619
7         0.443863  0.864436
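
For the top, middle, and bottom cases together, one possible approach that avoids apply and set_value is to compare each row's position within its group against a target position. This is a sketch under assumptions, not part of the original answers: keep_one_per_group and its where argument are hypothetical names, and the data is the question's frame without column D.

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["one", "one", "one", "three", "three", "two", "two", "two"],
    "C": [0.410599, 0.144044, 0.333674, 0.761038, 1.494079, 1.454274, 0.121675, 0.443863],
})

def keep_one_per_group(frame, col, where="top"):
    """Blank every value of `col` except one per group (hypothetical helper)."""
    grouped = frame.groupby(col)[col]
    pos = grouped.cumcount()          # 0, 1, 2, ... within each group
    size = grouped.transform("size")  # group length, repeated on every row
    target = {"top": size * 0, "middle": size // 2, "bottom": size - 1}[where]
    # Keep the value where the row sits at the target position, blank it elsewhere
    return frame[col].where(pos == target, "")

for where in ("top", "middle", "bottom"):
    print(where, keep_one_per_group(df, "A", where).tolist())
```

The helper returns a new Series instead of modifying the frame, so the result can be assigned back with df['A'] = keep_one_per_group(df, 'A', 'middle') once it looks right.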

Answered by Zero

Since the values are sorted, use the duplicated method for the first and last cases.

Keep First

In [4233]: df.loc[df.A.duplicated(keep='first'), 'A'] = ''

In [4234]: df
Out[4234]:
       A         C         D
0    one  0.410599 -0.205158
1         0.144044  0.313068
2         0.333674 -0.742165
3  three  0.761038 -2.552990
4         1.494079  2.269755
5    two  1.454274 -0.854096
6         0.121675  0.653619
7         0.443863  0.864436


Keep Last

In [4236]: df.loc[df.A.duplicated(keep='last'), 'A'] = ''

In [4237]: df
Out[4237]:
       A         C         D
0         0.410599 -0.205158
1         0.144044  0.313068
2    one  0.333674 -0.742165
3         0.761038 -2.552990
4  three  1.494079  2.269755
5         1.454274 -0.854096
6         0.121675  0.653619
7    two  0.443863  0.864436
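
The duplicated approach relies on the values being sorted, so that each value forms a single block. The sketch below illustrates why, using a made-up Series in which 'one' appears in two separate runs; the shift-based variant is an addition for comparison, not part of the original answer.

```python
import pandas as pd

s = pd.Series(["one", "one", "three", "one", "one"])

# duplicated() looks at the whole column: only the very first "one" survives
whole_column = s.mask(s.duplicated(keep="first"), "")

# Comparing each row with the previous one keeps the first value of every run
per_run = s.where(s.ne(s.shift()), "")

print(whole_column.tolist())
print(per_run.tolist())
```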