pandas: replacing values with groupby means
Original question: http://stackoverflow.com/questions/14760757/
Note: this content comes from Stack Overflow and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors on Stack Overflow and keep the same license.
Replacing values with groupby means
Asked by Def_Os
I have a DataFrame with a column that has some bad data with various negative values. I would like to replace values < 0 with the mean of the group that they are in.
If the bad values were missing values (NAs), I would do:
data = df.groupby(['GroupID']).column
data.transform(lambda x: x.fillna(x.mean()))
But how do I do this for a condition like x < 0?
Thanks!
Accepted answer by unutbu
Using @AndyHayden's example, you could use groupby/transform with a replace function:
import pandas as pd

df = pd.DataFrame([[1,1],[1,-1],[2,1],[2,2]], columns=list('ab'))
print(df)
#    a  b
# 0  1  1
# 1  1 -1
# 2  2  1
# 3  2  2

data = df.groupby(['a'])

def replace(group):
    mask = group<0
    # Select those values where it is < 0, and replace
    # them with the mean of the values which are not < 0.
    group[mask] = group[~mask].mean()
    return group

print(data.transform(replace))
#    b
# 0  1
# 1  1
# 2  1
# 3  2
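As an alternative sketch (not part of the original answer), the same replacement can probably be done without modifying the group inside the function. It assumes the example frame above, hides the negative values as NaN with `Series.where`, takes the per-group mean with `transform("mean")`, and fills the gaps. The names `valid`, `group_mean`, and `result` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame([[1, 1], [1, -1], [2, 1], [2, 2]], columns=list("ab"))

# Hide the negative entries as NaN so they do not contribute to the mean.
valid = df["b"].where(df["b"] >= 0)

# Mean of the valid values in each row's group (NaN is skipped by default).
group_mean = valid.groupby(df["a"]).transform("mean")

# Keep the valid values and fill the hidden ones with their group mean.
result = valid.fillna(group_mean)
print(result.tolist())
```

Because NaN is introduced along the way, the result is a float column; cast it back with `astype` if integers are required and every filled value is whole.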
Answer by Andy Hayden
Here's one way to do it (for the 'b' column, in this boring example):
In [1]: df = pd.DataFrame([[1,1],[1,-1],[2,1],[2,2]], columns=list('ab'))
In [2]: df
Out[2]:
   a  b
0  1  1
1  1 -1
2  2  1
3  2  2
Replace the negative values with NaN, then calculate the mean of b in each group:
In [3]: df['b'] = df.b.apply(lambda x: x if x >= 0 else float('nan'))  # pd.np was removed in pandas 2.0
In [4]: m = df.groupby('a').mean().b
Then use apply across each row to replace each NaN with its group's mean:
In [5]: df['b'] = df.apply(lambda row: m[row['a']]
                                       if pd.isnull(row['b'])
                                       else row['b'],
                           axis=1)
In [6]: df
Out[6]:
   a  b
0  1  1
1  1  1
2  2  1
3  2  2
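The same steps can be sketched without a row-wise apply, which tends to be slow on large frames. This is a hedged variant rather than the answer's own code: it assumes the same example frame and uses `Series.mask`, a grouped mean, and `Series.map` to look up each row's group mean.

```python
import pandas as pd

df = pd.DataFrame([[1, 1], [1, -1], [2, 1], [2, 2]], columns=list("ab"))

# Step 1: turn negative values into NaN.
df["b"] = df["b"].mask(df["b"] < 0)

# Step 2: mean of b per group; NaN is ignored by default.
m = df.groupby("a")["b"].mean()

# Step 3: look up each row's group mean by its key and fill the gaps.
df["b"] = df["b"].fillna(df["a"].map(m))
print(df)
```

Here `df["a"].map(m)` builds a Series aligned with `df` that holds the group mean for every row, so `fillna` only touches the rows that were negative.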
Answer by YOBEN_S
There is a great example in the pandas cookbook that covers your question.
import pandas as pd

df = pd.DataFrame({'A' : [1, 1, 2, 2], 'B' : [1, -1, 1, 2]})
gb = df.groupby('A')

def replace(g):
    # Replace negative entries with the mean of the non-negative ones.
    mask = g < 0
    g.loc[mask] = g[~mask].mean()
    return g

gb.transform(replace)
Link: http://pandas.pydata.org/pandas-docs/stable/cookbook.html
Answer by solub
I had the same issue and came up with a rather simple solution:
func = lambda x : np.where(x < 0, x.mean(), x)
df['Bad_Column'].transform(func)
Note that if you want the mean of the valid values only (that is, the mean computed from the non-negative values), you have to specify:
func = lambda x : np.where(x < 0, x.mask(x < 0).mean(), x)
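The snippets above work on the whole column. A per-group variant, closer to what the question asks for, might look like the sketch below. The sample data, the helper name `replace_negatives`, and the output column `Fixed` are hypothetical; `GroupID` and `Bad_Column` are the column names used in the question and in this answer.

```python
import pandas as pd

# Hypothetical sample data for illustration.
df = pd.DataFrame({
    "GroupID": ["x", "x", "x", "y", "y"],
    "Bad_Column": [2.0, -5.0, 4.0, -1.0, 10.0],
})

def replace_negatives(x):
    # Mean of the non-negative values in this group only.
    good_mean = x.mask(x < 0).mean()
    # Replace the entries where x < 0 with that mean.
    return x.mask(x < 0, good_mean)

df["Fixed"] = df.groupby("GroupID")["Bad_Column"].transform(replace_negatives)
print(df)
```

Using `Series.mask` instead of `np.where` keeps the result a pandas Series with its original index, which is what a grouped `transform` expects to get back.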

