pandas: combining duplicate columns in a DataFrame
Disclaimer: this page reproduces a popular Stack Overflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and authors, and attribute it to the original authors (not me): Stack Overflow
Original source: http://stackoverflow.com/questions/13078751/
Combine duplicated columns within a DataFrame
Asked by Kyle Brandt
If I have a dataframe in which several columns share the same name, is there a way to combine the columns that have the same name with some sort of function (e.g. sum)?
For instance, with:
In [186]:
df["NY-WEB01"].head()
Out[186]:
NY-WEB01 NY-WEB01
DateTime
2012-10-18 16:00:00 5.6 2.8
2012-10-18 17:00:00 18.6 12.0
2012-10-18 18:00:00 18.4 12.0
2012-10-18 19:00:00 18.2 12.0
2012-10-18 20:00:00 19.2 12.0
How might I collapse the NY-WEB01 columns (there are a bunch of duplicate columns, not just NY-WEB01) by summing each row where the column name is the same?
Answered by meteore
I believe this does what you are after:
df.groupby(lambda x:x, axis=1).sum()
Alternatively, this version is between 3% and 15% faster, depending on the length of the df:
df.groupby(df.columns, axis=1).sum()
EDIT: To extend this beyond sums, use .agg() (short for .aggregate()):
df.groupby(df.columns, axis=1).agg(numpy.max)
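Recent pandas releases deprecate the axis=1 argument of groupby, so the one-liners above may warn or fail depending on the installed version. Below is a minimal sketch of one workaround, transposing before and after grouping. The sample frame and its NY-WEB02 column are hypothetical stand-ins, not the asker's data:

```python
import pandas as pd

# Hypothetical sample frame with a duplicated column label.
df = pd.DataFrame(
    [[5, 2, 1], [18, 12, 2]],
    columns=["NY-WEB01", "NY-WEB01", "NY-WEB02"],
)

# Transpose so the duplicated column labels become row labels,
# group on those labels, reduce, then transpose back.
summed = df.T.groupby(level=0).sum().T
maxed = df.T.groupby(level=0).max().T

print(summed)
print(maxed)
```

Any other GroupBy reduction (mean, min, or agg with a custom function) should fit the same pattern in place of sum.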
Answered by cs95
v0.20+ Answer: GroupBy with level and axis arguments
You don't need a lambda here, nor do you have to query df.columns explicitly; groupby accepts a level argument that you can specify in conjunction with the axis argument. This is cleaner, IMO.
# Setup
np.random.seed(0)
df = pd.DataFrame(np.random.choice(50, (5, 5)), columns=list('AABBB'))
df
A A B B B
0 44 47 0 3 3
1 39 9 19 21 36
2 23 6 24 24 12
3 1 38 39 23 46
4 24 17 37 25 13
df.groupby(level=0, axis=1).sum()
A B
0 91 6
1 48 76
2 29 60
3 39 108
4 41 75
Handling MultiIndex columns
Another case to consider is a DataFrame with MultiIndex columns. Consider:
df.columns = pd.MultiIndex.from_arrays([['one']*3 + ['two']*2, df.columns])
df
one two
A A B B B
0 44 47 0 3 3
1 39 9 19 21 36
2 23 6 24 24 12
3 1 38 39 23 46
4 24 17 37 25 13
To aggregate across the upper level (that is, to group on the lower-level labels only), use
df.groupby(level=1, axis=1).sum()
A B
0 91 6
1 48 76
2 29 60
3 39 108
4 41 75
or, to aggregate duplicates only within each upper-level group, use
df.groupby(level=[0, 1], axis=1).sum()
one two
A B B
0 91 0 6
1 48 19 57
2 29 24 36
3 39 39 69
4 41 37 38
Alternate Interpretation: Dropping Duplicate Columns
If you came here looking for a way to simply drop duplicate columns (without performing any aggregation), use Index.duplicated:
df.loc[:,~df.columns.duplicated()]
A B
0 44 0
1 39 19
2 23 24
3 1 39
4 24 37
Or, to keep the last ones, specify keep='last' (the default is 'first'):
df.loc[:,~df.columns.duplicated(keep='last')]
A B
0 47 3
1 9 36
2 6 12
3 38 46
4 17 13
The groupby alternatives for the two solutions above are df.groupby(level=0, axis=1).first() and ... .last(), respectively.
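One caveat when choosing between the two approaches: GroupBy.first and GroupBy.last skip missing values by default, whereas the duplicated mask is purely positional. The results therefore match only when the duplicated columns contain no NaN. A small sketch on a hypothetical frame, grouping through a transpose so it does not rely on axis=1:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: the first "A" column has a missing value in row 0.
df = pd.DataFrame(
    [[np.nan, 47.0, 0.0], [39.0, 9.0, 19.0]],
    columns=list("AAB"),
)

# Positional: keep the first column for each label, NaN included.
kept_first = df.loc[:, ~df.columns.duplicated()]

# Group-based: first non-null value per label within each row.
grouped_first = df.T.groupby(level=0).first().T

print(kept_first)
print(grouped_first)
```

In row 0 the mask keeps the NaN from the first A column, while the grouped version falls through to the value from the second A column.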
Answered by jezrael
Here is a possibly simpler solution for common aggregation functions such as sum, mean, median, max, min, and std: pass the level parameter together with axis=1 to work on columns:
#coldspeed samples
np.random.seed(0)
df = pd.DataFrame(np.random.choice(50, (5, 5)), columns=list('AABBB'))
print (df)
print (df.sum(axis=1, level=0))
A B
0 91 6
1 48 76
2 29 60
3 39 108
4 41 75
df.columns = pd.MultiIndex.from_arrays([['one']*3 + ['two']*2, df.columns])
print (df.sum(axis=1, level=1))
A B
0 91 6
1 48 76
2 29 60
3 39 108
4 41 75
print (df.sum(axis=1, level=[0,1]))
one two
A B B
0 91 0 6
1 48 19 57
2 29 24 36
3 39 39 69
4 41 37 38
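The level argument of reductions such as DataFrame.sum was deprecated in pandas 1.3 and removed in 2.0, so the calls above need an older release. The following sketch aims to reproduce the same column-wise results with groupby on the transposed frame. The data values are copied from the frame printed earlier rather than regenerated, and the variable names are illustrative only:

```python
import pandas as pd

# Values copied from the sample frame shown in the answers above.
data = [
    [44, 47, 0, 3, 3],
    [39, 9, 19, 21, 36],
    [23, 6, 24, 24, 12],
    [1, 38, 39, 23, 46],
    [24, 17, 37, 25, 13],
]
columns = pd.MultiIndex.from_arrays([["one"] * 3 + ["two"] * 2, list("AABBB")])
df = pd.DataFrame(data, columns=columns)

# Counterpart of df.sum(axis=1, level=1): group on the lower level only.
by_lower = df.T.groupby(level=1).sum().T

# Counterpart of df.sum(axis=1, level=[0, 1]): group on both levels.
by_both = df.T.groupby(level=[0, 1]).sum().T

print(by_lower)
print(by_both)
```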
It works the same way for the index; just use axis=0 instead of axis=1:
np.random.seed(0)
df = pd.DataFrame(np.random.choice(50, (5, 5)), columns=list('ABCDE'), index=list('aabbc'))
print (df)
A B C D E
a 44 47 0 3 3
a 39 9 19 21 36
b 23 6 24 24 12
b 1 38 39 23 46
c 24 17 37 25 13
print (df.min(axis=0, level=0))
A B C D E
a 39 9 0 3 3
b 1 6 24 23 12
c 24 17 37 25 13
df.index = pd.MultiIndex.from_arrays([['bar']*3 + ['foo']*2, df.index])
print (df.mean(axis=0, level=1))
A B C D E
a 41.5 28.0 9.5 12.0 19.5
b 12.0 22.0 31.5 23.5 29.0
c 24.0 17.0 37.0 25.0 13.0
print (df.max(axis=0, level=[0,1]))
A B C D E
bar a 44 47 19 21 36
b 23 6 24 24 12
foo b 1 38 39 23 46
c 24 17 37 25 13
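For the index case no transpose is needed, because groupby groups rows by default. A minimal sketch of the three reductions above written with groupby(level=...), which should also run on pandas versions where the level argument of min, mean, and max is no longer available. The data values are copied from the printed frame and the variable names are illustrative only:

```python
import pandas as pd

# Values copied from the sample frame printed above.
data = [
    [44, 47, 0, 3, 3],
    [39, 9, 19, 21, 36],
    [23, 6, 24, 24, 12],
    [1, 38, 39, 23, 46],
    [24, 17, 37, 25, 13],
]
df = pd.DataFrame(data, columns=list("ABCDE"), index=list("aabbc"))

# Counterpart of df.min(axis=0, level=0).
min_by_label = df.groupby(level=0).min()

df.index = pd.MultiIndex.from_arrays([["bar"] * 3 + ["foo"] * 2, df.index])

# Counterparts of df.mean(axis=0, level=1) and df.max(axis=0, level=[0, 1]).
mean_by_inner = df.groupby(level=1).mean()
max_by_both = df.groupby(level=[0, 1]).max()

print(min_by_label)
print(mean_by_inner)
print(max_by_both)
```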
If you need other functions such as first, last, size, or count, use the groupby approach from coldspeed's answer instead.

