pandas: merging duplicate columns in a DataFrame

Question

Asked by Kyle Brandt

If I have a DataFrame in which several columns share the same name, is there a way to combine the columns that have the same name using some function (e.g. sum)?

For instance with:

In [186]: df["NY-WEB01"].head()
Out[186]:
                     NY-WEB01  NY-WEB01
DateTime
2012-10-18 16:00:00       5.6       2.8
2012-10-18 17:00:00      18.6      12.0
2012-10-18 18:00:00      18.4      12.0
2012-10-18 19:00:00      18.2      12.0
2012-10-18 20:00:00      19.2      12.0

How might I collapse the NY-WEB01 columns (there are a bunch of duplicate columns, not just NY-WEB01) by summing each row where the column name is the same?

Answer 1

Answered by meteore

I believe this does what you are after:

df.groupby(lambda x:x, axis=1).sum()

Alternatively, this version is between 3% and 15% faster, depending on the length of the DataFrame:

df.groupby(df.columns, axis=1).sum()

EDIT: To extend this beyond sums, use .agg() (short for .aggregate()):

df.groupby(df.columns, axis=1).agg(numpy.max)
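
Newer pandas releases deprecate passing axis=1 to groupby (since pandas 2.1, as far as I know). Below is a minimal sketch of the same idea that avoids axis=1 by transposing, grouping on the index, and transposing back. The small frame and the names summed and maxed are made up for illustration and are not part of the original answer:

```python
import pandas as pd

# Hypothetical frame with a duplicated column name "A".
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=["A", "A", "B"])

# Transpose so the column labels become the index, group equal labels,
# aggregate, then transpose back to the original orientation.
summed = df.T.groupby(level=0).sum().T
maxed = df.T.groupby(level=0).agg("max").T

print(summed)
print(maxed)
```

Transposing copies the data and can change dtypes when the columns have mixed types, so this form is best suited to homogeneous numeric frames.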

Answer 2

Answered by cs95

v0.20+ Answer: `GroupBy` with `level` and `axis` arguments

You don't need a lambda here, nor do you have to query df.columns explicitly; groupby accepts a level argument that you can specify together with the axis argument. This is cleaner, IMO.

# Setup
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.choice(50, (5, 5)), columns=list('AABBB'))
df

    A   A   B   B   B
0  44  47   0   3   3
1  39   9  19  21  36
2  23   6  24  24  12
3   1  38  39  23  46
4  24  17  37  25  13

df.groupby(level=0, axis=1).sum()

    A    B
0  91    6
1  48   76
2  29   60
3  39  108
4  41   75

Handling `MultiIndex` columns

Another case to consider is a DataFrame with MultiIndex columns. Consider:

df.columns = pd.MultiIndex.from_arrays([['one']*3 + ['two']*2, df.columns])
df
  one         two    
    A   A   B   B   B
0  44  47   0   3   3
1  39   9  19  21  36
2  23   6  24  24  12
3   1  38  39  23  46
4  24  17  37  25  13

To aggregate across the upper level, grouping only on the inner level, use

df.groupby(level=1, axis=1).sum()

    A    B
0  91    6
1  48   76
2  29   60
3  39  108
4  41   75

or, to aggregate within each upper level only, use

df.groupby(level=[0, 1], axis=1).sum()

  one     two
    A   B   B
0  91   0   6
1  48  19  57
2  29  24  36
3  39  39  69
4  41  37  38
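
For readers on a pandas version where groupby(..., axis=1) is deprecated or unavailable, the two MultiIndex aggregations above can be sketched with a transpose instead. This is an illustration under that assumption, not the answer's own code; the values are copied from the printed table rather than regenerated from the random seed, and the names by_inner and by_both are made up:

```python
import pandas as pd

# Values copied from the table shown above.
data = [
    [44, 47, 0, 3, 3],
    [39, 9, 19, 21, 36],
    [23, 6, 24, 24, 12],
    [1, 38, 39, 23, 46],
    [24, 17, 37, 25, 13],
]
columns = pd.MultiIndex.from_arrays([["one"] * 3 + ["two"] * 2, list("AABBB")])
df = pd.DataFrame(data, columns=columns)

# Group on the inner level only: the upper level is collapsed.
by_inner = df.T.groupby(level=1).sum().T

# Group on both levels: duplicates are merged only within each upper level.
by_both = df.T.groupby(level=[0, 1]).sum().T

print(by_inner)
print(by_both)
```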

Alternate Interpretation: Dropping Duplicate Columns

If you came here looking to find out how to simply drop duplicate columns (without performing any aggregation), use Index.duplicated:

df.loc[:,~df.columns.duplicated()]

    A   B
0  44   0
1  39  19
2  23  24
3   1  39
4  24  37

Or, to keep the last occurrence of each column, specify keep='last' (the default is 'first'):

df.loc[:,~df.columns.duplicated(keep='last')]

    A   B
0  47   3
1   9  36
2   6  12
3  38  46
4  17  13

The groupby alternatives for the two solutions above are df.groupby(level=0, axis=1).first() and ... .last(), respectively.
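
One caveat is worth checking before swapping one form for the other: GroupBy.first() and .last() return the first or last non-null value in each group, so the two approaches can give different results when the duplicated columns contain NaN. A small sketch on a made-up frame (the transposed form is used here only to avoid axis=1, and the variable names are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical frame: the first "A" column has a missing value in row 1.
df = pd.DataFrame(
    [[1.0, 2.0, 3.0], [np.nan, 5.0, 6.0]],
    columns=["A", "A", "B"],
)

# Keeps the first "A" column exactly as it is, NaN included.
kept_first = df.loc[:, ~df.columns.duplicated()]

# Takes the first non-null value per row across the "A" columns.
grouped_first = df.T.groupby(level=0).first().T

print(kept_first)
print(grouped_first)
```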

Answer 3

Answered by jezrael

Here is a possibly simpler solution for common aggregation functions such as sum, mean, median, max, min and std: just pass axis=1 to work on columns, together with the level parameter:

# Sample data from coldspeed's answer
np.random.seed(0)
df = pd.DataFrame(np.random.choice(50, (5, 5)), columns=list('AABBB'))
print(df)

print(df.sum(axis=1, level=0))
    A    B
0  91    6
1  48   76
2  29   60
3  39  108
4  41   75

df.columns = pd.MultiIndex.from_arrays([['one']*3 + ['two']*2, df.columns])

print (df.sum(axis=1, level=1))
    A    B
0  91    6
1  48   76
2  29   60
3  39  108
4  41   75

print (df.sum(axis=1, level=[0,1]))
  one     two
    A   B   B
0  91   0   6
1  48  19  57
2  29  24  36
3  39  39  69
4  41  37  38

It works similarly for the index; use axis=0 instead of axis=1:

np.random.seed(0)
df = pd.DataFrame(np.random.choice(50, (5, 5)), columns=list('ABCDE'), index=list('aabbc'))
print (df)
    A   B   C   D   E
a  44  47   0   3   3
a  39   9  19  21  36
b  23   6  24  24  12
b   1  38  39  23  46
c  24  17  37  25  13

print (df.min(axis=0, level=0))
    A   B   C   D   E
a  39   9   0   3   3
b   1   6  24  23  12
c  24  17  37  25  13

df.index = pd.MultiIndex.from_arrays([['bar']*3 + ['foo']*2, df.index])

print (df.mean(axis=0, level=1))
      A     B     C     D     E
a  41.5  28.0   9.5  12.0  19.5
b  12.0  22.0  31.5  23.5  29.0
c  24.0  17.0  37.0  25.0  13.0

print (df.max(axis=0, level=[0,1]))
        A   B   C   D   E
bar a  44  47  19  21  36
    b  23   6  24  24  12
foo b   1  38  39  23  46
    c  24  17  37  25  13
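
The level argument of reductions such as sum, min and mean was deprecated in pandas 1.3 and, as far as I know, removed in 2.0, so the calls above may fail on current versions. A minimal sketch of the index-based examples written with groupby(level=...) instead; the values are copied from the printed table rather than regenerated from the random seed, and the variable names are made up:

```python
import pandas as pd

# Values copied from the table shown above.
data = [
    [44, 47, 0, 3, 3],
    [39, 9, 19, 21, 36],
    [23, 6, 24, 24, 12],
    [1, 38, 39, 23, 46],
    [24, 17, 37, 25, 13],
]
df = pd.DataFrame(data, columns=list("ABCDE"), index=list("aabbc"))

# Equivalent of df.min(axis=0, level=0): merge rows sharing an index label.
min_by_label = df.groupby(level=0).min()

df.index = pd.MultiIndex.from_arrays([["bar"] * 3 + ["foo"] * 2, df.index])

# Equivalent of df.mean(axis=0, level=1) and df.max(axis=0, level=[0, 1]).
mean_by_inner = df.groupby(level=1).mean()
max_by_both = df.groupby(level=[0, 1]).max()

print(min_by_label)
print(mean_by_inner)
print(max_by_both)
```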

If you need other functions such as first, last, size or count, use coldspeed's answer instead.
