Python pandas - Merge nearly duplicate rows based on column value

Warning: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/36271413/

Date: 2020-08-19 17:40:28  Source: igfitidea


Tags: python, pandas

Asked by Matthew Rosenthal

I have a pandas dataframe with several rows that are near duplicates of each other, except for one value. My goal is to merge or "coalesce" these rows into a single row, without summing the numerical values.


Here is an example of what I'm working with:


Name   Sid   Use_Case  Revenue
A      xx01  Voice     .00
A      xx01  SMS       .00
B      xx02  Voice     .00
C      xx03  Voice     .00
C      xx03  SMS       .00
C      xx03  Video     .00

And here is what I would like:


Name   Sid   Use_Case            Revenue
A      xx01  Voice, SMS          .00
B      xx02  Voice               .00
C      xx03  Voice, SMS, Video   .00

The reason I don't want to sum the "Revenue" column is because my table is the result of doing a pivot over several time periods where "Revenue" simply ends up getting listed multiple times instead of having a different value per "Use_Case".


What would be the best way to tackle this issue? I've looked into the groupby() function but I still don't understand it very well.


Answered by jezrael

I think you can use groupby with aggregate: 'first' for the columns that stay constant, and the custom function ', '.join for Use_Case:


df = df.groupby('Name').agg({'Sid':'first', 
                             'Use_Case': ', '.join, 
                             'Revenue':'first' }).reset_index()

# change column order
print(df[['Name','Sid','Use_Case','Revenue']])
  Name   Sid           Use_Case Revenue
0    A  xx01         Voice, SMS     .00
1    B  xx02              Voice     .00
2    C  xx03  Voice, SMS, Video     .00

Nice idea from comment, thanks Goyo:


df = df.groupby(['Name','Sid','Revenue'])['Use_Case'].apply(', '.join).reset_index()

# change column order
print(df[['Name','Sid','Use_Case','Revenue']])
  Name   Sid           Use_Case Revenue
0    A  xx01         Voice, SMS     .00
1    B  xx02              Voice     .00
2    C  xx03  Voice, SMS, Video     .00
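For reference, here is a self-contained sketch of the same groupby idea, built on the sample data from the question (the Revenue values are placeholders, as in the question itself):

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'Name': ['A', 'A', 'B', 'C', 'C', 'C'],
    'Sid': ['xx01', 'xx01', 'xx02', 'xx03', 'xx03', 'xx03'],
    'Use_Case': ['Voice', 'SMS', 'Voice', 'Voice', 'SMS', 'Video'],
    'Revenue': [0.0] * 6,
})

# Group on the columns that should stay constant and join the varying one.
# Within each group, groupby preserves the original row order.
out = (df.groupby(['Name', 'Sid', 'Revenue'], as_index=False)['Use_Case']
         .agg(', '.join))
out = out[['Name', 'Sid', 'Use_Case', 'Revenue']]
print(out)
```

Grouping on Name, Sid, and Revenue together (rather than aggregating Sid/Revenue with 'first') is safer when those columns really are constant per Name, since any inconsistency would then show up as extra rows instead of being silently discarded.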

Answered by Eric Ed Lohmar

I was using some code that I didn't think was optimal and eventually found jezrael's answer. But after using it and running a timeit test, I actually went back to what I was doing, which was:


# Collect the Use_Case values for each Name into a dict of lists
cmnts = {}
for i, row in df.iterrows():
    while True:
        try:
            if row['Use_Case']:
                cmnts[row['Name']].append(row['Use_Case'])

            else:
                cmnts[row['Name']].append('n/a')

            break

        except KeyError:
            # First time this Name is seen: create its list, then retry
            cmnts[row['Name']] = []
df.drop_duplicates('Name', inplace=True)
df['Use_Case'] = ['; '.join(v) for v in cmnts.values()]

According to my 100-run timeit test, the iterate-and-replace method is an order of magnitude faster than the groupby method.


import pandas as pd
from my_stuff import time_something

df = pd.DataFrame({'a': [i / (i % 4 + 1) for i in range(1, 10001)],
                   'b': [i for i in range(1, 10001)]})

runs = 100

interim_dict = 'txt = {}\n' \
               'for i, row in df.iterrows():\n' \
               '    try:\n' \
               "        txt[row['a']].append(row['b'])\n\n" \
               '    except KeyError:\n' \
               "        txt[row['a']] = []\n" \
               "df.drop_duplicates('a', inplace=True)\n" \
               "df['b'] = ['; '.join(v) for v in txt.values()]"

grouping = "new_df = df.groupby('a')['b'].apply(str).apply('; '.join).reset_index()"

print(time_something(interim_dict, runs, beg_string='Interim Dict', glbls=globals()))
print(time_something(grouping, runs, beg_string='Group By', glbls=globals()))

yields:


Interim Dict
  Total: 59.1164s
  Avg: 591163748.5887ns

Group By
  Total: 430.6203s
  Avg: 4306203366.1827ns

where time_something is a function which times a snippet with timeit and returns the result in the above format.

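The time_something helper above comes from the answerer's own module and isn't shown. A standard-library-only comparison can be sketched with timeit directly; make_df, interim_dict, and grouping below are illustrative names, and both snippets are adapted so they produce the same result (unlike the benchmarked strings above, where the two snippets compute different things):

```python
import timeit
import pandas as pd

def make_df():
    # 1000 rows, 100 distinct keys, string values to join
    return pd.DataFrame({'a': [i % 100 for i in range(1000)],
                         'b': [str(i) for i in range(1000)]})

def interim_dict(df):
    # Iterate rows, accumulating each key's values in a plain dict
    txt = {}
    for _, row in df.iterrows():
        txt.setdefault(row['a'], []).append(row['b'])
    out = df.drop_duplicates('a').copy()
    out['b'] = ['; '.join(v) for v in txt.values()]
    return out.reset_index(drop=True)

def grouping(df):
    # Let pandas group and join
    return df.groupby('a')['b'].apply('; '.join).reset_index()

# Sanity check: both approaches agree before timing them
a = interim_dict(make_df())
b = grouping(make_df())
assert a.equals(b)

t1 = timeit.timeit(lambda: interim_dict(make_df()), number=5)
t2 = timeit.timeit(lambda: grouping(make_df()), number=5)
print(f'interim dict: {t1:.3f}s, groupby: {t2:.3f}s')
```

Which side wins can depend on the data shape and pandas version, so it is worth re-running a comparison like this on your own data before committing to the iterrows approach.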

Answered by Ami Tavory

You can groupby and apply the list function:


>>> df['Use_Case'].groupby([df.Name, df.Sid, df.Revenue]).apply(list).reset_index()
  Name   Sid Revenue                    0
0    A  xx01     .00         [Voice, SMS]
1    B  xx02     .00              [Voice]
2    C  xx03     .00  [Voice, SMS, Video]

(In case you are concerned about duplicates, use set instead of list.)

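To see the difference, a small sketch on hypothetical data with a repeated value:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['A', 'A', 'A'],
                   'Use_Case': ['Voice', 'SMS', 'Voice']})  # 'Voice' appears twice

as_list = df.groupby('Name')['Use_Case'].apply(list)
as_set = df.groupby('Name')['Use_Case'].apply(set)

print(as_list['A'])  # ['Voice', 'SMS', 'Voice'] -- keeps the duplicate
print(as_set['A'])   # {'Voice', 'SMS'} -- drops it (element order may vary)
```

Note that a set also loses the original ordering, so if you want deduplication while preserving order, a dict-based approach such as `list(dict.fromkeys(values))` is a common alternative.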