pandas - Merge nearly duplicate rows based on column value
Original question: http://stackoverflow.com/questions/36271413/
Note: this page reproduces a popular Stack Overflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep the same license, link to the original question, and attribute it to the original authors on Stack Overflow (not to the maintainer of this page).
Asked by Matthew Rosenthal
I have a pandas dataframe with several rows that are near duplicates of each other, except for one value. My goal is to merge or "coalesce" these rows into a single row, without summing the numerical values.
Here is an example of what I'm working with:
Name  Sid   Use_Case  Revenue
A     xx01  Voice     .00
A     xx01  SMS       .00
B     xx02  Voice     .00
C     xx03  Voice     .00
C     xx03  SMS       .00
C     xx03  Video     .00
And here is what I would like:
Name  Sid   Use_Case           Revenue
A     xx01  Voice, SMS         .00
B     xx02  Voice              .00
C     xx03  Voice, SMS, Video  .00
I don't want to sum the "Revenue" column because my table is the result of a pivot over several time periods, where "Revenue" simply ends up listed multiple times instead of having a different value per "Use_Case".
What would be the best way to tackle this issue? I've looked into the groupby() function but I still don't understand it very well.
Answered by jezrael
I think you can use groupby with aggregate, taking first for the repeated columns and the custom function ', '.join for Use_Case:
df = df.groupby('Name').agg({'Sid': 'first',
                             'Use_Case': ', '.join,
                             'Revenue': 'first'}).reset_index()
# change column order
print(df[['Name', 'Sid', 'Use_Case', 'Revenue']])
  Name   Sid           Use_Case Revenue
0    A  xx01         Voice, SMS     .00
1    B  xx02              Voice     .00
2    C  xx03  Voice, SMS, Video     .00
A nice idea from the comments, thanks Goyo:
df = df.groupby(['Name', 'Sid', 'Revenue'])['Use_Case'].apply(', '.join).reset_index()
# change column order
print(df[['Name', 'Sid', 'Use_Case', 'Revenue']])
  Name   Sid           Use_Case Revenue
0    A  xx01         Voice, SMS     .00
1    B  xx02              Voice     .00
2    C  xx03  Voice, SMS, Video     .00
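For a version that can be pasted and run as-is on Python 3, here is a minimal sketch of both approaches side by side. The Revenue figures are placeholders made up for illustration (the values in the tables above are truncated), and sort=False is an addition that keeps the groups in order of first appearance.

```python
import pandas as pd

# Sample data shaped like the question's table.
# The Revenue values are hypothetical placeholders, not the asker's data.
df = pd.DataFrame({
    'Name': ['A', 'A', 'B', 'C', 'C', 'C'],
    'Sid': ['xx01', 'xx01', 'xx02', 'xx03', 'xx03', 'xx03'],
    'Use_Case': ['Voice', 'SMS', 'Voice', 'Voice', 'SMS', 'Video'],
    'Revenue': [10.0, 10.0, 5.0, 15.0, 15.0, 15.0],
})

# Approach 1: group on Name, keep the first Sid/Revenue, join the use cases.
by_agg = (df.groupby('Name', sort=False)
            .agg({'Sid': 'first', 'Use_Case': ', '.join, 'Revenue': 'first'})
            .reset_index())

# Approach 2: group on every column that repeats, join only Use_Case.
by_apply = (df.groupby(['Name', 'Sid', 'Revenue'], sort=False)['Use_Case']
              .apply(', '.join)
              .reset_index())

cols = ['Name', 'Sid', 'Use_Case', 'Revenue']
print(by_agg[cols])
print(by_apply[cols])
```

The second form only gives one row per Name if Sid and Revenue really are identical within each Name; otherwise it produces one row per distinct combination.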
Answered by Eric Ed Lohmar
I was using some code that I didn't think was optimal and eventually found jezrael's answer. But after using it and running a timeit test, I actually went back to what I was doing, which was:
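cmnts = {}
for i, row in df.iterrows():
    # retry the append after creating the list for a Name seen for the first time
    while True:
        try:
            if row['Use_Case']:
                cmnts[row['Name']].append(row['Use_Case'])
            else:
                cmnts[row['Name']].append('n/a')
            break
        except KeyError:
            cmnts[row['Name']] = []
# keep the first row per Name, then overwrite Use_Case with the joined values
df.drop_duplicates('Name', inplace=True)
df['Use_Case'] = ['; '.join(v) for v in cmnts.values()]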
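The same dictionary idea can be sketched a little more compactly with collections.defaultdict, which removes the need for the retry loop. This is an illustrative variant rather than the answerer's code: the sample data is made up, and looking the joined strings up by Name (instead of relying on the dictionary's iteration order) is a design choice of this sketch.

```python
from collections import defaultdict

import pandas as pd

# Hypothetical sample data; the empty string stands in for a missing use case.
df = pd.DataFrame({
    'Name': ['A', 'A', 'B', 'C', 'C'],
    'Use_Case': ['Voice', 'SMS', '', 'Voice', 'Video'],
})

# defaultdict(list) creates the empty list on first access,
# so no try/except retry is needed.
cmnts = defaultdict(list)
for name, use_case in zip(df['Name'], df['Use_Case']):
    cmnts[name].append(use_case if use_case else 'n/a')

# Keep the first row per Name, then look up each joined string by Name.
merged = df.drop_duplicates('Name').copy()
merged['Use_Case'] = merged['Name'].map(lambda n: '; '.join(cmnts[n]))
print(merged)
```

Working on a copy leaves the original df untouched, which matters if the snippet is run more than once.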
According to my 100-run timeit test, the iterate-and-replace method is an order of magnitude faster than the groupby method.
import pandas as pd
from my_stuff import time_something  # the answerer's own helper module

df = pd.DataFrame({'a': [i / (i % 4 + 1) for i in range(1, 10001)],
                   'b': [i for i in range(1, 10001)]})

runs = 100

interim_dict = 'txt = {}\n' \
               'for i, row in df.iterrows():\n' \
               '    try:\n' \
               "        txt[row['a']].append(row['b'])\n\n" \
               '    except KeyError:\n' \
               "        txt[row['a']] = []\n" \
               "df.drop_duplicates('a', inplace=True)\n" \
               "df['b'] = ['; '.join(v) for v in txt.values()]"

grouping = "new_df = df.groupby('a')['b'].apply(str).apply('; '.join).reset_index()"

print(time_something(interim_dict, runs, beg_string='Interim Dict', glbls=globals()))
print(time_something(grouping, runs, beg_string='Group By', glbls=globals()))
This yields:
Interim Dict
Total: 59.1164s
Avg: 591163748.5887ns
Group By
Total: 430.6203s
Avg: 4306203366.1827ns
where time_something is a function that times a snippet with timeit and returns the result in the above format.
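The answer does not show time_something itself. As a rough sketch of what such a helper might look like, assuming only the call signature and output format visible above (the body is a guess, not the answerer's code):

```python
import timeit


def time_something(stmt, runs, beg_string='', glbls=None):
    """Time `stmt` with timeit and format the total and per-run average.

    Hypothetical reconstruction: only the parameter names and the output
    layout are taken from the answer; the implementation is assumed.
    """
    total = timeit.timeit(stmt, number=runs, globals=glbls)  # seconds for all runs
    avg_ns = total / runs * 1e9  # average per run, in nanoseconds
    return f'{beg_string}\nTotal: {total:.4f}s\nAvg: {avg_ns:.4f}ns'


print(time_something('sum(range(1000))', 100, beg_string='Sum'))
```

Because the timed statements run against the shared globals, a statement that modifies df in place (such as drop_duplicates with inplace=True) changes the data seen by later runs, which is worth keeping in mind when comparing totals.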
Answered by Ami Tavory
You can groupby and apply the list function:
>>> df['Use_Case'].groupby([df.Name, df.Sid, df.Revenue]).apply(list).reset_index()
  Name   Sid Revenue                    0
0    A  xx01     .00         [Voice, SMS]
1    B  xx02     .00              [Voice]
2    C  xx03     .00  [Voice, SMS, Video]
(In case you are concerned about duplicates, use set instead of list.)
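To illustrate the duplicate-handling point, here is a small sketch on made-up data. It shows set as suggested, plus one possible alternative (dict.fromkeys, not part of the answer) for cases where the first-seen order should be kept and a joined string is wanted.

```python
import pandas as pd

# Hypothetical data with a repeated use case for Name 'A'.
df = pd.DataFrame({
    'Name': ['A', 'A', 'A', 'B'],
    'Use_Case': ['Voice', 'SMS', 'Voice', 'Voice'],
})

grouped = df.groupby('Name')['Use_Case']

# set drops duplicates but does not guarantee any particular order.
as_sets = grouped.apply(set)

# dict.fromkeys also drops duplicates and keeps first-seen order (Python 3.7+).
as_strings = grouped.apply(lambda s: ', '.join(dict.fromkeys(s)))

print(as_sets)
print(as_strings)
```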

