Python pandas - Merge nearly duplicate rows based on column value
Original question: http://stackoverflow.com/questions/36271413/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Asked by Matthew Rosenthal
I have a pandas dataframe with several rows that are near duplicates of each other, except for one value. My goal is to merge or "coalesce" these rows into a single row, without summing the numerical values.
Here is an example of what I'm working with:
Name Sid Use_Case Revenue
A xx01 Voice .00
A xx01 SMS .00
B xx02 Voice .00
C xx03 Voice .00
C xx03 SMS .00
C xx03 Video .00
And here is what I would like:
Name Sid Use_Case Revenue
A xx01 Voice, SMS .00
B xx02 Voice .00
C xx03 Voice, SMS, Video .00
I don't want to sum the "Revenue" column because my table is the result of a pivot over several time periods, where "Revenue" simply ends up listed multiple times instead of having a different value per "Use_Case".
What would be the best way to tackle this issue? I've looked into the groupby() function but I still don't understand it very well.
Answered by jezrael
I think you can use groupby with aggregate, using first and the custom function ', '.join:
df = df.groupby('Name').agg({'Sid': 'first',
                             'Use_Case': ', '.join,
                             'Revenue': 'first'}).reset_index()
# change column order
print(df[['Name', 'Sid', 'Use_Case', 'Revenue']])
Name Sid Use_Case Revenue
0 A xx01 Voice, SMS .00
1 B xx02 Voice .00
2 C xx03 Voice, SMS, Video .00
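For anyone who wants to try this end to end, here is a minimal self-contained sketch of the same groupby/agg call. The Revenue amounts are hypothetical placeholders (the amounts in the question's table are not fully shown), and the names merged and df are just for illustration.

```python
import pandas as pd

# Revenue values below are hypothetical placeholders, not the asker's data.
df = pd.DataFrame({
    "Name": ["A", "A", "B", "C", "C", "C"],
    "Sid": ["xx01", "xx01", "xx02", "xx03", "xx03", "xx03"],
    "Use_Case": ["Voice", "SMS", "Voice", "Voice", "SMS", "Video"],
    "Revenue": [10.0, 10.0, 5.0, 15.0, 15.0, 15.0],
})

# One row per Name: keep the first Sid and Revenue, join all Use_Case strings.
merged = df.groupby("Name").agg(
    {"Sid": "first", "Use_Case": ", ".join, "Revenue": "first"}
).reset_index()

# Restore the original column order.
merged = merged[["Name", "Sid", "Use_Case", "Revenue"]]
print(merged)
```

Because 'first' simply takes the first value in each group, this only gives a meaningful Revenue when the value really is repeated across the near-duplicate rows, as the question describes.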
Nice idea from comment, thanks Goyo:
df = df.groupby(['Name','Sid','Revenue'])['Use_Case'].apply(', '.join).reset_index()
# change column order
print(df[['Name', 'Sid', 'Use_Case', 'Revenue']])
Name Sid Use_Case Revenue
0 A xx01 Voice, SMS .00
1 B xx02 Voice .00
2 C xx03 Voice, SMS, Video .00
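The two variants agree on the question's data, but they can differ when the "duplicate" columns are not truly identical. The sketch below uses hypothetical data in which the two rows for A carry different Revenue values: grouping by Name alone still yields one row for A, while grouping by all three columns yields one row per distinct combination.

```python
import pandas as pd

# Hypothetical data: the two rows for "A" have different Revenue values.
df = pd.DataFrame({
    "Name": ["A", "A", "B"],
    "Sid": ["xx01", "xx01", "xx02"],
    "Use_Case": ["Voice", "SMS", "Voice"],
    "Revenue": [10.0, 12.0, 5.0],
})

# Variant 1: group by Name only, keep the first Sid and Revenue.
by_name = df.groupby("Name").agg(
    {"Sid": "first", "Use_Case": ", ".join, "Revenue": "first"}
).reset_index()

# Variant 2: group by every column that is expected to repeat.
by_all_keys = (
    df.groupby(["Name", "Sid", "Revenue"])["Use_Case"]
      .apply(", ".join)
      .reset_index()
)

print(len(by_name))      # one row per Name
print(len(by_all_keys))  # one row per distinct (Name, Sid, Revenue)
```

Which behaviour is preferable depends on the data: the first variant silently keeps the first Revenue, while the second makes any mismatch visible as extra rows.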
Answered by Eric Ed Lohmar
I was using some code that I didn't think was optimal and eventually found jezrael's answer. But after using it and running a timeit test, I actually went back to what I was doing, which was:
cmnts = {}
for i, row in df.iterrows():
    while True:
        try:
            if row['Use_Case']:
                cmnts[row['Name']].append(row['Use_Case'])
            else:
                cmnts[row['Name']].append('n/a')
            break
        except KeyError:
            cmnts[row['Name']] = []

df.drop_duplicates('Name', inplace=True)
df['Use_Case'] = ['; '.join(v) for v in cmnts.values()]
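As a possible variation on the same idea (not the answerer's code), the retry loop can be replaced with collections.defaultdict, and the joined strings can be mapped back by Name so the result does not depend on dictionary ordering. The data and the names cmnts and result are illustrative assumptions.

```python
from collections import defaultdict

import pandas as pd

# Hypothetical data; the empty string stands in for a missing Use_Case.
df = pd.DataFrame({
    "Name": ["A", "A", "B"],
    "Use_Case": ["Voice", "", "SMS"],
})

# Collect every Use_Case per Name; defaultdict creates the list on first use.
cmnts = defaultdict(list)
for name, use_case in zip(df["Name"], df["Use_Case"]):
    cmnts[name].append(use_case if use_case else "n/a")

# Keep one row per Name and look up its joined string by key.
result = df.drop_duplicates("Name").copy()
result["Use_Case"] = result["Name"].map(lambda n: "; ".join(cmnts[n]))
print(result)
```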
According to my 100-run timeit test, the iterate and replace method is nearly an order of magnitude faster than the groupby method.
import pandas as pd
from my_stuff import time_something

df = pd.DataFrame({'a': [i / (i % 4 + 1) for i in range(1, 10001)],
                   'b': [i for i in range(1, 10001)]})
runs = 100

interim_dict = 'txt = {}\n' \
               'for i, row in df.iterrows():\n' \
               '    try:\n' \
               "        txt[row['a']].append(row['b'])\n\n" \
               '    except KeyError:\n' \
               "        txt[row['a']] = []\n" \
               "df.drop_duplicates('a', inplace=True)\n" \
               "df['b'] = ['; '.join(v) for v in txt.values()]"
grouping = "new_df = df.groupby('a')['b'].apply(str).apply('; '.join).reset_index()"

print(time_something(interim_dict, runs, beg_string='Interim Dict', glbls=globals()))
print(time_something(grouping, runs, beg_string='Group By', glbls=globals()))
yields:
Interim Dict
Total: 59.1164s
Avg: 591163748.5887ns
Group By
Total: 430.6203s
Avg: 4306203366.1827ns
where time_something is a function which times a snippet with timeit and returns the result in the above format.
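Since time_something is not shown, here is a rough sketch of how such a comparison could be rerun with the standard timeit module alone. Everything in it is an assumption for illustration: the small test frame, the string values in column b (so that '; '.join receives strings), the function names, and the run count. It also groups with a direct '; '.join and keeps every value per key, so it is not a line-for-line copy of the snippets timed above, and its timings will vary by machine and pandas version.

```python
import timeit

import pandas as pd

# Hypothetical test frame: 'a' is the grouping key, 'b' holds strings to join.
df = pd.DataFrame({
    "a": [i % 50 for i in range(2000)],
    "b": [str(i) for i in range(2000)],
})


def interim_dict(frame):
    """Collect values per key in a dict, then attach the joined strings."""
    txt = {}
    for _, row in frame.iterrows():
        txt.setdefault(row["a"], []).append(row["b"])
    out = frame.drop_duplicates("a").copy()
    out["b"] = out["a"].map(lambda key: "; ".join(txt[key]))
    return out.reset_index(drop=True)


def group_by(frame):
    """Let groupby do the collecting and join each group's strings."""
    return frame.groupby("a", sort=False)["b"].apply("; ".join).reset_index()


runs = 5
for label, func in [("Interim Dict", interim_dict), ("Group By", group_by)]:
    total = timeit.timeit(lambda: func(df), number=runs)
    print(f"{label}: total {total:.4f}s, avg {total / runs:.4f}s")
```

Both functions work on the frame passed in and return a new one, so repeated runs see the same input each time.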
Answered by Ami Tavory
You can groupby and apply the list function:
>>> df['Use_Case'].groupby([df.Name, df.Sid, df.Revenue]).apply(list).reset_index()
Name Sid Revenue 0
0 A xx01 .00 [Voice, SMS]
1 B xx02 .00 [Voice]
2 C xx03 .00 [Voice, SMS, Video]
(In case you are concerned about duplicates, use set instead of list.)
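A small sketch of the difference, using hypothetical data with a repeated Use_Case. Note that a set does not keep the original order; if order matters, one option (an addition here, not part of the answer) is dict.fromkeys, which drops repeats while keeping first-seen order.

```python
import pandas as pd

# Hypothetical data: "Voice" appears twice for "A".
df = pd.DataFrame({
    "Name": ["A", "A", "A", "B"],
    "Use_Case": ["Voice", "SMS", "Voice", "Voice"],
})

grouped = df.groupby("Name")["Use_Case"]

as_list = grouped.apply(list)  # keeps repeats
as_set = grouped.apply(set)    # drops repeats, order not guaranteed

# Drops repeats and keeps first-seen order, joined into one string.
ordered = grouped.apply(lambda s: ", ".join(dict.fromkeys(s)))

print(as_list)
print(as_set)
print(ordered)
```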