Original source: http://stackoverflow.com/questions/17006476/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me):
Stack Overflow
Python Pandas -- merging mostly duplicated rows
Asked by severian
Some of my data looks like:
date, name, value1, value2, value3, value4
1/1/2001,ABC,1,1,,
1/1/2001,ABC,,,2,
1/1/2001,ABC,,,,35
I am trying to get to the point where I can run:
data.set_index(['date', 'name'])
But with the data as-is there are of course duplicates (as shown above), so I cannot do this: I don't want an index with duplicates, and I can't simply call drop_duplicates(), since this would lose data.
I would like to be able to force rows which have the same [date, name] values into a single row, if they can be successfully converged based on certain values being NaN (similar to the behavior of combine_first()). E.g., the above would end up as:
date, name, value1, value2, value3, value4
1/1/2001,ABC,1,1,2,35
If two values are different and one is not NaN, the two rows should not be converged (this would probably be an error that I would need to follow up on).
(To extend the above example, there may in fact be an arbitrary number of lines--given an arbitrary number of columns--which should be able to be converged into one single line.)
This feels like a problem that should be very solvable via pandas, but I am having trouble figuring out an elegant solution.
Accepted answer by Jeff Tratner
Let's imagine you have some function combine_it that, given a set of rows that would have duplicate values, returns a single row. First, group by date and name:
grouped = data.groupby(['date', 'name'])
Then just apply the aggregation function and boom, you're done:
result = grouped.agg(combine_it)
You can also provide different aggregation functions for different columns by passing agg a dict.
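The answer leaves combine_it undefined. As a minimal sketch of what it could look like, assuming the goal stated in the question (keep the single non-NaN value per column, and treat two different non-NaN values as an error), something like the following may work. The body of combine_it and the sample frame are illustrative assumptions, not the answerer's code:

```python
import numpy as np
import pandas as pd


def combine_it(values):
    """Hypothetical helper: collapse one column of one group to a single value."""
    present = values.dropna().unique()  # distinct non-NaN values in this group
    if len(present) == 0:
        return np.nan  # nothing known for this column
    if len(present) > 1:
        # Two different non-NaN values: refuse to merge, as the question asks.
        raise ValueError(f"conflicting values in {values.name!r}: {list(present)}")
    return present[0]


# Sample data rebuilt from the question.
data = pd.DataFrame({
    "date": ["1/1/2001"] * 3,
    "name": ["ABC"] * 3,
    "value1": [1, np.nan, np.nan],
    "value2": [1, np.nan, np.nan],
    "value3": [np.nan, 2, np.nan],
    "value4": [np.nan, np.nan, 35],
})

# agg calls combine_it once per column per group; the group keys become the index.
result = data.groupby(["date", "name"]).agg(combine_it)
print(result)
```

If conflicts do not need to raise, the built-in data.groupby(['date', 'name']).first() is a shorter option, since it takes the first non-null entry of each column in each group.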
Answer by Abu Shoeb
Since each column holds at most one non-NaN value per [date, name] group, you can use the trick of calling agg with 'sum', like this:
data.groupby(['date', 'name']).agg('sum')
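Two caveats with the sum trick are worth checking on your own data. The sketch below uses a made-up frame (the conflicting value3 entries and the empty value4 column are assumptions added for illustration) to show that a column with no values at all in a group sums to 0 rather than NaN unless min_count=1 is passed, and that conflicting values are silently added together, which nunique() can flag beforehand:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    "date": ["1/1/2001"] * 3,
    "name": ["ABC"] * 3,
    "value1": [1, np.nan, np.nan],
    "value3": [np.nan, 2, 5],                # two different non-NaN values: a conflict
    "value4": [np.nan, np.nan, np.nan],      # never filled in for this group
})
grouped = data.groupby(["date", "name"])

plain = grouped.sum()                # value4 becomes 0.0, value3 becomes 2 + 5
kept_nan = grouped.sum(min_count=1)  # value4 stays NaN when no value is present
conflicts = grouped.nunique() > 1    # True where a group has several distinct non-NaN values

print(plain)
print(kept_nan)
print(conflicts)
```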
Answer by Philipp Schwarz
If you do not have numeric field values, aggregating with count, min, sum, etc. will be neither possible nor sensible. Nevertheless, you may still want to collapse duplicate records into individual records, e.g. based on one or more primary keys.
# First, avoid NaN values in the columns you are grouping on!
df[['col1', 'col2']] = df[['col1', 'col2']].fillna('null')
# Define your own customized operation in the pandas agg() function
df = df.groupby(['col1', 'col2']).agg({
    'SEARCH_TERM': lambda x: ', '.join(tuple(x.tolist())),
    'HITS_CONTENT': lambda x: ', '.join(tuple(x.tolist())),
})
Group by one or more columns and collapse the values by converting them first to a list, then to a tuple, and finally to a string. If you prefer, you can also keep them as a list or tuple stored in each field, or pass agg() a dictionary to apply very different operations to different columns.
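Note that ', '.join(...) raises a TypeError if a group contains NaN or other non-string values. A variant that may be more robust, sketched here with invented sample data and a hypothetical helper name join_present, drops missing entries and casts to str before joining:

```python
import numpy as np
import pandas as pd

# Invented sample data; the column names follow the answer above.
df = pd.DataFrame({
    "col1": ["a", "a", "b"],
    "col2": ["x", "x", "y"],
    "SEARCH_TERM": ["pandas", np.nan, "numpy"],
    "HITS_CONTENT": [np.nan, "merge rows", "arrays"],
})


def join_present(values):
    """Hypothetical helper: join the non-missing values of a group into one string."""
    return ", ".join(values.dropna().astype(str))


collapsed = df.groupby(["col1", "col2"]).agg(
    {"SEARCH_TERM": join_present, "HITS_CONTENT": join_present}
)
print(collapsed)
```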

