pandas: combine consecutive rows with the same column values
Note: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): Stack Overflow
Original URL: http://stackoverflow.com/questions/25147091/
Combine Consecutive Rows with the Same Column Values
Asked by user3314418
I have something that looks like this. How do I go from this:
            0             d
0         The            DT
1       Skoll  ORGANIZATION
2  Foundation  ORGANIZATION
3           ,             ,
4       based           VBN
5          in            IN
6     Silicon      LOCATION
7      Valley      LOCATION
to this:
                  0             d
0               The            DT
1  Skoll Foundation  ORGANIZATION
3                 ,             ,
4             based           VBN
5                in            IN
6    Silicon Valley      LOCATION
Answered by chrisb
@rfan's answer of course works. As an alternative, here's an approach using pandas groupby.
The .groupby() call groups the data by the 'b' column; sort=False is necessary to keep the original order intact. The .apply() call then applies a function to the 'a' values in each 'b' group, in this case joining the strings together separated by spaces.
In [67]: df.groupby('b', sort=False)['a'].apply(' '.join)
Out[67]:
b
DT                       The
Org         Skoll Foundation
,                          ,
VBN                    based
IN                        in
Location      Silicon Valley
Name: a, dtype: object
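To see what sort=False changes, here is a minimal sketch that rebuilds the sample data (using the column names 'a' and 'b' from this answer) and compares the group order with and without it. The variable names in_order and by_key are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'a': ['The', 'Skoll', 'Foundation', ',', 'based', 'in', 'Silicon', 'Valley'],
    'b': ['DT', 'Org', 'Org', ',', 'VBN', 'IN', 'Location', 'Location'],
})

# sort=False keeps the groups in order of first appearance in the data
in_order = df.groupby('b', sort=False)['a'].apply(' '.join)

# The default (sort=True) orders the groups by the key values instead
by_key = df.groupby('b')['a'].apply(' '.join)

print(list(in_order.index))
print(list(by_key.index))
```

With the default sorting, the tags come back in sorted key order, so the sentence order is lost.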
EDIT:
To handle the more general case (repeated non-consecutive values), one approach is to first add a sentinel column that tracks which run of consecutive data each row belongs to, like this:
df['key'] = (df['b'] != df['b'].shift(1)).astype(int).cumsum()
Then add the key to the groupby, and it should work even with repeated values. For example, with this dummy data containing repeats:
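from pandas import DataFrame

df = DataFrame({'a': ['The', 'Skoll', 'Foundation', ',',
                      'based', 'in', 'Silicon', 'Valley', 'A', 'Foundation'],
                'b': ['DT', 'Org', 'Org', ',', 'VBN', 'IN',
                      'Location', 'Location', 'Org', 'Org']})
# Compute the key for this data: it increments whenever 'b' changes
df['key'] = (df['b'] != df['b'].shift(1)).astype(int).cumsum()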
Applying the groupby:
In [897]: df.groupby(['key', 'b'])['a'].apply(' '.join)
Out[897]:
key  b
1    DT                       The
2    Org         Skoll Foundation
3    ,                          ,
4    VBN                    based
5    IN                        in
6    Location      Silicon Valley
7    Org             A Foundation
Name: a, dtype: object
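If a flat DataFrame shaped like the one in the question is wanted rather than a Series with a MultiIndex, the steps above can be sketched end to end as follows. This is only one possible way to finish the job; the name combined is illustrative, and the final reset_index and column selection are additions not shown in the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    'a': ['The', 'Skoll', 'Foundation', ',', 'based', 'in',
          'Silicon', 'Valley', 'A', 'Foundation'],
    'b': ['DT', 'Org', 'Org', ',', 'VBN', 'IN',
          'Location', 'Location', 'Org', 'Org'],
})

# Mark rows where 'b' differs from the previous row, then take a running
# total so every run of equal consecutive values gets its own id
df['key'] = (df['b'] != df['b'].shift(1)).astype(int).cumsum()

# Group by run id and tag, join the words, and flatten back to columns
combined = (
    df.groupby(['key', 'b'])['a']
      .apply(' '.join)
      .reset_index()[['a', 'b']]
)

print(combined)
```

Because the key increases from top to bottom, the groups come out in the original row order even with the default sorting.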
Answered by Roger Fan
I actually think the groupby solution by @chrisb is better, but you would need to create another groupby key variable to track non-consecutive repeated values if those are potentially present. The approach below works as a quick-and-dirty solution for smaller problems, though.
I think this is a situation where it's easier to work with basic iterators than to try to use pandas functions. I can imagine a solution using groupby, but it seems difficult to maintain the consecutive condition if the second variable repeats.
This can probably be cleaned up, but here is a sample:
from pandas import DataFrame

df = DataFrame({'a': ['The', 'Skoll', 'Foundation', ',',
                      'based', 'in', 'Silicon', 'Valley'],
                'b': ['DT', 'Org', 'Org', ',', 'VBN', 'IN',
                      'Location', 'Location']})

# Initialize result lists with the first row of df
result1 = [df['a'][0]]
result2 = [df['b'][0]]

# Use zip() to iterate over the two columns of df simultaneously,
# making sure to skip the first row which is already added
for a, b in zip(df['a'][1:], df['b'][1:]):
    if b == result2[-1]:        # If b matches the last value in result2,
        result1[-1] += " " + a  # add a to the last value of result1
    else:                       # Otherwise add a new row with the values
        result1.append(a)
        result2.append(b)

# Create a new dataframe using these result lists
df = DataFrame({'a': result1, 'b': result2})
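As a possible cleanup of the loop above, the same iterator-based idea can be sketched with itertools.groupby from the standard library, which only groups adjacent items with equal keys. This is a swapped-in technique, not part of the original answer; the names rows and combined are illustrative, and the sample data includes a repeated non-consecutive tag to show that separate runs stay separate.

```python
from itertools import groupby

import pandas as pd

df = pd.DataFrame({
    'a': ['The', 'Skoll', 'Foundation', ',', 'based', 'in',
          'Silicon', 'Valley', 'A', 'Foundation'],
    'b': ['DT', 'Org', 'Org', ',', 'VBN', 'IN',
          'Location', 'Location', 'Org', 'Org'],
})

rows = []
# Walk over (word, tag) pairs; itertools.groupby starts a new group
# every time the tag changes, so only consecutive rows are merged
for tag, run in groupby(zip(df['a'], df['b']), key=lambda pair: pair[1]):
    rows.append({'a': ' '.join(word for word, _ in run), 'b': tag})

combined = pd.DataFrame(rows, columns=['a', 'b'])
print(combined)
```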

