pandas: merging consecutive rows that have the same column value

Question

Asked by user3314418

I have a DataFrame that looks like the one below. How do I go from this:

    0             d
0   The         DT
1   Skoll       ORGANIZATION
2   Foundation  ORGANIZATION
3   ,           ,
4   based       VBN
5   in          IN
6   Silicon     LOCATION
7   Valley      LOCATION

to this:

    0                       d
0   The                     DT
1   Skoll Foundation        ORGANIZATION
3   ,                       ,
4   based                   VBN
5   in                      IN
6   Silicon Valley          LOCATION

Answer 1

Answered by chrisb

@rfan's answer works, of course. As an alternative, here is an approach using pandas groupby.

.groupby() groups the data by the 'b' column; sort=False is needed to keep the groups in their original order. .apply() then applies a function to the 'a' values of each group, in this case joining the strings together, separated by spaces. (Here 'a' and 'b' stand for the question's columns 0 and d, with the tags shortened to Org and Location.)

In [67]: df.groupby('b', sort=False)['a'].apply(' '.join)
Out[67]: 

b
DT                       The
Org         Skoll Foundation
,                          ,
VBN                    based
IN                        in
Location      Silicon Valley
Name: a, dtype: object

EDIT:

To handle the more general case, where the same value repeats in non-consecutive positions, one approach is to first add a helper key column that records which run of consecutive values each row belongs to, like this:

df['key'] = (df['b'] != df['b'].shift(1)).astype(int).cumsum()
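To see what that key looks like, here is a minimal sketch on a toy Series. The names b, starts, and key are only for illustration and are not from the original answer.

```python
import pandas as pd

# Toy tag column: 'Org' appears in two separate runs.
b = pd.Series(['DT', 'Org', 'Org', ',', 'Org'])

# True wherever the value differs from the previous row,
# i.e. wherever a new run of consecutive values begins.
starts = b != b.shift()

# A running count of run starts gives each run its own id.
key = starts.cumsum()

print(starts.tolist())  # [True, True, False, True, True]
print(key.tolist())     # [1, 2, 2, 3, 4]
```

The two 'Org' runs get different ids (2 and 4), which is what keeps them from being merged into one group.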

Then add the key to the groupby, and it works even with repeated values. For example, with this dummy data containing repeats:

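from pandas import DataFrame

df = DataFrame({'a': ['The', 'Skoll', 'Foundation', ',', 
                      'based', 'in', 'Silicon', 'Valley', 'A', 'Foundation'], 
                'b': ['DT', 'Org', 'Org', ',', 'VBN', 'IN', 
                      'Location', 'Location', 'Org', 'Org']})
# Compute the run key for this data before grouping on it
df['key'] = (df['b'] != df['b'].shift(1)).astype(int).cumsum()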

Applying the groupby:

In [897]: df.groupby(['key', 'b'])['a'].apply(' '.join)
Out[897]: 
key  b       
1    DT                       The
2    Org         Skoll Foundation
3    ,                          ,
4    VBN                    based
5    IN                        in
6    Location      Silicon Valley
7    Org             A Foundation
Name: a, dtype: object
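If you want a flat DataFrame back rather than a Series with a MultiIndex, one possible variant is sketched below. This is not part of the original answer: it assumes pandas 0.25 or later (for named aggregation), and the names run and merged are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'a': ['The', 'Skoll', 'Foundation', ',', 'based', 'in',
          'Silicon', 'Valley', 'A', 'Foundation'],
    'b': ['DT', 'Org', 'Org', ',', 'VBN', 'IN',
          'Location', 'Location', 'Org', 'Org'],
})

# Run id: increments each time 'b' differs from the previous row.
# Renamed so the grouping key does not clash with column 'b'.
run = (df['b'] != df['b'].shift()).cumsum().rename('run')

merged = (
    df.groupby(run, sort=False)
      .agg(a=('a', ' '.join),   # join the words within each run
           b=('b', 'first'))    # every row in a run has the same tag
      .reset_index(drop=True)
)
print(merged)
```

Grouping by a separate Series instead of adding a 'key' column leaves the original df unchanged, which may or may not matter for your use case.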

Answer 2

Answered by Roger Fan

I actually think the groupby solution by @chrisb is better, although you would need to create another groupby key variable to track non-consecutive repeated values if those might be present. Still, the approach below works as a quick-and-dirty solution for smaller problems.

I think this is a situation where it is easier to work with basic iterators than to try to use pandas functions. I can imagine a solution using groupby, but it seems difficult to preserve the consecutive condition if a value in the second column repeats.

This can probably be cleaned up, but here is a sample:

from pandas import DataFrame

df = DataFrame({'a': ['The', 'Skoll', 'Foundation', ',', 
                      'based', 'in', 'Silicon', 'Valley'], 
                'b': ['DT', 'Org', 'Org', ',', 'VBN', 'IN', 
                      'Location', 'Location']})

# Initialize result lists with the first row of df
result1 = [df['a'][0]]  
result2 = [df['b'][0]]

# Use zip() to iterate over the two columns of df simultaneously,
# making sure to skip the first row which is already added
for a, b in zip(df['a'][1:], df['b'][1:]):
    if b == result2[-1]:        # If b matches the last value in result2,
        result1[-1] += " " + a  # add a to the last value of result1
    else:  # Otherwise add a new row with the values
        result1.append(a)
        result2.append(b)

# Create a new dataframe using these result lists
df = DataFrame({'a': result1, 'b': result2})
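For what it's worth, the standard library's itertools.groupby only groups adjacent items with equal keys, so the same loop can probably be written more compactly with it. The following is a hedged sketch, not the original author's code; the names rows and merged are illustrative.

```python
from itertools import groupby

import pandas as pd

df = pd.DataFrame({
    'a': ['The', 'Skoll', 'Foundation', ',', 'based', 'in',
          'Silicon', 'Valley', 'A', 'Foundation'],
    'b': ['DT', 'Org', 'Org', ',', 'VBN', 'IN',
          'Location', 'Location', 'Org', 'Org'],
})

rows = []
# itertools.groupby starts a new group whenever the key changes,
# so non-consecutive repeats of a tag stay in separate groups.
for tag, run in groupby(zip(df['a'], df['b']), key=lambda pair: pair[1]):
    words = ' '.join(word for word, _ in run)
    rows.append((words, tag))

merged = pd.DataFrame(rows, columns=['a', 'b'])
print(merged)
```

Like the loop above, this iterates in pure Python, so it is best suited to small inputs.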
