pandas: How to select the first row in each GROUP BY group?

Question

Asked by ihadanny

Basically the same as "Select first row in each GROUP BY group?", only in pandas.

import pandas as pd

df = pd.DataFrame({'A': ['foo', 'foo', 'foo', 'foo', 'bar', 'bar', 'bar', 'bar'],
                   'B': ['3', '1', '2', '4', '2', '4', '1', '3']})

Sorting looks promising:

df.sort_values('B')

     A  B
1  foo  1
6  bar  1
2  foo  2
4  bar  2
0  foo  3
7  bar  3
3  foo  4
5  bar  4

But then first() won't give the desired result:

df.groupby('A').first()

     B
A     
bar  2
foo  3

Answer 1

Accepted answer by EdChum

Generally, if you want the rows within each group ordered by a column that is not one of the grouping columns, it's better to sort the df before performing the groupby:

In [5]:
df.sort_values('B').groupby('A').first()

Out[5]:
     B
A     
bar  1
foo  1
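
As a minimal, self-contained sketch of the difference, the snippet below runs first() with and without the preceding sort on the question's sample data. The variable names are illustrative. Note that B holds strings here, so the sort is lexicographic; with multi-digit values you would probably want to convert the column to numbers first.

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['foo', 'foo', 'foo', 'foo', 'bar', 'bar', 'bar', 'bar'],
    'B': ['3', '1', '2', '4', '2', '4', '1', '3'],
})

# Without sorting, first() picks the first row of each group
# in the original row order.
unsorted_first = df.groupby('A').first()

# Sorting by B beforehand makes the smallest B the first row of each group,
# because groupby keeps the row order within each group.
sorted_first = df.sort_values('B').groupby('A').first()

print(unsorted_first)
print(sorted_first)
```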

Answer 2

Answer by firelynx

The pandas groupby function could be used for what you want, but it's really meant for aggregation. This is a simple 'take the first' operation.

What you actually want is the pandas drop_duplicates function, which by default keeps the first row of each set of duplicates. What you would usually consider the groupby key should be passed as the subset= argument:

df.drop_duplicates(subset='A')

This should do what you want.

Also, df.sort_values('A') (df.sort('A') in older pandas versions) does not sort the DataFrame df in place; it returns a sorted copy. To sort df itself, add the inplace=True parameter:

df.sort_values('A', inplace=True)
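
A short sketch of both points, using the question's sample data; the variable names are made up for illustration. It shows that drop_duplicates keeps the first occurrence in the current row order, that sort_values leaves the original frame alone, and that the inplace=True form returns None.

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['foo', 'foo', 'foo', 'foo', 'bar', 'bar', 'bar', 'bar'],
    'B': ['3', '1', '2', '4', '2', '4', '1', '3'],
})

# Keeps the first row seen for each value of A, in the current row order.
first_rows = df.drop_duplicates(subset='A')

# sort_values returns a new DataFrame; df itself is not reordered.
sorted_copy = df.sort_values('A')

# With inplace=True the frame is modified and the call returns None.
df_sorted = df.copy()
returned = df_sorted.sort_values('A', inplace=True)

print(first_rows)
```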

Answer 3

Answer by JohnE

Here's an alternative approach using groupby().rank():

df[ df.groupby('A')['B'].rank() == 1 ]

     A  B
1  foo  1
6  bar  1

This gives the same answer as @EdChum's for the OP's sample dataframe, but it could give a different answer if there are ties in the sort, for example with data like this:

df = pd.DataFrame({'A': ['foo', 'foo', 'bar', 'bar'], 
                   'B': ['2', '1', '1', '1'] })

In this case you have some options via the optional method argument, depending on how you wish to handle ties:

df[ df.groupby('A')['B'].rank(method='average') == 1 ]   # the default
df[ df.groupby('A')['B'].rank(method='min')     == 1 ]
df[ df.groupby('A')['B'].rank(method='first')   == 1 ]   # doesn't work, not sure why
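
To see how the tie-handling methods differ, here is a sketch on the same data with one assumption that departs from the answer: B is stored as integers rather than strings. With numeric data all three methods can be evaluated, which may not hold for string columns.

```python
import pandas as pd

# Assumption: numeric B (the answer's example uses strings).
df = pd.DataFrame({'A': ['foo', 'foo', 'bar', 'bar'],
                   'B': [2, 1, 1, 1]})

grouped_b = df.groupby('A')['B']

# 'average': tied rows share the mean of their ranks (1.5 each for 'bar'),
# so no 'bar' row has rank exactly 1.
by_average = df[grouped_b.rank(method='average') == 1]

# 'min': tied rows all get the lowest rank, so every tied row is selected.
by_min = df[grouped_b.rank(method='min') == 1]

# 'first': ties are broken by order of appearance, so one row per group.
by_first = df[grouped_b.rank(method='first') == 1]

print(by_average)
print(by_min)
print(by_first)
```

Only method='first' is guaranteed to return exactly one row per group in this sketch; 'average' can return none and 'min' can return several.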

Answer 4

Answer by Dídac Fernández

EdChum's answer may not always work as intended. Instead of first(), use nth(0).

The method first() is affected by a bug that has gone unsolved for some years now. Instead of the expected behaviour, first() returns the first element that is not missing in each column within each group, i.e. it ignores NaN values. For example, say you had a third column with some missing values:

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'foo', 'bar', 'bar', 'bar'],
                   'B': ['1', '2', '2', '4', '1'],
                   'C': [np.nan, 'X', 'Y', 'Y', 'Y']})

     A  B    C
0  foo  1  NaN
1  foo  2    X
2  bar  2    Y
3  bar  4    Y
4  bar  1    Y

Using first() here (after sorting, as EdChum correctly suggests in their answer) will skip over the missing values. Note how it mixes values from different rows:

df.sort_values('B').groupby('A').first()

    B   C
A       
bar 1   Y
foo 1   X

The correct way to get the full row, including missing values, is to use nth(0), which performs the expected operation:

df.sort_values('B').groupby('A').nth(0)

    B   C
A       
bar 1   Y
foo 1   NaN
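
The difference between the two calls can be reproduced with the sketch below. The variable names are illustrative, and groupby().head(1) is added here as an extra comparison that is not part of the original answer. Depending on the pandas version, nth(0) may be indexed by the group key (as in the output above) or by the original row labels, so the checks look only at the values.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'foo', 'bar', 'bar', 'bar'],
                   'B': ['1', '2', '2', '4', '1'],
                   'C': [np.nan, 'X', 'Y', 'Y', 'Y']})

ordered = df.sort_values('B')

# first() takes the first non-missing value per column, so the 'foo'
# result combines B from one row with C from another.
via_first = ordered.groupby('A').first()

# nth(0) takes the actual first row of each group, NaN included.
via_nth = ordered.groupby('A').nth(0)

# head(1) also returns whole rows and keeps the original row labels.
via_head = ordered.groupby('A').head(1)

print(via_first)
print(via_nth)
print(via_head)
```

If the missing values matter, comparing via_first against via_nth on your own data is a quick way to see whether any row has been stitched together from several rows.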

For completeness, this bug also affects last(); its correct substitute is nth(-1).

Posting this as an answer since it's too long for a comment. I'm not sure it is within the scope of the question, but I think it's relevant to many people looking for this answer (like myself before writing this), and it is extremely easy to miss.

Answer 5

Answer by fpersyn

Typically you use GroupBy if there is a need to run a computation on each group (see: split-apply-combine pattern).

If you merely want to keep the first row for each unique value of a column (or each unique combination of columns), you can sort using .sort_values() (or .sort_index()) and then keep the first occurrence of each value using .drop_duplicates().

df.sort_values('A', ascending=True).drop_duplicates('A', keep='first')

This approach is non-destructive: the original DataFrame is left untouched, and the result keeps the original column structure and index labels:

    A   B
4   bar 2
0   foo 3
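
As a variation on the same idea, sorting by both columns before dropping duplicates picks the row with the smallest B in each group, while still keeping the original index labels. This is a sketch on the question's sample data, with illustrative variable names; B is a string column, so the ordering is lexicographic.

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['foo', 'foo', 'foo', 'foo', 'bar', 'bar', 'bar', 'bar'],
    'B': ['3', '1', '2', '4', '2', '4', '1', '3'],
})

# Sort by group key, then by B within each group, and keep the first
# row seen for each value of A.
smallest_per_group = (
    df.sort_values(['A', 'B'], ascending=True)
      .drop_duplicates('A', keep='first')
)

print(smallest_per_group)
```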
