Python: keep other columns when using groupby

Question

Asked by PointXIV

I'm using groupby on a pandas dataframe to drop all rows that don't have the minimum of a specific column. Something like this:

df1 = df.groupby("item", as_index=False)["diff"].min()

However, if I have more than those two columns, the other columns (e.g. otherstuff in my example) get dropped. Can I keep those columns using groupby, or am I going to have to find a different way to drop the rows?

My data looks like:

   item  diff  otherstuff
0     1     2           1
1     1     1           2
2     1     3           7
3     2    -1           0
4     2     1           3
5     2     4           9
6     2    -6           2
7     3     0           0
8     3     2           9

and should end up like:

   item  diff  otherstuff
0     1     1           2
1     2    -6           2
2     3     0           0

but what I'm getting is:

   item  diff
0     1     1
1     2    -6
2     3     0

I've been looking through the documentation and can't find anything. I tried:

df1 = df.groupby(["item", "otherstuff"], as_index=False)["diff"].min()

df1 = df.groupby("item", as_index=False)["diff"].min()["otherstuff"]

df1 = df.groupby("item", as_index=False)["otherstuff", "diff"].min()

But none of those work (I realized with the last one that the syntax is meant for aggregating after a group is created).

Answer 1

Accepted answer by DSM

Method #1: use idxmin() to get the indices of the elements of minimum diff, and then select those:

>>> df.loc[df.groupby("item")["diff"].idxmin()]
   item  diff  otherstuff
1     1     1           2
6     2    -6           2
7     3     0           0

[3 rows x 3 columns]
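
A self-contained sketch of Method #1, with the DataFrame rebuilt from the sample data in the question. The variable names min_labels and result are illustrative only. It assumes the index labels are unique; with duplicate labels, .loc could return more rows than intended.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "item": [1, 1, 1, 2, 2, 2, 2, 3, 3],
    "diff": [2, 1, 3, -1, 1, 4, -6, 0, 2],
    "otherstuff": [1, 2, 7, 0, 3, 9, 2, 0, 9],
})

# For each item, idxmin() gives the index label of the row with the smallest diff.
min_labels = df.groupby("item")["diff"].idxmin()

# .loc selects those whole rows, so every column is kept.
result = df.loc[min_labels]
print(result)
```

Because whole rows are selected by label, otherstuff (and any further columns) come along without being aggregated.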

Method #2: sort by diff, and then take the first element in each item group:

>>> df.sort_values("diff").groupby("item", as_index=False).first()
   item  diff  otherstuff
0     1     1           2
1     2    -6           2
2     3     0           0

[3 rows x 3 columns]

Note that the resulting indices are different even though the row content is the same.
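
To make the index difference concrete, here is a small sketch on the question's sample data. It also shows head(1) as a possible variant of Method #2; that variant is not part of the original answer. One reason to consider it: first() takes the first non-null value of each column, so if the data contained missing values the result could combine values from different rows, whereas head(1) returns whole rows with their original labels.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "item": [1, 1, 1, 2, 2, 2, 2, 3, 3],
    "diff": [2, 1, 3, -1, 1, 4, -6, 0, 2],
    "otherstuff": [1, 2, 7, 0, 3, 9, 2, 0, 9],
})

ordered = df.sort_values("diff")

# Method #2 as given: one row per item, with a fresh 0..n-1 index.
by_first = ordered.groupby("item", as_index=False).first()

# Variant: head(1) keeps the first full row of each group and its original index label.
by_head = ordered.groupby("item").head(1)

print(by_first)
print(by_head.sort_index())
```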

Answer 2

Answer by citynorman

The answer above works well if each group has, or you only want, a single minimum row. In my case several rows could be tied for the minimum and I wanted all of them, which .idxmin() doesn't give you. This worked:

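import pandas as pd

def filter_group(dfg, col):
    # Keep every row of the group whose value in `col` equals the group's minimum
    return dfg[dfg[col] == dfg[col].min()]

df = pd.DataFrame({'g': ['a'] * 6 + ['b'] * 6, 'v1': (list(range(3)) + list(range(3))) * 2, 'v2': range(12)})
# group_keys=False keeps the original row index instead of adding 'g' as an index level
df.groupby('g', group_keys=False).apply(lambda x: filter_group(x, 'v1'))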

As an aside, .filter() is also relevant to this question but didn't work for me.

Answer 3

Answer by Nic Scozzaro

If you know that every "item" has more than one record, you can sort and then use duplicated:

df.sort_values(by='diff').duplicated(subset='item', keep='first')
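
Note that duplicated returns a boolean Series, not the filtered rows: after sorting, it is True for every row except the first (smallest diff) row of each item. A minimal sketch of turning that mask into the desired rows, using the question's sample data; the names ordered, is_repeat and result are illustrative:

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "item": [1, 1, 1, 2, 2, 2, 2, 3, 3],
    "diff": [2, 1, 3, -1, 1, 4, -6, 0, 2],
    "otherstuff": [1, 2, 7, 0, 3, 9, 2, 0, 9],
})

ordered = df.sort_values(by="diff")

# True for every row that is NOT the first occurrence of its item in sorted order.
is_repeat = ordered.duplicated(subset="item", keep="first")

# Invert the mask to keep only the minimum-diff row of each item.
result = ordered[~is_repeat]
print(result)
```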

Answer 4

Answer by jezrael

You can use DataFrame.sort_values with DataFrame.drop_duplicates:

df = df.sort_values(by='diff').drop_duplicates(subset='item')
print (df)
   item  diff  otherstuff
6     2    -6           2
7     3     0           0
1     1     1           2
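
The rows above come back ordered by diff rather than in their original order. If the original order matters, one option (an addition here, not part of the answer) is to chain sort_index() at the end. A short sketch on the question's sample data:

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "item": [1, 1, 1, 2, 2, 2, 2, 3, 3],
    "diff": [2, 1, 3, -1, 1, 4, -6, 0, 2],
    "otherstuff": [1, 2, 7, 0, 3, 9, 2, 0, 9],
})

# Sort by diff, keep the first row seen for each item, then restore the original row order.
result = (
    df.sort_values(by="diff")
      .drop_duplicates(subset="item")
      .sort_index()
)
print(result)
```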

If a group can have several rows tied for the minimal value and you want all of them, use boolean indexing with transform, which returns the minimal value per group aligned to the original rows:

print (df)
   item  diff  otherstuff
0     1     2           1
1     1     1           2 <-multiple min
2     1     1           7 <-multiple min
3     2    -1           0
4     2     1           3
5     2     4           9
6     2    -6           2
7     3     0           0
8     3     2           9

print (df.groupby("item")["diff"].transform('min'))
0    1
1    1
2    1
3   -6
4   -6
5   -6
6   -6
7    0
8    0
Name: diff, dtype: int64

df = df[df.groupby("item")["diff"].transform('min') == df['diff']]
print (df)
   item  diff  otherstuff
1     1     1           2
2     1     1           7
6     2    -6           2
7     3     0           0
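
For convenience, the same approach as a self-contained sketch. The DataFrame is rebuilt from the tied-minimum data printed above; the names group_min and all_min_rows are illustrative.

```python
import pandas as pd

# Data from the example above: item 1 has two rows tied at diff == 1.
df = pd.DataFrame({
    "item": [1, 1, 1, 2, 2, 2, 2, 3, 3],
    "diff": [2, 1, 1, -1, 1, 4, -6, 0, 2],
    "otherstuff": [1, 2, 7, 0, 3, 9, 2, 0, 9],
})

# transform('min') broadcasts each group's minimum back onto every row of that group.
group_min = df.groupby("item")["diff"].transform("min")

# Keep every row whose diff equals its group's minimum, ties included.
all_min_rows = df[df["diff"] == group_min]
print(all_min_rows)
```

Since the comparison is row-aligned, no apply or sorting is needed and the original index is preserved.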
