Python: Pandas - 按组删除第一行
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/31226142/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me):
StackOverFlow
Python: Pandas - Delete the first row by group
提问 by Plug4
I have the following large dataframe (df) that looks like this:
我有以下这个大数据框 (df),它看起来像这样:
ID date PRICE
1 10001 19920103 14.500
2 10001 19920106 14.500
3 10001 19920107 14.500
4 10002 19920108 15.125
5 10002 19920109 14.500
6 10002 19920110 14.500
7 10003 19920113 14.500
8 10003 19920114 14.500
9 10003 19920115 15.000
Question: What's the most efficient way to delete (or remove) the first row of each ID? I want this:
问题:删除(或移除)每个 ID 的第一行的最有效方法是什么?我要这个:
ID date PRICE
2 10001 19920106 14.500
3 10001 19920107 14.500
5 10002 19920109 14.500
6 10002 19920110 14.500
8 10003 19920114 14.500
9 10003 19920115 15.000
I can do a loop over each unique ID and remove the first row, but I believe this is not very efficient.
我可以对每个唯一的 ID 循环一遍并删除第一行,但我认为这不是很高效。
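For reference, the per-ID loop mentioned above might look roughly like the sketch below (an assumption about the intended approach, not code from the question). It issues one drop call per unique ID, which is exactly what the answers below manage to avoid:

result = df.copy()
for id_value in result['ID'].unique():                       # hypothetical loop over each unique ID
    first_idx = result.index[result['ID'] == id_value][0]    # index label of that group's first row
    result = result.drop(first_idx)                          # one drop per group -> slow when there are many IDs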
回答 by Jianxun Li
Another one-line solution is df.groupby('ID').apply(lambda group: group.iloc[1:, 1:])
另一个单行写法是 df.groupby('ID').apply(lambda group: group.iloc[1:, 1:])
Out[100]:
date PRICE
ID
10001 2 19920106 14.5
3 19920107 14.5
10002 5 19920109 14.5
6 19920110 14.5
10003 8 19920114 14.5
9 19920115 15.0
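Note that group.iloc[1:, 1:] also drops the first column (ID), since the ID already appears as the group key, and the result carries a hierarchical index, as shown above. If you want exactly the flat layout from the question, a close variant (the same idea as the using_apply_alt function benchmarked in the next answer) is a reasonable sketch, assuming the df from the question:

out = df.groupby('ID', group_keys=False).apply(lambda g: g.iloc[1:])
print(out)   # rows 2, 3, 5, 6, 8, 9 remain, with the ID column and the flat index kept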
回答 by unutbu
You could use groupby/transform to prepare a boolean mask which is True for the rows you want and False for the rows you don't want. Once you have such a boolean mask, you can select the sub-DataFrame using df.loc[mask]:
您可以使用 groupby/transform 准备一个布尔掩码:对于您想要保留的行为 True,对于您不想要的行为 False。一旦有了这样的布尔掩码,就可以使用 df.loc[mask] 选取子 DataFrame:
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'ID': [10001, 10001, 10001, 10002, 10002, 10002, 10003, 10003, 10003],
     'PRICE': [14.5, 14.5, 14.5, 15.125, 14.5, 14.5, 14.5, 14.5, 15.0],
     'date': [19920103, 19920106, 19920107, 19920108, 19920109, 19920110,
              19920113, 19920114, 19920115]},
    index=range(1, 10))

def mask_first(x):
    # Array of ones the same length as the group, with the first entry zeroed out.
    result = np.ones_like(x)
    result[0] = 0
    return result

mask = df.groupby(['ID'])['ID'].transform(mask_first).astype(bool)
print(df.loc[mask])
yields
输出
ID PRICE date
2 10001 14.5 19920106
3 10001 14.5 19920107
5 10002 14.5 19920109
6 10002 14.5 19920110
8 10003 14.5 19920114
9 10003 15.0 19920115
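To see what the intermediate mask looks like for the sample data (a quick check using the df and mask built above), you can print it before selecting with df.loc[mask]:

print(mask.tolist())
# [False, True, True, False, True, True, False, True, True]
# i.e. the first row of each ID group is masked out, every other row is kept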
Since you're interested in efficiency, here is a benchmark:
由于您对效率感兴趣,这里是一个基准:
import timeit
import operator
import numpy as np
import pandas as pd

N = 10000
df = pd.DataFrame(
    {'ID': np.random.randint(100, size=(N,)),
     'PRICE': np.random.random(N),
     'date': np.random.random(N)})

def using_mask(df):
    # Boolean-mask approach: zero out the first entry of each group, then select with df.loc.
    def mask_first(x):
        result = np.ones_like(x)
        result[0] = 0
        return result
    mask = df.groupby(['ID'])['ID'].transform(mask_first).astype(bool)
    return df.loc[mask]

def using_apply(df):
    # groupby/apply approach from the answer above.
    return df.groupby('ID').apply(lambda group: group.iloc[1:, 1:])

def using_apply_alt(df):
    # groupby/apply variant that keeps all columns and the original index.
    return df.groupby('ID', group_keys=False).apply(lambda x: x[1:])

timing = dict()
for func in (using_mask, using_apply, using_apply_alt):
    timing[func] = timeit.timeit(
        '{}(df)'.format(func.__name__),
        'from __main__ import df, {}'.format(func.__name__),
        number=100)

for func, t in sorted(timing.items(), key=operator.itemgetter(1)):
    print('{:16}: {:.2f}'.format(func.__name__, t))
reports
报告
using_mask : 0.85
using_apply_alt : 2.04
using_apply : 3.70
回答 by Rriskit
Old but still viewed quite often: a much faster solution is nth(0) combined with drop_duplicates:
这是个老问题,但仍然经常被浏览:一个快得多的解决方案是用 nth(0) 结合 drop_duplicates:
def using_nth(df):
    # First row of each ID group; these are the rows to delete.
    to_del = df.groupby('ID', as_index=False).nth(0)
    # Appending them again and dropping every duplicated row removes exactly those rows.
    return pd.concat([df, to_del]).drop_duplicates(keep=False)
On my system, the timings for unutbu's benchmark setup are:
在我的系统上,使用 unutbu 的基准设置得到的时间如下:
using_nth : 0.43
using_apply_alt : 1.93
using_mask : 2.11
using_apply : 4.33
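As a quick sanity check, the nth(0)/drop_duplicates trick can be applied to the sample df from the question. This is a sketch under the assumption that the original data contains no fully duplicated rows, since drop_duplicates(keep=False) would also remove such genuine duplicates:

to_del = df.groupby('ID', as_index=False).nth(0)              # first row of each group: original rows 1, 4, 7
result = pd.concat([df, to_del]).drop_duplicates(keep=False)  # rows appearing twice are dropped entirely
print(result)                                                 # rows 2, 3, 5, 6, 8, 9, matching the desired output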

