Pandas - Duplicate Row based on condition
Source: http://stackoverflow.com/questions/43053814/ (StackOverflow, licensed under CC BY-SA 4.0; attribute to the original authors)
Asked by Walt Reed
I'm trying to create a duplicate row if the row meets a condition. In the table below, I created a cumulative count based on a groupby, then another calculation for the MAX of the groupby.
df['PathID'] = df.groupby('Date Completed').cumcount() + 1
df['MaxPathID'] = df.groupby('Date Completed')['PathID'].transform('max')
Date Completed  PathID  MaxPathID
1/31/17         1       3
1/31/17         2       3
1/31/17         3       3
2/1/17          1       1
2/2/17          1       2
2/2/17          2       2
In this case, I want to duplicate only the record for 2/1/17 since there is only one instance for that date (i.e. where the MaxPathID == 1).
Desired Output:
Date Completed  PathID  MaxPathID
1/31/17         1       3
1/31/17         2       3
1/31/17         3       3
2/1/17          1       1
2/1/17          1       1
2/2/17          1       2
2/2/17          2       2
Thanks in advance!
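To reproduce the starting table from the question, a minimal sketch follows. The DataFrame construction is an assumption based on the table shown (the question does not show how df was built), and the column is assumed to be named 'Date Completed' as in the table header.

```python
import pandas as pd

# Sample data mirroring the table in the question (assumed construction).
df = pd.DataFrame({
    'Date Completed': ['1/31/17', '1/31/17', '1/31/17',
                       '2/1/17', '2/2/17', '2/2/17']
})

# Running counter within each date, starting at 1.
df['PathID'] = df.groupby('Date Completed').cumcount() + 1

# Largest PathID per date, broadcast back to every row of that date.
df['MaxPathID'] = df.groupby('Date Completed')['PathID'].transform('max')

print(df)
```

A date that occurs only once ends up with MaxPathID equal to 1, which is the condition the question wants to act on.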
Accepted answer by jezrael
I think you need to get the unique rows by Date Completed and then concat them to the original:
# Keep the 'Date Completed' value of every row whose date occurs exactly once
df1 = df.loc[~df['Date Completed'].duplicated(keep=False), ['Date Completed']]
print(df1)
  Date Completed
3         2/1/17

# Append those rows, then recompute both helper columns
df = pd.concat([df, df1], ignore_index=True).sort_values('Date Completed')
df['PathID'] = df.groupby('Date Completed').cumcount() + 1
df['MaxPathID'] = df.groupby('Date Completed')['PathID'].transform('max')
print(df)
  Date Completed  PathID  MaxPathID
0        1/31/17       1          3
1        1/31/17       2          3
2        1/31/17       3          3
3         2/1/17       1          2
6         2/1/17       2          2
4         2/2/17       1          2
5         2/2/17       2          2
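One caveat about the sort step: Date Completed holds strings here, so sort_values orders them lexicographically. That works for this sample, but a string such as '10/1/17' would sort before '2/1/17'. A minimal sketch of one way around it, assuming the dates follow the month/day/two-digit-year format shown in the question. The sample data and the helper column name _date are hypothetical and not part of the original answer.

```python
import pandas as pd

# Hypothetical sample that includes a two-digit month.
df = pd.DataFrame({'Date Completed': ['10/1/17', '2/1/17', '2/2/17', '2/2/17']})

# Rows whose date occurs exactly once (same mask as in the answer).
df1 = df[~df['Date Completed'].duplicated(keep=False)]

out = pd.concat([df, df1], ignore_index=True)

# Assumption: dates are month/day/two-digit year, as in the question.
out['_date'] = pd.to_datetime(out['Date Completed'], format='%m/%d/%y')

# A stable sort keeps rows with equal dates in their concatenated order.
out = out.sort_values('_date', kind='mergesort').drop(columns='_date')

print(out['Date Completed'].tolist())
```

Sorting on the parsed dates places the October rows last, whereas sorting the raw strings would place them first.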
EDIT:
print(df)
  Date Completed  a  b
0        1/31/17  4  5
1        1/31/17  3  5
2        1/31/17  6  3
3         2/1/17  7  9
4         2/2/17  2  0
5         2/2/17  6  7

# Select the full rows whose date occurs exactly once
df1 = df[~df['Date Completed'].duplicated(keep=False)]
# alternative - boolean indexing by numpy array
# df1 = df[~df['Date Completed'].duplicated(keep=False).values]
print(df1)
  Date Completed  a  b
3         2/1/17  7  9

df = pd.concat([df, df1], ignore_index=True).sort_values('Date Completed')
print(df)
  Date Completed  a  b
0        1/31/17  4  5
1        1/31/17  3  5
2        1/31/17  6  3
3         2/1/17  7  9
6         2/1/17  7  9
4         2/2/17  2  0
5         2/2/17  6  7
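If the MaxPathID column from the question already exists, the selection can also be written with the condition the question states (MaxPathID == 1). The following is a minimal sketch and not part of the original answer. It assumes the default integer index and sorts by that index with a stable algorithm, so each copy lands directly after its source row. The names singles and result are illustrative.

```python
import pandas as pd

# Table from the question, with both helper columns already computed.
df = pd.DataFrame({
    'Date Completed': ['1/31/17', '1/31/17', '1/31/17',
                       '2/1/17', '2/2/17', '2/2/17'],
    'PathID': [1, 2, 3, 1, 1, 2],
    'MaxPathID': [3, 3, 3, 1, 2, 2],
})

# Rows that are the only record for their date.
singles = df[df['MaxPathID'] == 1]

# Append the copies, then sort by the original index so each copy
# sits directly after the row it came from.
result = (
    pd.concat([df, singles])
    .sort_index(kind='mergesort')
    .reset_index(drop=True)
)

print(result)
```

Unlike the recomputed columns in the first snippet of the answer, this keeps PathID and MaxPathID at 1 for both copies, which matches the desired output in the question.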
Answer by piRSquared
A creative numpy approach using duplicated + repeat:
import numpy as np

dc = df['Date Completed']
# Repeat count per row: 2 where the date occurs only once, otherwise 1
rg = np.arange(len(dc)).repeat((~dc.duplicated(keep=False).values) + 1)
df.iloc[rg]
  Date Completed  PathID  MaxPathID
0        1/31/17       1          3
1        1/31/17       2          3
2        1/31/17       3          3
3         2/1/17       1          1
3         2/1/17       1          1
4         2/2/17       1          2
5         2/2/17       2          2
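The repeat idea above can also be written against index labels with Index.repeat, which avoids building the positions by hand. This is a minimal sketch and not part of the original answer. It assumes the index labels are unique, and the names is_single, counts and result are illustrative.

```python
import numpy as np
import pandas as pd

# Table from the question, with both helper columns already computed.
df = pd.DataFrame({
    'Date Completed': ['1/31/17', '1/31/17', '1/31/17',
                       '2/1/17', '2/2/17', '2/2/17'],
    'PathID': [1, 2, 3, 1, 1, 2],
    'MaxPathID': [3, 3, 3, 1, 2, 2],
})

# True where the date appears only once in the column.
is_single = ~df['Date Completed'].duplicated(keep=False)

# Repeat count per row: 2 for single-occurrence dates, 1 otherwise.
counts = np.where(is_single, 2, 1)

# Index.repeat duplicates the labels; .loc then pulls the matching rows.
result = df.loc[df.index.repeat(counts)]

print(result)
```

As with the iloc version, the duplicated row keeps its original index label, so add reset_index(drop=True) if a fresh 0..n-1 index is wanted.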