Pandas - 基于条件的重复行

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/43053814/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-09-14 03:17:11  来源:igfitidea点击:

Pandas - Duplicate Row based on condition

pythonpandasgroup-byduplicates

提问by Walt Reed

I'm trying to create a duplicate row if the row meets a condition. In the table below, I created a cumulative count based on a groupby, then another calculation for the MAX of the groupby.

如果行满足条件,我正在尝试创建一个重复的行。在下表中,我根据 groupby 创建了一个累积计数,然后对 groupby 的 MAX 进行了另一个计算。

df['PathID'] = df.groupby(DateCompleted).cumcount() + 1
df['MaxPathID'] = df.groupby(DateCompleted)['PathID'].transform(max)

Date Completed    PathID    MaxPathID
1/31/17           1         3
1/31/17           2         3
1/31/17           3         3
2/1/17            1         1
2/2/17            1         2
2/2/17            2         2

In this case, I want to duplicate only the record for 2/1/17 since there is only one instance for that date (i.e. where the MaxPathID == 1).

在这种情况下,我只想复制 2/1/17 的记录,因为该日期只有一个实例(即 MaxPathID == 1)。

Desired Output:

期望输出:

Date Completed    PathID    MaxPathID
1/31/17           1         3
1/31/17           2         3
1/31/17           3         3
2/1/17            1         1
2/1/17            1         1
2/2/17            1         2
2/2/17            2         2

Thanks in advance!

提前致谢!

采纳答案by jezrael

I think you need get uniquerows by Date Completedand then concatrows to original:

我认为您需要先获取uniqueDate Completed,然后再concat获取原始行:

df1 = df.loc[~df['Date Completed'].duplicated(keep=False), ['Date Completed']]
print (df1)
  Date Completed
3         2/1/17

df = pd.concat([df,df1], ignore_index=True).sort_values('Date Completed')
df['PathID'] = df.groupby('Date Completed').cumcount() + 1
df['MaxPathID'] = df.groupby('Date Completed')['PathID'].transform(max)
print (df)
  Date Completed  PathID  MaxPathID
0        1/31/17       1          3
1        1/31/17       2          3
2        1/31/17       3          3
3         2/1/17       1          2
6         2/1/17       2          2
4         2/2/17       1          2
5         2/2/17       2          2

EDIT:

编辑:

print (df)
  Date Completed  a  b
0        1/31/17  4  5
1        1/31/17  3  5
2        1/31/17  6  3
3         2/1/17  7  9
4         2/2/17  2  0
5         2/2/17  6  7

df1 = df[~df['Date Completed'].duplicated(keep=False)]
#alternative - boolean indexing by numpy array
#df1 = df[~df['Date Completed'].duplicated(keep=False).values]
print (df1)
  Date Completed  a  b
3         2/1/17  7  9

df = pd.concat([df,df1], ignore_index=True).sort_values('Date Completed')
print (df)
  Date Completed  a  b
0        1/31/17  4  5
1        1/31/17  3  5
2        1/31/17  6  3
3         2/1/17  7  9
6         2/1/17  7  9
4         2/2/17  2  0
5         2/2/17  6  7

回答by piRSquared

A creative numpyapproach using duplicated+ repeat

numpy使用duplicated+的创造性方法repeat

dc = df['Date Completed']
rg = np.arange(len(dc)).repeat((~dc.duplicated(keep=False).values) + 1)
df.iloc[rg]

  Date Completed  PathID  MaxPathID
0        1/31/17       1          3
1        1/31/17       2          3
2        1/31/17       3          3
3         2/1/17       1          1
3         2/1/17       1          1
4         2/2/17       1          2
5         2/2/17       2          2