Original source: http://stackoverflow.com/questions/24986968/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me):
StackOverflow
Combine rows (under a condition) in a Pandas Dataframe
Asked by Blue482
I have a dataframe (df), originally from an Excel file, and the first 9 rows look like this:
   Control    Recd_Date/Due_Date   Action                    Signature/Requester
0  2000-1703  2000-01-31 00:00:00  OC/OER/OPA/PMS/           M WEBB
1  NaN        2000-02-29 00:00:00  NaN                       DATA CORP
2  2000-1776  2000-01-02 00:00:00  OC/ORA/OE/DCP/            G KAN
3  NaN        2000-01-03 00:00:00  OC/ORA/ORO/PNC/           PALM POST
4  NaN        NaN                  FDA/OGROP/ORA/SE-FO/FLA-  NaN
5  NaN        NaN                  DO/FLA-CB/                NaN
6  2000-1983  2000-02-02 00:00:00  FDA/OGROP/ORA/CE-FO/CHI-  M EGAN
7  NaN        2000-02-03 00:00:00  DO/CHI-CB/                BERNSTEIN LIEBHARD &
8  NaN        NaN                  NaN                       LONDON LLP
- type(df['Control'][1]) = float
- type(df['Recd_Date/Due_Date'][1]) = datetime.datetime
- type(df['Action_Office'][1]) = float
- type(df['Signature/Requester'][1]) = unicode
I want to transform this dataframe (e.g. first 9 rows) to this:
   Control    Recd_Date/Due_Date                       Action                                                           Signature/Requester
0  2000-1703  2000-01-31 00:00:00,2000-02-29 00:00:00  OC/OER/OPA/PMS/                                                  M WEBB,DATA CORP
1  2000-1776  2000-01-02 00:00:00,2000-01-03 00:00:00  OC/ORA/OE/DCP/OC/ORA/ORO/PNC/FDA/OGROP/ORA/SE-FO/FLA-DO/FLA-CB/  G KAN,PALM POST
2  2000-1983  2000-02-02 00:00:00,2000-02-03 00:00:00  FDA/OGROP/ORA/CE-FO/CHI-DO/CHI-CB/                               M EGAN,BERNSTEIN LIEBHARD & LONDON LLP
So basically:
- Every time pd.isnull(row['Control']) is true (this should be the only if condition), merge that row into the previous row whose 'Control' value is not null.
- For 'Recd_Date/Due_Date' and 'Signature/Requester', add ',' (or '/') between the values from the merged rows (e.g. '2000-01-31 00:00:00,2000-02-29 00:00:00' and 'G KAN,PALM POST').
- For 'Action', simply concatenate the values without adding any punctuation (e.g. FDA/OGROP/ORA/CE-FO/CHI-DO/CHI-CB/).
Can anyone help me out, please? This is the code I'm trying to get to work:
for i, row in df.iterrows():
    if pd.isnull(df.ix[i]['Control_#']):
        df.ix[i-1]['Recd_Date/Due_Date'] = str(df.ix[i-1]['Recd_Date/Due_Date'])+'/'+str(df.ix[i]['Recd_Date/Due_Date'])
        df.ix[i-1]['Subject'] = str(df.ix[i-1]['Subject'])+' '+str(df.ix[i]['Subject'])
        if str(df.ix[i-1]['Action_Office'])[-1] == '-':
            df.ix[i-1]['Action_Office'] = str(df.ix[i-1]['Action_Office'])+str(df.ix[i]['Action_Office'])
        else:
            df.ix[i-1]['Action_Office'] = str(df.ix[i-1]['Action_Office'])+','+str(df.ix[i]['Action_Office'])
        if pd.isnull(df.ix[i-1]['Signature/Requester']):
            df.ix[i-1]['Signature/Requester'] = str(df.ix[i-1]['Signature/Requester'])+str(df.ix[i]['Signature/Requester'])
        elif str(df.ix[i-1]['Signature/Requester'])[-1] == '&':
            df.ix[i-1]['Signature/Requester'] = str(df.ix[i-1]['Signature/Requester'])+' '+str(df.ix[i]['Signature/Requester'])
        else:
            df.ix[i-1]['Signature/Requester'] = str(df.ix[i-1]['Signature/Requester'])+','+str(df.ix[i]['Signature/Requester'])
        df.drop(df.index[i])
How come the drop() doesn't work? I am trying to drop the current row (if its ['Control_#'] is null) so that the next row (whose ['Control_#'] is also null) can be added to the previous row (whose ['Control_#'] is NOT null) iteratively.
Much appreciated!!
Answered by ZJS
I think you need to group the rows together and then join up the column values. The tricky part is finding a way to group together the rows in the way you want. Here is my solution...
1) Grouping Together the Rows: Static variables
Since your groups depend on the sequence of your rows, I used a static variable in a function to label every row with a specific group.
def rolling_group(val):
    if pd.notnull(val): rolling_group.group +=1 #pd.notnull is signal to switch group
    return rolling_group.group
rolling_group.group = 0 #static variable
This function is applied along the Control series to sort the indexes into groups, which are then used to split up the dataframe so that you can merge rows.
#groups = df.groupby(df['Control'].apply(rolling_group),as_index=False)
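The same group labels can also be produced without mutable function state. The following is only a minimal sketch of an alternative, not the approach used in this answer: it assumes the column is called Control as in the sample data, and the variable names control and group_ids are made up for illustration. A cumulative sum over the non-null mask increases by one at every row that starts a new group.

```python
import numpy as np
import pandas as pd

# The Control column from the sample data in the question.
control = pd.Series(
    ["2000-1703", np.nan, "2000-1776", np.nan, np.nan, np.nan,
     "2000-1983", np.nan, np.nan],
    name="Control",
)

# notna() is True on rows that start a new group; the running sum of that
# boolean mask gives every row the number of the group it belongs to.
group_ids = control.notna().cumsum()
print(group_ids.tolist())  # [1, 1, 2, 2, 2, 2, 3, 3, 3]
```

Unlike the static-variable version, this can be re-run on the same data without resetting a counter first.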
That is really the only tricky part. After that, you can merge the rows by applying a function to each group that gives you your desired output.
Full Solution Code
import re
from io import StringIO

import pandas as pd

def rolling_group(val):
    # A non-null Control value is the signal to switch to a new group.
    if pd.notnull(val):
        rolling_group.group += 1
    return rolling_group.group
rolling_group.group = 0  # static variable (stored as a function attribute)

def joinFunc(g, column):
    col = g[column]
    joiner = "/" if column == "Action" else ","
    s = joiner.join([str(each) for each in col if pd.notnull(each)])
    s = re.sub("(?<=&)" + joiner, " ", s)  # after '&', join with a space instead
    s = re.sub("(?<=-)" + joiner, "", s)   # after '-', join with nothing
    s = re.sub(joiner * 2, joiner, s)      # fixes double joiner condition
    return s
# edit above - str(each) - to convert to strings; the regexes clean up the joined string

if __name__ == "__main__":
    # Fields are separated by two or more spaces; the header has one field fewer
    # than the rows, so the first column is read as the index.
    df = """Control    Recd_Date/Due_Date   Action                    Signature/Requester
0  2000-1703  2000-01-31 00:00:00  OC/OER/OPA/PMS/           M WEBB
1  NaN        2000-02-29 00:00:00  NaN                       DATA CORP
2  2000-1776  2000-01-02 00:00:00  OC/ORA/OE/DCP/            G KAN
3  NaN        2000-01-03 00:00:00  OC/ORA/ORO/PNC/           PALM POST
4  NaN        NaN                  FDA/OGROP/ORA/SE-FO/FLA-  NaN
5  NaN        NaN                  DO/FLA-CB/                NaN
6  2000-1983  2000-02-02 00:00:00  FDA/OGROP/ORA/CE-FO/CHI-  M EGAN
7  NaN        2000-02-03 00:00:00  DO/CHI-CB/                BERNSTEIN LIEBHARD &
8  NaN        NaN                  NaN                       LONDON LLP"""
    df = pd.read_csv(StringIO(df), sep=r"\s\s+", engine='python')
    groups = df.groupby(df['Control'].apply(rolling_group), as_index=False)
    groupFunct = lambda g: pd.Series([joinFunc(g, col) for col in g.columns], index=g.columns)
    print(groups.apply(groupFunct))
Output
Control Recd_Date/Due_Date \
0 2000-1703 2000-01-31 00:00:00,2000-02-29 00:00:00
1 2000-1776 2000-01-02 00:00:00,2000-01-03 00:00:00
2 2000-1983 2000-02-02 00:00:00,2000-02-03 00:00:00
Action \
0 OC/OER/OPA/PMS/
1 OC/ORA/OE/DCP/OC/ORA/ORO/PNC/FDA/OGROP/ORA/SE-...
2 FDA/OGROP/ORA/CE-FO/CHI-DO/CHI-CB/
Signature/Requester
0 M WEBB,DATA CORP
1 G KAN,PALM POST
2 M EGAN,BERNSTEIN LIEBHARD & LONDON LLP
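As for why the drop() call in the question appears to do nothing: DataFrame.drop returns a new DataFrame and leaves the original untouched unless the result is assigned back (or inplace=True is passed). A minimal sketch with a made-up two-row frame (the data and the names len_after_discarded and len_after_assigned are illustrative, not from the question):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Control": ["2000-1703", np.nan],
    "Action": ["OC/OER/OPA/PMS/", np.nan],
})

# The return value is thrown away, so df still has both rows.
df.drop(df.index[1])
len_after_discarded = len(df)

# Assigning the result back is what actually removes the row.
df = df.drop(df.index[1])
len_after_assigned = len(df)

print(len_after_discarded, len_after_assigned)  # 2 1
```

Even with that fixed, removing rows while iterating over the same frame is fragile, which is one reason the grouping approach above is easier to get right.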

