Notice: this page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/24986968/

Time: 2020-09-13 22:17:54  Source: igfitidea

Combine rows (under a condition) in a Pandas Dataframe

python  pandas  data-cleaning

Asked by Blue482

I have a dataframe (df) (originally from an Excel file) and the first 9 rows look like this:

      Control      Recd_Date/Due_Date                Action        Signature/Requester
0     2000-1703   2000-01-31 00:00:00           OC/OER/OPA/PMS/                 M WEBB
1           NaN   2000-02-29 00:00:00                       NaN              DATA CORP
2     2000-1776   2000-01-02 00:00:00            OC/ORA/OE/DCP/                  G KAN
3           NaN   2000-01-03 00:00:00           OC/ORA/ORO/PNC/              PALM POST
4           NaN                   NaN  FDA/OGROP/ORA/SE-FO/FLA-                    NaN
5           NaN                   NaN                DO/FLA-CB/                    NaN
6     2000-1983   2000-02-02 00:00:00  FDA/OGROP/ORA/CE-FO/CHI-                 M EGAN
7           NaN   2000-02-03 00:00:00                DO/CHI-CB/   BERNSTEIN LIEBHARD &
8           NaN                   NaN                       NaN             LONDON LLP
  • type(df['Control'][1]) = float
  • type(df['Recd_Date/Due_Date'][1]) = datetime.datetime
  • type(df['Action_Office'][1]) = float
  • type(df['Signature/Requester'][1]) = unicode

I want to transform this dataframe (e.g. first 9 rows) to this:

      Control            Recd_Date/Due_Date                           Action                                                            Signature/Requester
0     2000-1703   2000-01-31 00:00:00,2000-02-29 00:00:00           OC/OER/OPA/PMS/                                                      M WEBB,DATA CORP
1     2000-1776   2000-01-02 00:00:00,2000-01-03 00:00:00           OC/ORA/OE/DCP/OC/ORA/ORO/PNC/FDA/OGROP/ORA/SE-FO/FLA-DO/FLA-CB/      G KAN,PALM POST
2     2000-1983   2000-02-02 00:00:00,2000-02-03 00:00:00           FDA/OGROP/ORA/CE-FO/CHI-DO/CHI-CB/                                   M EGAN,BERNSTEIN LIEBHARD & LONDON LLP

So basically:

  • Every time pd.isnull(row['Control']) is true (this should be the only if condition), merge that row into the previous row whose 'Control' value is not null.
  • For 'Recd_Date/Due_Date' and 'Signature/Requester', add ',' (or '/') between the values from the merged rows (e.g. '2000-01-31 00:00:00,2000-02-29 00:00:00' and 'G KAN,PALM POST').
  • For 'Action', simply concatenate the values without adding any punctuation (e.g. FDA/OGROP/ORA/CE-FO/CHI-DO/CHI-CB/).

Can anyone help me out, please? This is the code I'm trying to get to work:

for i, row in df.iterrows():
    if pd.isnull(df.ix[i]['Control_#']):
       df.ix[i-1]['Recd_Date/Due_Date'] = str(df.ix[i-1]['Recd_Date/Due_Date'])+'/'+str(df.ix[i]['Recd_Date/Due_Date'])
       df.ix[i-1]['Subject'] = str(df.ix[i-1]['Subject'])+' '+str(df.ix[i]['Subject'])
       if str(df.ix[i-1]['Action_Office'])[-1] == '-':
           df.ix[i-1]['Action_Office'] = str(df.ix[i-1]['Action_Office'])+str(df.ix[i]['Action_Office'])
       else:
           df.ix[i-1]['Action_Office'] = str(df.ix[i-1]['Action_Office'])+','+str(df.ix[i]['Action_Office'])
       if pd.isnull(df.ix[i-1]['Signature/Requester']):
           df.ix[i-1]['Signature/Requester'] = str(df.ix[i-1]['Signature/Requester'])+str(df.ix[i]['Signature/Requester'])
       elif str(df.ix[i-1]['Signature/Requester'])[-1] == '&':
           df.ix[i-1]['Signature/Requester'] = str(df.ix[i-1]['Signature/Requester'])+' '+str(df.ix[i]['Signature/Requester'])
       else:
           df.ix[i-1]['Signature/Requester'] = str(df.ix[i-1]['Signature/Requester'])+','+str(df.ix[i]['Signature/Requester'])
       df.drop(df.index[i])

How come the drop() doesn't work? I am trying to drop the current row (if its ['Control_#'] is null) so that the next row (whose ['Control_#'] is also null) can be added to the previous row (whose ['Control_#'] is NOT null), iteratively.

Much appreciated!!

Answered by ZJS

I think you need to group the rows together and then join up the column values. The tricky part is finding a way to group together the rows in the way you want. Here is my solution...

1) Grouping Together the Rows: Static variables

Since your groups depend on the order of your rows, I used a static variable in a function to assign every row to a specific group:

def rolling_group(val):
    if pd.notnull(val): rolling_group.group +=1 #pd.notnull is signal to switch group
    return rolling_group.group
rolling_group.group = 0 #static variable

This function is applied along the Control series to assign each index to a group; the result is then used to split up the dataframe so that you can merge the rows:

#groups = df.groupby(df['Control'].apply(rolling_group),as_index=False)
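
The same group labels can also be produced without a stateful function. The following is only a minimal sketch, assuming a pandas version that provides Series.notna; the Control values are copied from the sample data and the name group_ids is chosen here for illustration. A cumulative sum over the non-null mask goes up by one at every row that starts a new record:

```python
import numpy as np
import pandas as pd

# Control column from the sample: a non-null value starts a new record
control = pd.Series(["2000-1703", np.nan, "2000-1776", np.nan, np.nan,
                     np.nan, "2000-1983", np.nan, np.nan])

# True where a record starts; the running total is the group label
group_ids = control.notna().cumsum()

print(group_ids.tolist())  # [1, 1, 2, 2, 2, 2, 3, 3, 3]
```

Unlike the counter stored on rolling_group, this does not keep state between calls, so re-running it on the same data gives the same labels.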

That is really the only tricky part. After that, you can merge the rows by applying a function to each group that produces your desired output.
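
As a rough illustration of the per-group merge, here is a sketch that uses groupby(...).agg with one join rule per column. It is not the joinFunc from this answer: the helper join_with, the four-row frame, and the label name "group" are assumptions made for the example, and it leaves out the special handling of values ending in "&".

```python
import numpy as np
import pandas as pd

# Hypothetical cut-down version of the sample data
df = pd.DataFrame({
    "Control": ["2000-1703", np.nan, "2000-1776", np.nan],
    "Action": ["OC/OER/OPA/PMS/", np.nan, "OC/ORA/OE/DCP/", "OC/ORA/ORO/PNC/"],
    "Signature/Requester": ["M WEBB", "DATA CORP", "G KAN", "PALM POST"],
})

def join_with(sep):
    """Build an aggregator that joins a column's non-null values with sep."""
    return lambda col: sep.join(col.dropna().astype(str))

# One label per record; renamed so it does not clash with the Control column
group_ids = df["Control"].notna().cumsum().rename("group")

merged = df.groupby(group_ids).agg({
    "Control": "first",                     # first non-null value in the group
    "Action": join_with(""),                # concatenate without punctuation
    "Signature/Requester": join_with(","),  # comma-separated
}).reset_index(drop=True)

print(merged)
```

Passing a dict to agg keeps the rule for each column in one place, which may be easier to adjust than branching on the column name inside a single function.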

Full Solution Code

import re
from io import StringIO

import pandas as pd

def rolling_group(val):
    # A non-null Control value is the signal to switch to a new group
    if pd.notnull(val):
        rolling_group.group += 1
    return rolling_group.group
rolling_group.group = 0  # function attribute used as a static variable

def joinFunc(g, column):
    col = g[column]
    joiner = "/" if column == "Action" else ","
    # str(each) converts every value to a string before joining
    s = joiner.join([str(each) for each in col if pd.notnull(each)])
    # The regexes below clean up the joined string
    s = re.sub("(?<=&)" + joiner, " ", s)  # after "&": replace the joiner with a space
    s = re.sub("(?<=-)" + joiner, "", s)   # after "-": remove the joiner
    s = re.sub(joiner * 2, joiner, s)      # collapse a doubled joiner
    return s

if __name__ == "__main__":
    df = """      Control      Recd_Date/Due_Date                Action        Signature/Requester
0     2000-1703   2000-01-31 00:00:00           OC/OER/OPA/PMS/                 M WEBB
1           NaN   2000-02-29 00:00:00                       NaN              DATA CORP
2     2000-1776   2000-01-02 00:00:00            OC/ORA/OE/DCP/                  G KAN
3           NaN   2000-01-03 00:00:00           OC/ORA/ORO/PNC/              PALM POST
4           NaN                   NaN  FDA/OGROP/ORA/SE-FO/FLA-                    NaN
5           NaN                   NaN                DO/FLA-CB/                    NaN
6     2000-1983   2000-02-02 00:00:00  FDA/OGROP/ORA/CE-FO/CHI-                 M EGAN
7           NaN   2000-02-03 00:00:00                DO/CHI-CB/   BERNSTEIN LIEBHARD &
8           NaN                   NaN                       NaN             LONDON LLP"""
    # Columns are separated by two or more whitespace characters
    df = pd.read_csv(StringIO(df), sep=r"\s\s+", engine='python')

    groups = df.groupby(df['Control'].apply(rolling_group), as_index=False)
    groupFunct = lambda g: pd.Series([joinFunc(g, col) for col in g.columns], index=g.columns)
    print(groups.apply(groupFunct))

output

     Control                       Recd_Date/Due_Date  \
0  2000-1703  2000-01-31 00:00:00,2000-02-29 00:00:00   
1  2000-1776  2000-01-02 00:00:00,2000-01-03 00:00:00   
2  2000-1983  2000-02-02 00:00:00,2000-02-03 00:00:00   

                                              Action  \
0                                    OC/OER/OPA/PMS/   
1  OC/ORA/OE/DCP/OC/ORA/ORO/PNC/FDA/OGROP/ORA/SE-...   
2                 FDA/OGROP/ORA/CE-FO/CHI-DO/CHI-CB/   

                      Signature/Requester  
0                        M WEBB,DATA CORP  
1                         G KAN,PALM POST  
2  M EGAN,BERNSTEIN LIEBHARD & LONDON LLP
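
The question also asked why drop() appeared to do nothing. A likely cause, sketched here on a hypothetical two-row frame and assuming a current pandas release, is that DataFrame.drop returns a new dataframe by default instead of modifying df, so the result has to be assigned back. Note also that .ix was removed in pandas 1.0, and that chained writes such as df.ix[i-1][col] = ... may update a temporary copy; a single df.loc[row, col] = value avoids that.

```python
import pandas as pd

# Hypothetical two-row frame, only to show the behaviour
df = pd.DataFrame({"Control": ["2000-1703", None],
                   "Action": ["FDA/OGROP/ORA/CE-FO/CHI-", "DO/CHI-CB/"]})

# Write with a single .loc call rather than chained indexing
df.loc[0, "Action"] = df.loc[0, "Action"] + df.loc[1, "Action"]

df.drop(df.index[1])        # returns a new frame; df itself is unchanged
rows_after_bare_drop = len(df)

df = df.drop(df.index[1])   # assigning the result back removes the row
rows_after_assignment = len(df)

print(rows_after_bare_drop, rows_after_assignment)  # 2 1
print(df.loc[0, "Action"])  # FDA/OGROP/ORA/CE-FO/CHI-DO/CHI-CB/
```

Even with the assignment in place, dropping rows while iterating and addressing the previous row as i-1 is fragile, which is one reason the grouping approach above is easier to get right.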