Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/39582138/

Date: 2020-09-14 02:02:36  Source: igfitidea

Pandas check if row exist in another dataframe and append index

python pandas

Asked by tupan

I'm having a problem iterating over my dataframe. My current approach takes a very long time, even though I don't have that many rows (about 300k).

What am I trying to do?

  1. Check whether the values of two columns in one DF (A) exist in the other DF (B). You can think of the two columns as a composite key

  2. If True, get the index of DF.B and assign it to a column of DF.A

  3. If False, two steps:

    a. append the two column values that were not found to DF.B

    b. assign the new ID to DF.A (I couldn't get this one to work)

This is my code, where:

  1. df is DF.A and df_id is DF.B

  2. SampleID and ParentID are the two columns whose values I want to look up in both dataframes

  3. Real_ID is the column to which I want to assign the id from DF.B (df_id)

    for index, row in df.iterrows():
        #check if columns exist in the other dataframe
        real_id = df_id[(df_id['SampleID'] == row['SampleID']) & (df_id['ParentID'] == row['ParentID'])]
    
        if real_id.empty:
            #row does not exist, append to df_id
            df_id = df_id.append(row[['SampleID','ParentID']])
        else:
            #row exists, assign id of df_id to df
            row['Real_ID'] = real_id.index
    
EXAMPLE:

DF.A (df)

   Real_ID   SampleID   ParentID  Something AnotherThing
0             20          21          a          b      
1             10          11          a          b      
2             40          51          a          b       

DF.B (df_id)

   SampleID   ParentID  
0    10          11         
1    20          21     

Result:

   Real_ID   SampleID   ParentID  Something AnotherThing
0      1      20          21          a          b      
1      0      10          11          a          b      
2      2      40          51          a          b      


   SampleID   ParentID  
0    10          11         
1    20          21    
2    40          51

Again, this solution is very slow. I'm sure there is a better way to do this, which is why I'm asking here. Unfortunately, this is all I came up with after a few hours...

Thanks

Answered by MaxU

You can do it this way:

Data (pay attention to the index of the B DataFrame):

In [276]: cols = ['SampleID', 'ParentID']

In [277]: A
Out[277]:
   Real_ID  SampleID  ParentID Something AnotherThing
0      NaN        10        11         a            b
1      NaN        20        21         a            b
2      NaN        40        51         a            b

In [278]: B
Out[278]:
   SampleID  ParentID
3        10        11
5        20        21
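
To reproduce this session, the two frames shown above can be built roughly like this. It is only a sketch: the explicit index `[3, 5]` on `B` and the all-NaN `Real_ID` column are taken from the printed output, and the construction itself is an assumption, since the answer does not show how the frames were created.

```python
import numpy as np
import pandas as pd

cols = ['SampleID', 'ParentID']

# DF.A: Real_ID is empty (NaN) before the lookup; scalars are broadcast to every row.
A = pd.DataFrame({
    'Real_ID': np.nan,
    'SampleID': [10, 20, 40],
    'ParentID': [11, 21, 51],
    'Something': 'a',
    'AnotherThing': 'b',
})

# DF.B: note the non-default index labels 3 and 5, which become the IDs.
B = pd.DataFrame({'SampleID': [10, 20], 'ParentID': [11, 21]}, index=[3, 5])
```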

Solution:

In [279]: merged = pd.merge(A[cols], B, on=cols, how='outer', indicator=True)

In [280]: merged
Out[280]:
   SampleID  ParentID     _merge
0        10        11       both
1        20        21       both
2        40        51  left_only


In [281]: B = pd.concat([B, merged.loc[merged._merge=='left_only', cols]])

In [282]: B
Out[282]:
   SampleID  ParentID
3        10        11
5        20        21
2        40        51

In [285]: A['Real_ID'] = pd.merge(A[cols], B.reset_index(), on=cols)['index']

In [286]: A
Out[286]:
   Real_ID  SampleID  ParentID Something AnotherThing
0        3        10        11         a            b
1        5        20        21         a            b
2        2        40        51         a            b
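
The same steps can also be wrapped in a standalone script. The version below is a sketch under a few assumptions rather than part of the original answer: the function name `assign_real_ids` is hypothetical, `B` is assumed to have an integer index and unique key pairs, and new rows get fresh labels continuing after the largest existing label (so the new row is labelled 6 here, not 2 as in the session above) to avoid accidental collisions with labels already in `B`.

```python
import pandas as pd


def assign_real_ids(a, b, keys):
    """Return copies (a_out, b_out): b_out gains the key pairs missing
    from b, and a_out['Real_ID'] holds the matching index label of b_out."""
    # Anti-join: key pairs that occur in `a` but not in `b`.
    merged = a[keys].drop_duplicates().merge(
        b[keys], on=keys, how='left', indicator=True)
    missing = merged.loc[merged['_merge'] == 'left_only', keys].copy()

    # Assumption: integer index on `b`; new labels continue after its maximum.
    start = int(b.index.max()) + 1 if len(b) else 0
    missing.index = range(start, start + len(missing))
    b_out = pd.concat([b, missing])

    # Turn the index of b_out into a regular column named Real_ID.
    lookup = b_out[keys].rename_axis('Real_ID').reset_index()

    # validate= raises if a key pair is duplicated in b_out, which would
    # otherwise change the row count of the merge result.
    ids = a[keys].merge(lookup, on=keys, how='left',
                        validate='many_to_one')['Real_ID']
    a_out = a.copy()
    a_out['Real_ID'] = ids.to_numpy()
    return a_out, b_out


A = pd.DataFrame({'SampleID': [10, 20, 40],
                  'ParentID': [11, 21, 51],
                  'Something': 'a',
                  'AnotherThing': 'b'})
B = pd.DataFrame({'SampleID': [10, 20], 'ParentID': [11, 21]}, index=[3, 5])

A2, B2 = assign_real_ids(A, B, ['SampleID', 'ParentID'])
print(A2)
print(B2)
```

Both merges are vectorized, so this avoids the per-row filtering of the `iterrows` loop in the question.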