pandas: check if a row exists in another DataFrame and append its index

Question

Asked by tupan

I'm having a problem iterating over my dataframe. The way I'm doing it takes a very long time, even though I don't have that many rows (about 300k).

What am I trying to do?

1. Check whether the values of two columns in one DF (A) exist in the other DF (B). You can think of this as a multi-column key.
2. If True, get the index of the matching row in DF.B and assign it to a column of DF.A.
3. If False, two steps:
   a. append the two column values that were not found to DF.B
   b. assign the new ID to DF.A (I couldn't get this one to work)

This is my code, where:

df is DF.A and df_id is DF.B.
SampleID and ParentID are the two columns I want to check for in both dataframes.

Real_ID is the column to which I want to assign the ID from DF.B (df_id).

for index, row in df.iterrows():
    #check if columns exist in the other dataframe
    real_id = df_id[(df_id['SampleID'] == row['SampleID']) & (df_id['ParentID'] == row['ParentID'])]

    if real_id.empty:
        #row does not exist, append to df_id
        df_id = df_id.append(row[['SampleID','ParentID']])
    else:
        #row exists, assign id of df_id to df
        row['Real_ID'] = real_id.index
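
A note on the last line of the loop: assigning to `row` inside `iterrows()` generally does not write back to `df`, because each `row` is a copy, which may be why the ID assignment appeared not to work. The sketch below illustrates this on a small made-up frame (the values 99 and the column contents are placeholders, not data from the question) and shows `df.at` as one way to write to the original frame.

```python
import pandas as pd

# Hypothetical miniature of DF.A with mixed column types.
df = pd.DataFrame({
    "Real_ID": [None, None],
    "SampleID": [20, 10],
    "Something": ["a", "a"],
})

# Assigning to `row` changes only the copy handed out by iterrows().
for index, row in df.iterrows():
    row["Real_ID"] = 99

still_empty = df["Real_ID"].isna().all()

# Writing through the frame itself (here with .at) does stick.
for index, row in df.iterrows():
    df.at[index, "Real_ID"] = index

written_ids = df["Real_ID"].tolist()
```

This only fixes the assignment; a per-row lookup like this is still slow on 300k rows.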

EXAMPLE:

DF.A (df)

   Real_ID   SampleID   ParentID  Something AnotherThing
0             20          21          a          b      
1             10          11          a          b      
2             40          51          a          b

DF.B (df_id)

   SampleID   ParentID  
0    10          11         
1    20          21

Result:

   Real_ID   SampleID   ParentID  Something AnotherThing
0      1      20          21          a          b
1      0      10          11          a          b
2      2      40          51          a          b


   SampleID   ParentID
0    10          11
1    20          21
2    40          51

Again, this solution is very slow. I'm sure there is a better way to do this, which is why I'm asking here. Unfortunately, this is all I came up with after several hours...

Thanks

Answer 1

Answered by MaxU

You can do it this way:

Data (pay attention to the index in the B DataFrame):

In [276]: cols = ['SampleID', 'ParentID']

In [277]: A
Out[277]:
   Real_ID  SampleID  ParentID Something AnotherThing
0      NaN        10        11         a            b
1      NaN        20        21         a            b
2      NaN        40        51         a            b

In [278]: B
Out[278]:
   SampleID  ParentID
3        10        11
5        20        21

Solution:

In [279]: merged = pd.merge(A[cols], B, on=cols, how='outer', indicator=True)

In [280]: merged
Out[280]:
   SampleID  ParentID     _merge
0        10        11       both
1        20        21       both
2        40        51  left_only


In [281]: B = pd.concat([B, merged.loc[merged._merge=='left_only', cols]])

In [282]: B
Out[282]:
   SampleID  ParentID
3        10        11
5        20        21
2        40        51

In [285]: A['Real_ID'] = pd.merge(A[cols], B.reset_index(), on=cols)['index']

In [286]: A
Out[286]:
   Real_ID  SampleID  ParentID Something AnotherThing
0        3        10        11         a            b
1        5        20        21         a            b
2        2        40        51         a            b
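
For reference, the same steps can be collected into one standalone script. This is a minimal sketch, not the exact session above: it rebuilds the sample data by hand and, as an assumption of mine, gives the appended rows fresh index labels that continue after B's largest existing label, so a new row cannot collide with an ID already in B (in the session above the new row keeps the label 2 from `merged`). The names `missing`, `start`, and `lookup` are illustrative.

```python
import pandas as pd

cols = ["SampleID", "ParentID"]

# Sample data rebuilt from the session above.
A = pd.DataFrame({
    "SampleID": [10, 20, 40],
    "ParentID": [11, 21, 51],
    "Something": ["a", "a", "a"],
    "AnotherThing": ["b", "b", "b"],
})
B = pd.DataFrame({"SampleID": [10, 20], "ParentID": [11, 21]}, index=[3, 5])

# 1. Find key pairs that occur in A but not in B.
merged = A[cols].drop_duplicates().merge(B, on=cols, how="left", indicator=True)
missing = merged.loc[merged["_merge"] == "left_only", cols]

# 2. Append them to B with new labels after B's current maximum
#    (assumption: B's index is integer-valued and serves as the ID).
start = B.index.max() + 1 if len(B) else 0
missing.index = range(start, start + len(missing))
B = pd.concat([B, missing])

# 3. Look up each row's ID in B and attach it to A as Real_ID.
lookup = B.reset_index().rename(columns={"index": "Real_ID"})
A = A.merge(lookup, on=cols, how="left")
```

With this data, the pair (40, 51) is appended to B under a new label, and every row of A ends up with the label of its matching row in B. The left join keeps A's row order, which the column assignment in the session above relies on implicitly.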
