Note: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me): Stack Overflow
Original source: http://stackoverflow.com/questions/39582138/
Pandas check if row exist in another dataframe and append index
Asked by tupan
I'm having a problem iterating over my dataframe. The way I'm doing it takes a very long time, and I don't have that many rows (about 300k).
What am I trying to do?
Check whether one DF (A) contains the values of two columns of the other DF (B). You can think of this as a multi-column key.
If True, get the index of DF.B and assign it to a column of DF.A.
If False, two steps:
a. append the two columns that were not found to DF.B
b. assign the new ID to DF.A (I couldn't do this one)
This is my code, where:
df is DF.A and df_id is DF.B.
SampleID and ParentID are the two columns I want to check for in both dataframes.
Real_ID is the column of DF.A that should receive the id (index) of DF.B (df_id).
for index, row in df.iterrows():
    # check if columns exist in the other dataframe
    real_id = df_id[(df_id['SampleID'] == row['SampleID']) & (df_id['ParentID'] == row['ParentID'])]
    if real_id.empty:
        # row does not exist, append to df_id
        df_id = df_id.append(row[['SampleID','ParentID']])
    else:
        # row exists, assign id of df_id to df
        row['Real_ID'] = real_id.index
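A side note on why the assignment in the loop may never show up in df: with columns of mixed dtypes, iterrows() hands back each row as a separate Series, so writing to row does not write to the dataframe. The sketch below illustrates this with made-up toy data (the values 99, 100 and 101 are placeholders, not real IDs) and shows one possible workaround, collecting the values and assigning the whole column after the loop.

```python
import pandas as pd

# Toy frame with mixed dtypes (hypothetical data, not the asker's real set)
df = pd.DataFrame({
    "Real_ID": [None, None],
    "SampleID": [10, 20],
    "Something": ["a", "a"],
})

# Writing to `row` only changes the temporary Series built for this iteration
for index, row in df.iterrows():
    row["Real_ID"] = 99

untouched = df["Real_ID"].isna().all()  # df still has no Real_ID values

# Collect the values instead and assign the column once, after the loop
new_ids = []
for index, row in df.iterrows():
    new_ids.append(index + 100)  # placeholder for the real lookup
df["Real_ID"] = new_ids

written = df["Real_ID"].tolist()
```

This only explains the missing write; it does not make the row-by-row loop any faster.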
EXAMPLE:
DF.A (df)
   Real_ID  SampleID  ParentID Something AnotherThing
0                 20        21         a            b
1                 10        11         a            b
2                 40        51         a            b
DF.B (df_id)
   SampleID  ParentID
0        10        11
1        20        21
Result:
   Real_ID  SampleID  ParentID Something AnotherThing
0        1        10        11         a            b
1        0        20        21         a            b
2        2        40        51         a            b

   SampleID  ParentID
0        20        21
1        10        11
2        40        51
Again, this solution is very slow. I'm sure there is a better way to do this and that's why I'm asking here. Unfortunately this was what I got after some hours...
Thanks
Answered by MaxU
You can do it this way:
Data (pay attention to the index in the B DF):
In [276]: cols = ['SampleID', 'ParentID']

In [277]: A
Out[277]:
   Real_ID  SampleID  ParentID Something AnotherThing
0      NaN        10        11         a            b
1      NaN        20        21         a            b
2      NaN        40        51         a            b

In [278]: B
Out[278]:
   SampleID  ParentID
3        10        11
5        20        21
Solution:
In [279]: merged = pd.merge(A[cols], B, on=cols, how='outer', indicator=True)

In [280]: merged
Out[280]:
   SampleID  ParentID     _merge
0        10        11       both
1        20        21       both
2        40        51  left_only
In [281]: B = pd.concat([B, merged.loc[merged._merge=='left_only', cols]])
In [282]: B
Out[282]:
   SampleID  ParentID
3        10        11
5        20        21
2        40        51

In [285]: A['Real_ID'] = pd.merge(A[cols], B.reset_index(), on=cols)['index']

In [286]: A
Out[286]:
   Real_ID  SampleID  ParentID Something AnotherThing
0        3        10        11         a            b
1        5        20        21         a            b
2        2        40        51         a            b
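
For reference, the steps in the session above can be collected into one function. This is only a sketch under a few assumptions that are not in the original answer: the helper name assign_real_ids is made up, B is assumed to have an integer index and unique key pairs, and new rows get fresh labels after B's current maximum. In the session above the new row instead keeps the label 2 it had in merged, which could collide with a label B already uses.

```python
import pandas as pd

def assign_real_ids(A, B, cols):
    """Hypothetical helper: return (A, B) copies where B gains the key
    pairs it lacked and A['Real_ID'] holds the matching index label of B."""
    # Key pairs that occur in A, flagged by whether B already has them
    keys = A[cols].drop_duplicates()
    merged = keys.merge(B[cols].drop_duplicates(), on=cols,
                        how="left", indicator=True)
    missing = merged.loc[merged["_merge"] == "left_only", cols].copy()

    # Assumption: B has an integer index, so new rows can be labelled
    # with the integers that follow its current maximum
    start = int(B.index.max()) + 1 if len(B) else 0
    missing.index = pd.RangeIndex(start, start + len(missing))
    B_new = pd.concat([B, missing])

    # Turn B's index into a column and look up every row of A;
    # a left merge keeps the row order of A
    lookup = B_new[cols].rename_axis("Real_ID").reset_index()
    ids = A[cols].merge(lookup, on=cols, how="left")["Real_ID"]

    A_new = A.copy()
    A_new["Real_ID"] = ids.to_numpy()  # positional, ignores A's own index
    return A_new, B_new

# Same data as in the session above
cols = ["SampleID", "ParentID"]
A = pd.DataFrame({
    "Real_ID": [float("nan")] * 3,
    "SampleID": [10, 20, 40],
    "ParentID": [11, 21, 51],
    "Something": ["a", "a", "a"],
    "AnotherThing": ["b", "b", "b"],
})
B = pd.DataFrame({"SampleID": [10, 20], "ParentID": [11, 21]}, index=[3, 5])

A_out, B_out = assign_real_ids(A, B, cols)
```

With this data the pair (40, 51) is appended to B under the new label 6, so the last row of A gets Real_ID 6 rather than 2. If B may contain duplicate key pairs, the final merge would return more rows than A has, so that case would need to be handled separately.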