This content comes from a StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/24738732/

Match rows in one Pandas dataframe to another based on three columns

Tags: python, pandas, dataframe

Asked by Alex M-R

I have two Pandas dataframes, one quite large (30000+ rows) and one a lot smaller (100+ rows).

The dfA looks something like:

      X     Y    ONSET_TIME    COLOUR 
0   104    78          1083         6    
1   172    78          1083        16
2   240    78          1083        15 
3   308    78          1083         8
4   376    78          1083         8
5   444    78          1083        14
6   512    78          1083        14
... ...   ...           ...       ...

The dfB looks something like:

    TIME     X     Y
0      7   512   350 
1   1722   512   214 
2   1906   376   214 
3   2095   376   146 
4   2234   308    78 
5   2406   172   146
...  ...   ...   ...  

For every row in dfB, I want to find the row in dfA where the X and Y values are equal and which is the most recent row whose dfA['ONSET_TIME'] is not later than dfB['TIME'], and then return the value of dfA['COLOUR'] for that row.

dfA represents refreshes of a display, where X and Y are the coordinates of items on the display, so they repeat for every different ONSET_TIME (there are 108 pairs of coordinates for each value of ONSET_TIME).

There will be multiple rows where the X and Y in the two dataframes are equal, but I need the one that matches the time too.

I have done this using for loops and if statements just to see that it could be done, but obviously given the size of the dataframes this takes a very long time.

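# 108 coordinate pairs per refresh, so row s + 108 is the same item on the next refresh.
# Rows in the final refresh have no next onset to compare against and are skipped here.
colours = {}
for s in range(len(dfA) - 108):
    for r in range(len(dfB)):
        if (dfB.iloc[r,1] == dfA.iloc[s,0]) and (dfB.iloc[r,2] == dfA.iloc[s,1]) and (dfA.iloc[s,2] <= dfB.iloc[r,0] < dfA.iloc[s+108,2]):
            colours[r] = dfA.iloc[s,3]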

Accepted answer by flyingmeatball

There is probably an even more efficient way to do this, but here is a method without those slow for loops:

import pandas as pd

dfB = pd.DataFrame({'X':[1,2,3],'Y':[1,2,3], 'Time':[10,20,30]})
dfA = pd.DataFrame({'X':[1,1,2,2,2,3],'Y':[1,1,2,2,2,3], 'ONSET_TIME':[5,7,9,16,22,28],'COLOR': ['Red','Blue','Blue','red','Green','Orange']})

#create one single table
mergeDf = pd.merge(dfA, dfB, on = ['X','Y'])
#remove rows where the onset time is not before the time
filteredDf = mergeDf[mergeDf['ONSET_TIME'] < mergeDf['Time']]
#keep the row with the latest onset time (closest to Time) for each X, Y
#(a plain groupby(...).max() would take the max of every column separately,
# so COLOR would not necessarily come from the row with the max ONSET_TIME)
groupedDf = filteredDf.loc[filteredDf.groupby(['X','Y'])['ONSET_TIME'].idxmax()]

print(filteredDf)

   X  Y  ONSET_TIME   COLOR  Time
0  1  1           5     Red    10
1  1  1           7    Blue    10
2  2  2           9    Blue    20
3  2  2          16     red    20
5  3  3          28  Orange    30


print(groupedDf)

   X  Y  ONSET_TIME   COLOR  Time
1  1  1           7    Blue    10
3  2  2          16     red    20
5  3  3          28  Orange    30

The basic idea is to merge the two tables so that the times sit together in one table. Then I filter out the onsets that come after the time and keep the largest remaining onset in each group (the one closest to the time in your dfB). Let me know if you have questions about this.
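
As a side note, not part of the answer above: newer pandas versions provide pd.merge_asof, which does the "same X and Y, latest ONSET_TIME at or before TIME" lookup in one call. The following is only a sketch under assumptions: the frames are small made-up samples that reuse the question's column names, and the `<=` comparison (the merge_asof default) follows the loop in the question rather than the strict `<` used in the filter above.

```python
import pandas as pd

# Small illustrative frames using the question's column names (values are made up).
dfA = pd.DataFrame({
    'X': [308, 308, 308, 172],
    'Y': [78, 78, 78, 146],
    'ONSET_TIME': [1083, 2000, 3000, 1083],
    'COLOUR': [8, 14, 5, 16],
})
dfB = pd.DataFrame({
    'TIME': [2234, 2406, 7],
    'X': [308, 172, 512],
    'Y': [78, 146, 350],
})

# merge_asof requires both frames to be sorted by their time keys.
left = dfB.sort_values('TIME')
right = dfA.sort_values('ONSET_TIME')

# For each dfB row, take the last dfA row with the same X and Y
# whose ONSET_TIME is <= TIME (direction='backward').
matched = pd.merge_asof(
    left, right,
    left_on='TIME', right_on='ONSET_TIME',
    by=['X', 'Y'],
    direction='backward',
)
print(matched[['TIME', 'X', 'Y', 'COLOUR']])
```

Rows of dfB with no earlier refresh at the same coordinates are kept, with NaN in COLOUR, so unmatched rows stay visible instead of silently disappearing as they do with an inner merge.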

Answer by furas

Use merge(). It works like JOIN in SQL, and with it you have the first part done.

d1 = '''      X     Y    ONSET_TIME    COLOUR 
   104    78          1083         6    
   172    78          1083        16
   240    78          1083        15 
   308    78          1083         8
   376    78          1083         8
   444    78          1083        14
   512    78          1083        14
   308    78          3000        14
   308    78          2000        14''' 


d2 = '''    TIME     X     Y
      7   512   350 
   1722   512   214 
   1906   376   214 
   2095   376   146 
   2234   308    78 
   2406   172   146'''

import pandas as pd
from io import StringIO

# parse the whitespace-separated text above into data frames
dfA = pd.read_csv(StringIO(d1), sep=r'\s+')
#print(dfA)

dfB = pd.read_csv(StringIO(d2), sep=r'\s+')
#print(dfB)

# inner join on the coordinates, like JOIN in SQL
df1 = pd.merge(dfA, dfB, on=['X','Y'])
print(df1)

result:

     X   Y  ONSET_TIME  COLOUR  TIME
0  308  78        1083       8  2234
1  308  78        3000      14  2234
2  308  78        2000      14  2234

Then you can use it to filter results.

df2 = df1[ df1['ONSET_TIME'] < df1['TIME'] ]
print(df2)

result:

     X   Y  ONSET_TIME  COLOUR  TIME
0  308  78        1083       8  2234
2  308  78        2000      14  2234
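
The filtered frame can still hold several onsets per dfB row, so one more step is needed to get a single COLOUR each. One possible way to finish, sketched here as an assumption rather than as part of the original answer, is to keep the row with the largest ONSET_TIME in each group. The frame below is rebuilt by hand from the output shown above so the example runs on its own.

```python
import pandas as pd

# Rebuild the filtered frame shown above (df2) so the sketch is self-contained.
df2 = pd.DataFrame({
    'X': [308, 308],
    'Y': [78, 78],
    'ONSET_TIME': [1083, 2000],
    'COLOUR': [8, 14],
    'TIME': [2234, 2234],
}, index=[0, 2])

# Within each (X, Y, TIME) group, keep the row with the largest ONSET_TIME,
# i.e. the most recent refresh before TIME.
latest = df2.loc[df2.groupby(['X', 'Y', 'TIME'])['ONSET_TIME'].idxmax()]
print(latest[['TIME', 'X', 'Y', 'COLOUR']])
```

TIME is included in the grouping keys because dfB may contain the same coordinates at several different times, and each of those rows needs its own match.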