Match rows in one Pandas dataframe to another based on three columns
Original source: http://stackoverflow.com/questions/24738732/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it under the same license, but you must attribute it to the original authors on StackOverflow (not me).
Asked by Alex M-R
I have two Pandas dataframes, one quite large (30000+ rows) and one a lot smaller (100+ rows).
dfA looks something like:
     X   Y  ONSET_TIME  COLOUR
0  104  78        1083       6
1  172  78        1083      16
2  240  78        1083      15
3  308  78        1083       8
4  376  78        1083       8
5  444  78        1083      14
6  512  78        1083      14
...  ...  ...      ...     ...
dfB looks something like:
   TIME    X    Y
0     7  512  350
1  1722  512  214
2  1906  376  214
3  2095  376  146
4  2234  308   78
5  2406  172  146
...  ...  ...  ...
For every row in dfB, I want to find the row in dfA where the values of the X and Y columns are equal and which is the first row where the value of dfB['TIME'] is greater than dfA['ONSET_TIME'], and then return the value of dfA['COLOUR'] for that row.
dfA represents refreshes of a display, where X and Y are the coordinates of items on the display, so they repeat for every different ONSET_TIME (there are 108 pairs of coordinates for each value of ONSET_TIME).
There will be multiple rows where the X and Y in the two dataframes are equal, but I need the one that matches the time too.
I have done this using for loops and if statements just to see that it could be done, but given the size of the dataframes this takes a very long time.
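# Brute-force attempt: compare every row of dfA with every row of dfB
for s in range(0, len(dfA)):
    for r in range(0, len(dfB)):
        # X and Y match, and TIME falls between this refresh and the next one (108 rows later)
        if (dfB.iloc[r,1] == dfA.iloc[s,0]) and (dfB.iloc[r,2] == dfA.iloc[s,1]) and (dfA.iloc[s,2] <= dfB.iloc[r,0] < dfA.iloc[s+108,2]):
            return dfA.iloc[s,3]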
Accepted answer by flyingmeatball
There is probably an even more efficient way to do this, but here is a method without those slow for loops:
import pandas as pd
dfB = pd.DataFrame({'X':[1,2,3],'Y':[1,2,3], 'Time':[10,20,30]})
dfA = pd.DataFrame({'X':[1,1,2,2,2,3],'Y':[1,1,2,2,2,3], 'ONSET_TIME':[5,7,9,16,22,28],'COLOR': ['Red','Blue','Blue','red','Green','Orange']})
# create one single table by joining on the shared X and Y columns
mergeDf = pd.merge(dfA, dfB, left_on = ['X','Y'], right_on = ['X','Y'])
# keep only rows where the onset time is earlier than Time
filteredDf = mergeDf[mergeDf['ONSET_TIME'] < mergeDf['Time']]
# take the max per (X, Y) group, so ONSET_TIME is the one closest to Time
# (max is applied to each column separately, COLOR included)
groupedDf = filteredDf.groupby(['X','Y']).max()
print(filteredDf)
    COLOR  ONSET_TIME  X  Y  Time
0     Red           5  1  1    10
1    Blue           7  1  1    10
2    Blue           9  2  2    20
3     red          16  2  2    20
5  Orange          28  3  3    30
print(groupedDf)
      COLOR  ONSET_TIME  Time
X Y
1 1     Red           7    10
2 2     red          16    20
3 3  Orange          28    30
The basic idea is to merge the two tables so that both times sit together in one table. Then I kept, for each X and Y pair, the records with the largest ONSET_TIME (closest to the time in your dfB). Let me know if you have questions about this.
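One thing to watch in the grouping step: `groupby(...).max()` takes the maximum of each column separately, so in the output above the (1, 1) group reports COLOR 'Red' even though the row with ONSET_TIME 7 is 'Blue'. A minimal sketch of one way to keep whole rows together, assuming the same toy data, is to look up the row with the largest ONSET_TIME via `idxmax`. The names `merged`, `earlier`, `latest_idx` and `latest` are only illustrative, and grouping on Time as well is my own assumption so that several dfB rows sharing the same X and Y stay separate.

```python
import pandas as pd

# Same toy data as in the answer above
dfB = pd.DataFrame({'X': [1, 2, 3], 'Y': [1, 2, 3], 'Time': [10, 20, 30]})
dfA = pd.DataFrame({'X': [1, 1, 2, 2, 2, 3], 'Y': [1, 1, 2, 2, 2, 3],
                    'ONSET_TIME': [5, 7, 9, 16, 22, 28],
                    'COLOR': ['Red', 'Blue', 'Blue', 'red', 'Green', 'Orange']})

# Join on X and Y, then keep only onsets that happened before Time
merged = pd.merge(dfA, dfB, on=['X', 'Y'])
earlier = merged[merged['ONSET_TIME'] < merged['Time']]

# idxmax gives the index label of the row with the largest ONSET_TIME in each group
latest_idx = earlier.groupby(['X', 'Y', 'Time'])['ONSET_TIME'].idxmax()

# Selecting by label keeps COLOR and ONSET_TIME from the same original row
latest = earlier.loc[latest_idx]
print(latest[['X', 'Y', 'Time', 'ONSET_TIME', 'COLOR']])
```

With this selection the (1, 1) group yields 'Blue', the colour that actually belongs to ONSET_TIME 7.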
Answer by furas
Use merge(). It works like JOIN in SQL, and it gets the first part done.
d1 = ''' X Y ONSET_TIME COLOUR
104 78 1083 6
172 78 1083 16
240 78 1083 15
308 78 1083 8
376 78 1083 8
444 78 1083 14
512 78 1083 14
308 78 3000 14
308 78 2000 14'''
d2 = ''' TIME X Y
7 512 350
1722 512 214
1906 376 214
2095 376 146
2234 308 78
2406 172 146'''
import pandas as pd
from io import StringIO
# parse the whitespace-separated text into dataframes
dfA = pd.read_csv(StringIO(d1), sep=r'\s+')
#print(dfA)
dfB = pd.read_csv(StringIO(d2), sep=r'\s+')
#print(dfB)
# inner join on the shared X and Y columns
df1 = pd.merge(dfA, dfB, on=['X','Y'])
print(df1)
result:
     X   Y  ONSET_TIME  COLOUR  TIME
0  308  78        1083       8  2234
1  308  78        3000      14  2234
2  308  78        2000      14  2234
Then you can filter the merged result, keeping only the rows where ONSET_TIME is earlier than TIME.
df2 = df1[ df1['ONSET_TIME'] < df1['TIME'] ]
print(df2)
result:
     X   Y  ONSET_TIME  COLOUR  TIME
0  308  78        1083       8  2234
2  308  78        2000      14  2234
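The filter still leaves more than one candidate row per dfB row, so the latest onset has to be picked afterwards. As a possible alternative that is not part of either answer, `pd.merge_asof` can do the exact match on X and Y and the "most recent ONSET_TIME before TIME" lookup in one step. This is only a sketch on a subset of the sample data above. It assumes a pandas version that provides `merge_asof`, and `left`, `right` and `result` are illustrative names.

```python
import pandas as pd

# A subset of the sample data from the answer above
dfA = pd.DataFrame({'X': [308, 376, 308, 308],
                    'Y': [78, 78, 78, 78],
                    'ONSET_TIME': [1083, 1083, 3000, 2000],
                    'COLOUR': [8, 8, 14, 14]})
dfB = pd.DataFrame({'TIME': [1906, 2234, 2406],
                    'X': [376, 308, 172],
                    'Y': [214, 78, 146]})

# merge_asof requires both frames to be sorted by their time keys
left = dfB.sort_values('TIME')
right = dfA.sort_values('ONSET_TIME')

result = pd.merge_asof(
    left, right,
    left_on='TIME', right_on='ONSET_TIME',
    by=['X', 'Y'],               # X and Y must match exactly
    direction='backward',        # take the latest ONSET_TIME that is not after TIME
    allow_exact_matches=False,   # require ONSET_TIME strictly less than TIME
)
print(result)
```

Rows of dfB with no matching X and Y, or with no earlier onset, come back with NaN in ONSET_TIME and COLOUR. If an onset exactly equal to TIME should count as a match (as in the `<=` of the loop in the question), leave `allow_exact_matches` at its default of True.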

