Matching rows in one Pandas dataframe to another based on three columns

Question

Asked by Alex M-R

I have two Pandas dataframes, one quite large (30000+ rows) and one a lot smaller (100+ rows).

The dfA looks something like:

      X     Y    ONSET_TIME    COLOUR 
0   104    78          1083         6    
1   172    78          1083        16
2   240    78          1083        15 
3   308    78          1083         8
4   376    78          1083         8
5   444    78          1083        14
6   512    78          1083        14
... ...   ...           ...       ...

The dfB looks something like:

    TIME     X     Y
0      7   512   350 
1   1722   512   214 
2   1906   376   214 
3   2095   376   146 
4   2234   308    78 
5   2406   172   146
...  ...   ...   ...

What I want to do, for every row in dfB, is find the row in dfA where the values of the X and Y columns are equal and whose ONSET_TIME is the most recent one before dfB['TIME'] (that is, the display refresh that was on screen at that time), and return the value of dfA['COLOUR'] for that row.

dfA represents refreshes of a display, where X and Y are the coordinates of items on the display, so they repeat for every different ONSET_TIME (there are 108 pairs of coordinates for each value of ONSET_TIME).

There will be multiple rows where the X and Y in the two dataframes are equal, but I need the one that matches the time too.

I have done this using for loops and if statements just to see that it could be done, but obviously given the size of the dataframes this takes a very long time.

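colours = {}  # position of the dfB row -> matching COLOUR from dfA
for s in range(0, len(dfA)):
    # ONSET_TIME of the next refresh (108 rows later); no upper bound for the last refresh
    next_onset = dfA.iloc[s+108,2] if s + 108 < len(dfA) else float('inf')
    for r in range(0, len(dfB)):
        if (dfB.iloc[r,1] == dfA.iloc[s,0]) and (dfB.iloc[r,2] == dfA.iloc[s,1]) and (dfA.iloc[s,2] <= dfB.iloc[r,0] < next_onset):
            colours[r] = dfA.iloc[s,3]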

Answer 1

Accepted answer by flyingmeatball

There is probably an even more efficient way to do this, but here is a method without those slow for loops:

import pandas as pd

dfB = pd.DataFrame({'X':[1,2,3],'Y':[1,2,3], 'Time':[10,20,30]})
dfA = pd.DataFrame({'X':[1,1,2,2,2,3],'Y':[1,1,2,2,2,3], 'ONSET_TIME':[5,7,9,16,22,28],'COLOR': ['Red','Blue','Blue','red','Green','Orange']})

# create one single table
mergeDf = pd.merge(dfA, dfB, on=['X','Y'])
# keep only rows whose onset time is earlier than the dfB time
filteredDf = mergeDf[mergeDf['ONSET_TIME'] < mergeDf['Time']]
# for each dfB row (X, Y, Time), keep the whole row with the latest onset time
groupedDf = filteredDf.loc[filteredDf.groupby(['X','Y','Time'])['ONSET_TIME'].idxmax()]

print(filteredDf)

   X  Y  ONSET_TIME   COLOR  Time
0  1  1           5     Red    10
1  1  1           7    Blue    10
2  2  2           9    Blue    20
3  2  2          16     red    20
5  3  3          28  Orange    30

print(groupedDf)

   X  Y  ONSET_TIME   COLOR  Time
1  1  1           7    Blue    10
3  2  2          16     red    20
5  3  3          28  Orange    30

The basic idea is to merge the two tables so that both times sit together in one table. Then drop the rows whose ONSET_TIME is not earlier than the time from dfB, and for each dfB row keep the record with the largest ONSET_TIME (the one closest to the time in your dfB). Let me know if you have questions about this.
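
The merge, filter, and pick-the-latest steps can also be expressed in one call with pd.merge_asof, which matches each row to the nearest earlier key within groups. The following is only a sketch under assumptions: the toy data is invented for illustration, and allow_exact_matches=False is chosen so that TIME must be strictly greater than ONSET_TIME (drop it if an equal ONSET_TIME should also count).

```python
import pandas as pd

# Toy data, invented for illustration: two refreshes of the same two positions
dfA = pd.DataFrame({
    'X': [104, 172, 104, 172],
    'Y': [78, 78, 78, 78],
    'ONSET_TIME': [1083, 1083, 2000, 2000],
    'COLOUR': [6, 16, 8, 14],
})
dfB = pd.DataFrame({
    'TIME': [1500, 2234, 500],
    'X': [104, 172, 104],
    'Y': [78, 78, 78],
})

# merge_asof requires both frames to be sorted by their time keys;
# reset_index() keeps the original dfB row labels in a column named 'index'
left = dfB.reset_index().sort_values('TIME')
right = dfA.sort_values('ONSET_TIME')

matched = pd.merge_asof(
    left, right,
    left_on='TIME', right_on='ONSET_TIME',
    by=['X', 'Y'],               # X and Y must match exactly
    direction='backward',        # take the latest ONSET_TIME before TIME
    allow_exact_matches=False,   # assumption: strictly ONSET_TIME < TIME
)

# Restore the original dfB row order
matched = matched.sort_values('index').set_index('index').rename_axis(None)
print(matched[['TIME', 'X', 'Y', 'COLOUR']])
```

Rows of dfB with no earlier refresh at the same position (the TIME 500 row here) get NaN in COLOUR, which also turns that column into floats.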

Answer 2

Answered by furas

Use merge(), which works like JOIN in SQL, and you have the first part done.

d1 = '''      X     Y    ONSET_TIME    COLOUR 
   104    78          1083         6    
   172    78          1083        16
   240    78          1083        15 
   308    78          1083         8
   376    78          1083         8
   444    78          1083        14
   512    78          1083        14
   308    78          3000        14
   308    78          2000        14''' 


d2 = '''    TIME     X     Y
      7   512   350 
   1722   512   214 
   1906   376   214 
   2095   376   146 
   2234   308    78 
   2406   172   146'''

import pandas as pd
from io import StringIO

# parse the whitespace-separated text tables defined above
dfA = pd.read_csv(StringIO(d1), sep=r'\s+')
#print(dfA)

dfB = pd.read_csv(StringIO(d2), sep=r'\s+')
#print(dfB)

# inner join on the coordinate columns
df1 = pd.merge(dfA, dfB, on=['X','Y'])
print(df1)

result:

     X   Y  ONSET_TIME  COLOUR  TIME
0  308  78        1083       8  2234
1  308  78        3000      14  2234
2  308  78        2000      14  2234

Then you can use it to filter results.

df2 = df1[ df1['ONSET_TIME'] < df1['TIME'] ]
print(df2)

result:

     X   Y  ONSET_TIME  COLOUR  TIME
0  308  78        1083       8  2234
2  308  78        2000      14  2234
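
The filtered result above can still hold several rows per dfB row, one for each earlier refresh. One possible last step, sketched here and not part of the original answer, is to keep only the latest ONSET_TIME for each dfB row. The frame is rebuilt by hand so the snippet runs on its own, and identifying a dfB row by its (X, Y, TIME) values is an assumption.

```python
import pandas as pd

# Same rows as the filtered result above, rebuilt so the snippet is self-contained
df2 = pd.DataFrame({
    'X': [308, 308],
    'Y': [78, 78],
    'ONSET_TIME': [1083, 2000],
    'COLOUR': [8, 14],
    'TIME': [2234, 2234],
}, index=[0, 2])

# Sort by onset time, then keep the last (latest) onset for each dfB row,
# identified here by its (X, Y, TIME) values (an assumption)
latest = (df2.sort_values('ONSET_TIME')
             .drop_duplicates(subset=['X', 'Y', 'TIME'], keep='last'))

print(latest[['X', 'Y', 'TIME', 'COLOUR']])
```

For the sample data this leaves a single row, the one with ONSET_TIME 2000 and COLOUR 14.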
