Python Pandas Merge (pd.merge): How to set the index and join

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/14341805/

Time: 2020-08-18 11:07:51  Source: igfitidea

Pandas Merge (pd.merge) How to set the index and join

python pandas

Asked by user1911092

I have two pandas dataframes: dfLeft and dfRight with the date as the index.

dfLeft:

            cusip    factorL
date  
2012-01-03    XXXX      4.5
2012-01-03    YYYY      6.2
....
2012-01-04    XXXX      4.7
2012-01-04    YYYY      6.1
....

dfRight:

            idc__id    factorR
date  
2012-01-03    XXXX      5.0
2012-01-03    YYYY      6.0
....
2012-01-04    XXXX      5.1
2012-01-04    YYYY      6.2

Both have a shape close to (121900,3)

I tried the following merge:

test = pd.merge(dfLeft, dfRight, left_index=True, right_index=True, left_on='cusip', right_on='idc__id', how = 'inner')

This gave test a shape of (60643500, 6).

Any recommendations on what is going wrong here? I want the merge to match on both date and cusip/idc__id. Note: in this example the cusips line up, but in reality that may not be the case.

Thanks.

Expected Output test:

             cusip    factorL    factorR
date  
2012-01-03    XXXX      4.5          5.0
2012-01-03    YYYY      6.2          6.0
....
2012-01-04    XXXX      4.7          5.1
2012-01-04    YYYY      6.1          6.2

Accepted answer by Andy Hayden

You could append 'cusip' and 'idc__id' as index levels to your DataFrames before you join (here's how it would work on the first couple of rows):

In [10]: dfLeft
Out[10]: 
            cusip  factorL
date
2012-01-03   XXXX      4.5
2012-01-03   YYYY      6.2

In [11]: dfL1 = dfLeft.set_index('cusip', append=True)

In [12]: dfR1 = dfRight.set_index('idc__id', append=True)

In [13]: dfL1
Out[13]: 
                  factorL
date       cusip
2012-01-03 XXXX       4.5
           YYYY       6.2

In [14]: dfL1.join(dfR1)
Out[14]: 
                  factorL  factorR
date       cusip
2012-01-03 XXXX       4.5      5.0
           YYYY       6.2      6.0
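
For reference, a minimal self-contained sketch of this MultiIndex approach, using made-up frames that mirror the question's sample rows. One assumption worth flagging: the right-hand key is renamed to 'cusip' here so that both indexes carry identical level names, since recent pandas versions appear to align MultiIndex levels by name when joining. The rename step is not part of the original answer.

```python
import pandas as pd

# Toy data mirroring the question's sample rows (values are illustrative)
dates = pd.to_datetime(["2012-01-03", "2012-01-03", "2012-01-04", "2012-01-04"])
dfLeft = pd.DataFrame(
    {"cusip": ["XXXX", "YYYY", "XXXX", "YYYY"], "factorL": [4.5, 6.2, 4.7, 6.1]},
    index=pd.Index(dates, name="date"),
)
dfRight = pd.DataFrame(
    {"idc__id": ["XXXX", "YYYY", "XXXX", "YYYY"], "factorR": [5.0, 6.0, 5.1, 6.2]},
    index=pd.Index(dates, name="date"),
)

# Append the identifier as a second index level on both sides.
# Assumption: the right key is renamed so both level names are ('date', 'cusip').
dfL1 = dfLeft.set_index("cusip", append=True)
dfR1 = dfRight.rename(columns={"idc__id": "cusip"}).set_index("cusip", append=True)

# Join on the full (date, cusip) index, keeping only matching pairs
joined = dfL1.join(dfR1, how="inner")

# Move 'cusip' back to a regular column so 'date' is the only index again
result = joined.reset_index("cusip")
print(result)
```

The final reset_index("cusip") is only there to get back to the layout the question asked for, with date as the index and cusip as an ordinary column.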

Answer by Theodros Zelleke

Reset the indices and then merge on multiple (column-)keys:

dfLeft.reset_index(inplace=True)
dfRight.reset_index(inplace=True)
dfMerged = pd.merge(dfLeft, dfRight,
              left_on=['date', 'cusip'],
              right_on=['date', 'idc__id'],
              how='inner')

You can then set 'date' back as the index:

dfMerged.set_index('date', inplace=True)

Here's an example:

raw1 = '''
2012-01-03    XXXX      4.5
2012-01-03    YYYY      6.2
2012-01-04    XXXX      4.7
2012-01-04    YYYY      6.1
'''

raw2 = '''
2012-01-03    XYXX      45.
2012-01-03    YYYY      62.
2012-01-04    XXXX      -47.
2012-01-05    YYYY      61.
'''

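import pandas as pd
from io import StringIO  # Python 3 location; Python 2 used the StringIO module


# Parse the whitespace-separated text; skiprows=1 drops the leading blank line
df1 = pd.read_table(StringIO(raw1), header=None,
                    sep=r'\s+', parse_dates=[0], skiprows=1)
df2 = pd.read_table(StringIO(raw2), header=None,
                    sep=r'\s+', parse_dates=[0], skiprows=1)

# Both value columns are named 'factorL', so merge adds _x/_y suffixes
df1.columns = ['date', 'cusip', 'factorL']
df2.columns = ['date', 'idc__id', 'factorL']

# Keep only rows whose (date, identifier) pair appears in both frames
print(pd.merge(df1, df2,
               left_on=['date', 'cusip'],
               right_on=['date', 'idc__id'],
               how='inner'))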

which gives

                  date cusip  factorL_x idc__id  factorL_y
0  2012-01-03 00:00:00  YYYY        6.2    YYYY         62
1  2012-01-04 00:00:00  XXXX        4.7    XXXX        -47
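
As a possible explanation for the row count in the question, here is a small sketch with hypothetical toy data (three ids on two dates) comparing a merge on the date index alone against a merge on both keys. It assumes the original call effectively matched on date only, which would pair every left row with every right row sharing that date. The validate argument is an optional safeguard, not something either answer used.

```python
import pandas as pd

# Hypothetical toy data: 2 dates x 3 ids, so 6 rows per frame
ids = ["AAAA", "BBBB", "CCCC"]
dates = pd.to_datetime(["2012-01-03", "2012-01-04"])
index = pd.Index(dates.repeat(len(ids)), name="date")

left = pd.DataFrame({"cusip": ids * 2, "factorL": list(range(6))}, index=index)
right = pd.DataFrame({"idc__id": ids * 2, "factorR": list(range(6))}, index=index)

# Matching on the (non-unique) date index alone: every row of a given date
# pairs with every row of the same date on the other side
by_date = pd.merge(left, right, left_index=True, right_index=True, how="inner")

# Matching on both date and identifier; validate raises if keys are duplicated
by_both = pd.merge(
    left.reset_index(), right.reset_index(),
    left_on=["date", "cusip"], right_on=["date", "idc__id"],
    how="inner", validate="one_to_one",
)

print(len(by_date))  # 18 rows: 2 dates * 3 * 3
print(len(by_both))  # 6 rows: one per (date, id) pair
```

With several hundred ids per date, the same per-date cross product would be consistent with the roughly 500-fold growth reported in the question.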