
Note: this page reproduces a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/40570143/


Pandas join on columns with different names

Tags: python, sql, pandas, merge

Asked by runningbirds

I have two different data frames that I want to perform some SQL-style operations on. Unfortunately, as is often the case with the data I'm working with, the column names are spelled differently between them.

See below for an example of what I thought the syntax would look like, where userid belongs to df1 and username belongs to df2. Can anyone help me out?

 # not working - I assume some syntax issue?
pd.merge(df1, df2, on = [['userid'=='username', 'column1']], how = 'left')

Answered by Boud

When the names are different, use the left_on and right_on parameters instead of on=:

pd.merge(df1, df2, left_on=  ['userid', 'column1'],
                   right_on= ['username', 'column1'], 
                   how = 'left')
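
To make the call above concrete, here is a minimal runnable sketch. The sample rows and the score and city columns are hypothetical; only the names userid, username and column1 come from the question.

```python
import pandas as pd

# Hypothetical sample data; only the key column names come from the question.
df1 = pd.DataFrame({'userid': ['ann', 'bob', 'cy'],
                    'column1': [1, 2, 3],
                    'score': [10, 20, 30]})
df2 = pd.DataFrame({'username': ['ann', 'bob'],
                    'column1': [1, 9],
                    'city': ['Oslo', 'Rome']})

# Keys are paired by position: userid with username, column1 with column1.
merged = pd.merge(df1, df2,
                  left_on=['userid', 'column1'],
                  right_on=['username', 'column1'],
                  how='left')
print(merged)
```

Only the first row finds a partner in df2, because both keys have to match. The other two rows are kept by how='left' and get NaN in the columns coming from df2.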

Answered by aichao

An alternative approach is to use join, setting the index of the right hand side DataFrame to the columns ['username', 'column1']:

df1.join(df2.set_index(['username', 'column1']), on=['userid', 'column1'], how='left')

The output of this join merges the matched keys from the two differently named key columns, userid and username, into a single column named after the key column of df1, userid, whereas the output of merge keeps the two as separate columns. To illustrate, consider the following example:

import numpy as np
import pandas as pd

df1 = pd.DataFrame({'ID': [1,2,3,4,5,6], 'pID' : [21,22,23,24,25,26], 'Values' : [435,33,45,np.nan,np.nan,12]})
##    ID  Values  pID
## 0   1   435.0   21
## 1   2    33.0   22
## 2   3    45.0   23
## 3   4     NaN   24
## 4   5     NaN   25
## 5   6    12.0   26

df2 = pd.DataFrame({'ID' : [4,4,5], 'pid' : [24,25,25], 'Values' : [544, 545, 676]})
##    ID  Values  pid
## 0   4     544   24
## 1   4     545   25
## 2   5     676   25

pd.merge(df1, df2, how='left', left_on=['ID', 'pID'], right_on=['ID', 'pid'])
##    ID  Values_x  pID  Values_y   pid
## 0   1     435.0   21       NaN   NaN
## 1   2      33.0   22       NaN   NaN
## 2   3      45.0   23       NaN   NaN
## 3   4       NaN   24     544.0  24.0
## 4   5       NaN   25     676.0  25.0
## 5   6      12.0   26       NaN   NaN

df1.join(df2.set_index(['ID','pid']), how='left', on=['ID','pID'], lsuffix='_x', rsuffix='_y')
##    ID  Values_x  pID  Values_y
## 0   1     435.0   21       NaN
## 1   2      33.0   22       NaN
## 2   3      45.0   23       NaN
## 3   4       NaN   24     544.0
## 4   5       NaN   25     676.0
## 5   6      12.0   26       NaN

Here, we also need to specify lsuffix and rsuffix in join to distinguish the overlapping column Values in the output. As one can see, the output of merge contains the extra pid column from the right hand side DataFrame, which IMHO is unnecessary given the context of the merge. Note also that the dtype of the pid column has changed to float64, which results from upcasting due to the NaNs introduced for the unmatched rows.
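
Both observations can be checked directly. The sketch below reuses the small frames from the example. Dropping pid after the merge is my own addition rather than part of the original answer, shown only to confirm that the two results then carry the same columns.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2, 3, 4, 5, 6],
                    'pID': [21, 22, 23, 24, 25, 26],
                    'Values': [435, 33, 45, np.nan, np.nan, 12]})
df2 = pd.DataFrame({'ID': [4, 4, 5],
                    'pid': [24, 25, 25],
                    'Values': [544, 545, 676]})

merged = pd.merge(df1, df2, how='left',
                  left_on=['ID', 'pID'], right_on=['ID', 'pid'])

# pid holds integers in df2; the NaNs for unmatched rows force a float dtype.
print(df2['pid'].dtype, '->', merged['pid'].dtype)

# Assumption: dropping the right-hand key is acceptable for your use case.
tidy = merged.drop(columns='pid')

joined = df1.join(df2.set_index(['ID', 'pid']), on=['ID', 'pID'],
                  how='left', lsuffix='_x', rsuffix='_y')
print(sorted(tidy.columns) == sorted(joined.columns))
```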

This cleaner output comes at a cost in performance, as the call to set_index on the right hand side DataFrame incurs some overhead. However, a quick and dirty profile shows that this is not too horrible: the join version is roughly 30% slower than merge, which may be worth it:

sz = 1000000 # one million rows
df1 = pd.DataFrame({'ID': np.arange(sz), 'pID' : np.arange(0,2*sz,2), 'Values' : np.random.random(sz)})
df2 = pd.DataFrame({'ID': np.concatenate([np.arange(sz/2),np.arange(sz/2)]), 'pid' : np.arange(0,2*sz,2), 'Values' : np.random.random(sz)})

%timeit pd.merge(df1, df2, how='left', left_on=['ID', 'pID'], right_on=['ID', 'pid'])
## 818 ms ± 33.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit df1.join(df2.set_index(['ID','pid']), how='left', on=['ID','pID'], lsuffix='_x', rsuffix='_y')
## 1.04 s ± 18.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
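
If the extra key column is the only concern, a third option that neither answer mentions is to rename the right-hand key so that a plain on= works. This is a sketch on hypothetical toy data and I have not timed it, so no performance claim is implied.

```python
import pandas as pd

# Hypothetical toy frames using the same column names as the example above.
df1 = pd.DataFrame({'ID': [4, 5, 6],
                    'pID': [24, 25, 26],
                    'Values': [1.0, 2.0, 3.0]})
df2 = pd.DataFrame({'ID': [4, 4, 5],
                    'pid': [24, 25, 25],
                    'Values': [544, 545, 676]})

# Rename pid to pID so both frames share the key names, then merge with on=.
renamed = df2.rename(columns={'pid': 'pID'})
result = pd.merge(df1, renamed, how='left', on=['ID', 'pID'])
print(result)
```

Because the key names now coincide, the result contains a single pID column, and the overlapping Values columns receive the default _x and _y suffixes.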