Pandas join on columns with different names
Disclaimer: this page is an English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same license and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/40570143/
Question by runningbirds
I have two different data frames that I want to perform some SQL-style operations on. Unfortunately, as is often the case with the data I'm working with, the spelling of the column names differs between them.
See the example below for what I thought the syntax would look like, where userid belongs to df1 and username belongs to df2. Can anyone help me out?
# not working - I assume some syntax issue?
pd.merge(df1, df2, on = [['userid'=='username', 'column1']], how = 'left')
Answer by Boud
When the names are different, use the xxx_on parameters (left_on and right_on) instead of on=:
pd.merge(df1, df2,
         left_on=['userid', 'column1'],
         right_on=['username', 'column1'],
         how='left')
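As a minimal, self-contained sketch of this answer (the frames and values here are made up for illustration; only the column names userid, username, and column1 come from the question):

```python
import pandas as pd

# Hypothetical frames: the user-key column is named differently in each.
df1 = pd.DataFrame({'userid': ['a', 'b', 'c'],
                    'column1': [1, 2, 3],
                    'score': [10, 20, 30]})
df2 = pd.DataFrame({'username': ['a', 'b', 'x'],
                    'column1': [1, 2, 9],
                    'city': ['NY', 'LA', 'SF']})

# left_on names the key columns in df1, right_on the ones in df2.
out = pd.merge(df1, df2,
               left_on=['userid', 'column1'],
               right_on=['username', 'column1'],
               how='left')
print(out)
```

Since column1 has the same name on both sides, it appears once in the output; userid and username remain separate columns, and unmatched left rows get NaN in the right-hand columns.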
Answer by aichao
An alternative approach is to use join, setting the index of the right-hand-side DataFrame to the columns ['username', 'column1']:
df1.join(df2.set_index(['username', 'column1']), on=['userid', 'column1'], how='left')
The output of this join merges the matched keys from the two differently named key columns, userid and username, into a single column named after the key column of df1, userid; whereas the output of the merge maintains the two as separate columns. To illustrate, consider the following example:
import numpy as np
import pandas as pd
df1 = pd.DataFrame({'ID': [1,2,3,4,5,6], 'pID' : [21,22,23,24,25,26], 'Values' : [435,33,45,np.nan,np.nan,12]})
## ID Values pID
## 0 1 435.0 21
## 1 2 33.0 22
## 2 3 45.0 23
## 3 4 NaN 24
## 4 5 NaN 25
## 5 6 12.0 26
df2 = pd.DataFrame({'ID' : [4,4,5], 'pid' : [24,25,25], 'Values' : [544, 545, 676]})
## ID Values pid
## 0 4 544 24
## 1 4 545 25
## 2 5 676 25
pd.merge(df1, df2, how='left', left_on=['ID', 'pID'], right_on=['ID', 'pid'])
## ID Values_x pID Values_y pid
## 0 1 435.0 21 NaN NaN
## 1 2 33.0 22 NaN NaN
## 2 3 45.0 23 NaN NaN
## 3 4 NaN 24 544.0 24.0
## 4 5 NaN 25 676.0 25.0
## 5 6 12.0 26 NaN NaN
df1.join(df2.set_index(['ID','pid']), how='left', on=['ID','pID'], lsuffix='_x', rsuffix='_y')
## ID Values_x pID Values_y
## 0 1 435.0 21 NaN
## 1 2 33.0 22 NaN
## 2 3 45.0 23 NaN
## 3 4 NaN 24 544.0
## 4 5 NaN 25 676.0
## 5 6 12.0 26 NaN
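A third option, not in the original answers but offered here as a hedged sketch, is to rename the right-hand key column before merging; the keys then collapse into one column, as with join, but without the set_index step:

```python
import numpy as np
import pandas as pd

# Same data as the example above.
df1 = pd.DataFrame({'ID': [1, 2, 3, 4, 5, 6],
                    'pID': [21, 22, 23, 24, 25, 26],
                    'Values': [435, 33, 45, np.nan, np.nan, 12]})
df2 = pd.DataFrame({'ID': [4, 4, 5],
                    'pid': [24, 25, 25],
                    'Values': [544, 545, 676]})

# Rename pid -> pID so both key names match; merge then keeps one key column.
out = pd.merge(df1, df2.rename(columns={'pid': 'pID'}),
               how='left', on=['ID', 'pID'], suffixes=('_x', '_y'))
print(out)
```

The output has the same four columns (ID, pID, Values_x, Values_y) as the join, and the rename is essentially free compared with building an index.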
Here, we also need to specify lsuffix and rsuffix in join to distinguish the overlapping column Values in the output. As one can see, the output of merge contains the extra pid column from the right-hand-side DataFrame, which IMHO is unnecessary given the context of the merge. Note also that the dtype of the pid column has changed to float64, which results from upcasting due to the NaNs introduced by the unmatched rows.
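If that float64 upcast of pid is undesirable, one workaround (a sketch, assuming pandas >= 0.24, where the nullable Int64 extension dtype is available) is to convert the column back after the merge; Int64 can hold missing values without falling back to float:

```python
import numpy as np
import pandas as pd

# Same data as the example above.
df1 = pd.DataFrame({'ID': [1, 2, 3, 4, 5, 6],
                    'pID': [21, 22, 23, 24, 25, 26],
                    'Values': [435, 33, 45, np.nan, np.nan, 12]})
df2 = pd.DataFrame({'ID': [4, 4, 5],
                    'pid': [24, 25, 25],
                    'Values': [544, 545, 676]})

out = pd.merge(df1, df2, how='left',
               left_on=['ID', 'pID'], right_on=['ID', 'pid'])
# Convert pid back to a nullable integer dtype; NaN becomes pd.NA.
out['pid'] = out['pid'].astype('Int64')
print(out['pid'].dtype)
```

Matched rows keep exact integer values, and unmatched rows show <NA> instead of NaN.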
This cleaner output is gained at a cost in performance, as the call to set_index on the right-hand-side DataFrame incurs some overhead. However, a quick-and-dirty profile shows that the penalty is not too horrible, roughly 30%, which may be worth it:
sz = 1000000 # one million rows
df1 = pd.DataFrame({'ID': np.arange(sz), 'pID' : np.arange(0,2*sz,2), 'Values' : np.random.random(sz)})
df2 = pd.DataFrame({'ID': np.concatenate([np.arange(sz//2), np.arange(sz//2)]), 'pid' : np.arange(0,2*sz,2), 'Values' : np.random.random(sz)})
%timeit pd.merge(df1, df2, how='left', left_on=['ID', 'pID'], right_on=['ID', 'pid'])
## 818 ms ± 33.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit df1.join(df2.set_index(['ID','pid']), how='left', on=['ID','pID'], lsuffix='_x', rsuffix='_y')
## 1.04 s ± 18.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)