Pandas join on columns with different names
Disclaimer: this page is an English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same license and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/40570143/
Question by runningbirds
I have two different data frames that I want to perform some SQL-style operations on. Unfortunately, as is often the case with the data I'm working with, the spelling of the column names differs between them.
See the example below for what I thought the syntax would look like, where userid belongs to df1 and username belongs to df2. Can anyone help me out?
# not working - I assume some syntax issue?
pd.merge(df1, df2, on = [['userid'=='username', 'column1']], how = 'left')
Answer by Boud
When the names are different, use the xxx_on parameters (left_on and right_on) instead of on=:
pd.merge(df1, df2,
         left_on=['userid', 'column1'],
         right_on=['username', 'column1'],
         how='left')
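As a minimal, self-contained sketch of this answer (the frames and values here are made up for illustration; only the column names userid, username, and column1 come from the question):

```python
import pandas as pd

# Hypothetical frames: the user-key column is named differently in each.
df1 = pd.DataFrame({'userid': ['a', 'b', 'c'],
                    'column1': [1, 2, 3],
                    'score': [10, 20, 30]})
df2 = pd.DataFrame({'username': ['a', 'b', 'x'],
                    'column1': [1, 2, 9],
                    'city': ['NY', 'LA', 'SF']})

# left_on names the key columns in df1, right_on the ones in df2.
out = pd.merge(df1, df2,
               left_on=['userid', 'column1'],
               right_on=['username', 'column1'],
               how='left')
print(out)
```

Since column1 has the same name on both sides, it appears once in the output; userid and username remain separate columns, and unmatched left rows get NaN in the right-hand columns.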
Answer by aichao
An alternative approach is to use join, setting the index of the right-hand-side DataFrame to the columns ['username', 'column1']:
df1.join(df2.set_index(['username', 'column1']), on=['userid', 'column1'], how='left')
The output of this join merges the matched keys from the two differently named key columns, userid and username, into a single column named after the key column of df1, userid; whereas the output of the merge maintains the two as separate columns. To illustrate, consider the following example:
import numpy as np
import pandas as pd
df1 = pd.DataFrame({'ID': [1,2,3,4,5,6], 'pID' : [21,22,23,24,25,26], 'Values' : [435,33,45,np.nan,np.nan,12]})
## ID Values pID
## 0 1 435.0 21
## 1 2 33.0 22
## 2 3 45.0 23
## 3 4 NaN 24
## 4 5 NaN 25
## 5 6 12.0 26
df2 = pd.DataFrame({'ID' : [4,4,5], 'pid' : [24,25,25], 'Values' : [544, 545, 676]})
## ID Values pid
## 0 4 544 24
## 1 4 545 25
## 2 5 676 25
pd.merge(df1, df2, how='left', left_on=['ID', 'pID'], right_on=['ID', 'pid'])
## ID Values_x pID Values_y pid
## 0 1 435.0 21 NaN NaN
## 1 2 33.0 22 NaN NaN
## 2 3 45.0 23 NaN NaN
## 3 4 NaN 24 544.0 24.0
## 4 5 NaN 25 676.0 25.0
## 5 6 12.0 26 NaN NaN
df1.join(df2.set_index(['ID','pid']), how='left', on=['ID','pID'], lsuffix='_x', rsuffix='_y')
## ID Values_x pID Values_y
## 0 1 435.0 21 NaN
## 1 2 33.0 22 NaN
## 2 3 45.0 23 NaN
## 3 4 NaN 24 544.0
## 4 5 NaN 25 676.0
## 5 6 12.0 26 NaN
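A third option, not in the original answers but offered here as a hedged sketch, is to rename the right-hand key column before merging; the keys then collapse into one column, as with join, but without the set_index step:

```python
import numpy as np
import pandas as pd

# Same data as the example above.
df1 = pd.DataFrame({'ID': [1, 2, 3, 4, 5, 6],
                    'pID': [21, 22, 23, 24, 25, 26],
                    'Values': [435, 33, 45, np.nan, np.nan, 12]})
df2 = pd.DataFrame({'ID': [4, 4, 5],
                    'pid': [24, 25, 25],
                    'Values': [544, 545, 676]})

# Rename pid -> pID so both key names match; merge then keeps one key column.
out = pd.merge(df1, df2.rename(columns={'pid': 'pID'}),
               how='left', on=['ID', 'pID'], suffixes=('_x', '_y'))
print(out)
```

The output has the same four columns (ID, pID, Values_x, Values_y) as the join, and the rename is essentially free compared with building an index.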
Here, we also need to specify lsuffix and rsuffix in join to distinguish the overlapping column Values in the output. As one can see, the output of merge contains the extra pid column from the right-hand-side DataFrame, which IMHO is unnecessary given the context of the merge. Note also that the dtype of the pid column has changed to float64, which results from upcasting due to the NaNs introduced by the unmatched rows.
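If that float64 upcast of pid is undesirable, one workaround (a sketch, assuming pandas >= 0.24, where the nullable Int64 extension dtype is available) is to convert the column back after the merge; Int64 can hold missing values without falling back to float:

```python
import numpy as np
import pandas as pd

# Same data as the example above.
df1 = pd.DataFrame({'ID': [1, 2, 3, 4, 5, 6],
                    'pID': [21, 22, 23, 24, 25, 26],
                    'Values': [435, 33, 45, np.nan, np.nan, 12]})
df2 = pd.DataFrame({'ID': [4, 4, 5],
                    'pid': [24, 25, 25],
                    'Values': [544, 545, 676]})

out = pd.merge(df1, df2, how='left',
               left_on=['ID', 'pID'], right_on=['ID', 'pid'])
# Convert pid back to a nullable integer dtype; NaN becomes pd.NA.
out['pid'] = out['pid'].astype('Int64')
print(out['pid'].dtype)
```

Matched rows keep exact integer values, and unmatched rows show <NA> instead of NaN.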
This cleaner output is gained at a cost in performance, as the call to set_index on the right-hand-side DataFrame incurs some overhead. However, a quick-and-dirty profile shows that the penalty is not too horrible, roughly 30%, which may be worth it:
sz = 1000000 # one million rows
df1 = pd.DataFrame({'ID': np.arange(sz), 'pID' : np.arange(0,2*sz,2), 'Values' : np.random.random(sz)})
df2 = pd.DataFrame({'ID': np.concatenate([np.arange(sz//2), np.arange(sz//2)]), 'pid' : np.arange(0,2*sz,2), 'Values' : np.random.random(sz)})
%timeit pd.merge(df1, df2, how='left', left_on=['ID', 'pID'], right_on=['ID', 'pid'])
## 818 ms ± 33.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit df1.join(df2.set_index(['ID','pid']), how='left', on=['ID','pID'], lsuffix='_x', rsuffix='_y')
## 1.04 s ± 18.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)