Python pandas - particular merge/replacement

Disclaimer: this page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow, original URL: http://stackoverflow.com/questions/25762902/

Time: 2020-09-13 22:26:54  Source: igfitidea

Python pandas - particular merge/replacement

python pandas

Asked by Colonel Beauvel

I'm new to pandas operations, and I have these two DataFrames:

import pandas as pd 

df = pd.DataFrame({'name': ['a','a','b','b','c','c'], 'id':[1,2,1,2,1,2], 'val1':[0,0,0,0,0,0],'val2':[0,0,0,0,0,0],'val3':[0,0,0,0,0,0]})

   id name  val1  val2  val3
0   1    a     0     0     0
1   2    a     0     0     0
2   1    b     0     0     0
3   2    b     0     0     0
4   1    c     0     0     0
5   2    c     0     0     0

subdf = pd.DataFrame({'name': ['a','b','c'], 'id':[1,1,2],'val1':[0.3,0.4,0.7], 'val2':[4,5,4]})

   id name  val1  val2
0   1    a   0.3     4
1   1    b   0.4     5
2   2    c   0.7     4   

I would like to obtain this output:

   id name  val1  val2  val3
0   1    a   0.3     4     0
1   2    a   0.0     0     0
2   1    b   0.4     5     0
3   2    b   0.0     0     0
4   1    c   0.0     0     0
5   2    c   0.7     4     0

But I could not find an example of replacement in the tutorials I saw, only additions of columns/rows!

Answered by EdChum

This takes a couple of steps. First, left merge on the columns that match; this creates '_x' and '_y' suffixed columns where the column names clash:

In [25]:

merged = df.merge(subdf, on=['id', 'name'], how='left')
merged
Out[25]:
   id name  val1_x  val2_x  val3  val1_y  val2_y
0   1    a       0       0     0     0.3       4
1   2    a       0       0     0     NaN     NaN
2   1    b       0       0     0     0.4       5
3   2    b       0       0     0     NaN     NaN
4   1    c       0       0     0     NaN     NaN
5   2    c       0       0     0     0.7       4
In [26]:
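import numpy as np

# take the values of interest from the clashing columns
# (max skips NaN; this works here because every new value is >= the original 0)
merged['val1'] = np.max(merged[['val1_x', 'val1_y']], axis=1)
merged['val2'] = np.max(merged[['val2_x', 'val2_y']], axis=1)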
merged
Out[26]:
   id name  val1_x  val2_x  val3  val1_y  val2_y  val1  val2
0   1    a       0       0     0     0.3       4   0.3     4
1   2    a       0       0     0     NaN     NaN   0.0     0
2   1    b       0       0     0     0.4       5   0.4     5
3   2    b       0       0     0     NaN     NaN   0.0     0
4   1    c       0       0     0     NaN     NaN   0.0     0
5   2    c       0       0     0     0.7       4   0.7     4
In [27]:
# drop the additional columns
merged = merged.drop(labels=['val1_x', 'val1_y','val2_x', 'val2_y'], axis=1)
merged
Out[27]:
   id name  val3  val1  val2
0   1    a     0   0.3     4
1   2    a     0   0.0     0
2   1    b     0   0.4     5
3   2    b     0   0.0     0
4   1    c     0   0.0     0
5   2    c     0   0.7     4
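
The merge steps above pick the larger of the two clashing values, which only amounts to a replacement because the original values are all 0 and the new ones are positive. A minimal sketch of a variant that always prefers the value from subdf, falling back to the original where there was no match, might look like this. The '_new' suffix and the loop are illustrative choices, not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['a', 'a', 'b', 'b', 'c', 'c'],
    'id': [1, 2, 1, 2, 1, 2],
    'val1': [0, 0, 0, 0, 0, 0],
    'val2': [0, 0, 0, 0, 0, 0],
    'val3': [0, 0, 0, 0, 0, 0],
})
subdf = pd.DataFrame({
    'name': ['a', 'b', 'c'],
    'id': [1, 1, 2],
    'val1': [0.3, 0.4, 0.7],
    'val2': [4, 5, 4],
})

# Keep the original column names on the left; tag the incoming ones with '_new'
# (the suffix is an arbitrary, illustrative choice).
merged = df.merge(subdf, on=['id', 'name'], how='left', suffixes=('', '_new'))

for col in ['val1', 'val2']:
    # Prefer the value from subdf; where there was no match (NaN), keep the original.
    merged[col] = merged[col + '_new'].fillna(merged[col])
    merged = merged.drop(columns=col + '_new')
```

Because the left merge introduces NaN for unmatched rows, both replaced columns come out as floats in this sketch.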

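Another method would be to sort both DataFrames on 'id' and 'name' and then call update. Note that update aligns rows on the index labels, not on the sorted row order, and it also overwrites the shared 'id' and 'name' columns, so the output here does not match the desired result (rows 1 and 2 are overwritten); for a key-based update, the key columns need to be in the index: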

In [30]:

# DataFrame.sort was removed in pandas 0.20; on newer versions use sort_values(by=['id', 'name'])
df = df.sort(columns=['id','name'])
subdf = subdf.sort(columns=['id','name'])
df.update(subdf)
df
Out[30]:
   id name  val1  val2  val3
0   1    a   0.3     4     0
2   2    c   0.7     4     0
4   1    c   0.0     0     0
1   1    b   0.4     5     0
3   2    b   0.0     0     0
5   2    c   0.0     0     0
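
To see why sorting alone is not enough, here is a small sketch of how update matches rows. The frames left and right and the column names key and val are made up for illustration; the point is only that update pairs rows by index label, so the key columns have to be moved into the index for a key-based update.

```python
import pandas as pd

# Hypothetical toy data, not from the question
left = pd.DataFrame({'key': ['x', 'y', 'z'], 'val': [0.0, 0.0, 0.0]})
right = pd.DataFrame({'key': ['z'], 'val': [9.0]})  # its only row has index label 0

# Plain update: the row labelled 0 in `left` is overwritten, key included
by_label = left.copy()
by_label.update(right)

# Keyed update: with 'key' as the index, only the matching 'z' row changes
by_key = left.set_index('key')
by_key.update(right.set_index('key'))
by_key = by_key.reset_index()

print(by_label)
print(by_key)
```

In the first case the 'x' row is turned into a second 'z' row; in the second case only the 'z' row receives the new value.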

Answered by famaral42

Updated version with the update method. Inspired by Nic.

I managed to do it with concat, but it is not as elegant as the version below with update, and the DataFrames are copied, which I believe could cause memory and/or speed problems with bigger tables.

df = pd.DataFrame({'name': list('aabbcc'), 'id':[1,2]*3, 'val1':[0]*6,'val2':[0]*6,'val3':[0]*6})

subdf = pd.DataFrame({'name': list('abc'), 'id':[1,1,2],'val1':[0.3,0.4,0.7], 'val2':[4,5,4]})

df.set_index(['name','id'], inplace=True)
df.update(subdf.set_index(['name','id']))
df.reset_index(inplace=True)
df

Result:

    name    id  val1    val2    val3
0   a       1   0.3     4.0     0
1   a       2   0.0     0.0     0
2   b       1   0.4     5.0     0
3   b       2   0.0     0.0     0
4   c       1   0.0     0.0     0
5   c       2   0.7     4.0     0

A minor drawback, pointed out by JAB, is that pandas.DataFrame.update changes the dtypes (val2 becomes float in the result above).
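
One way to keep the dtypes under control is to choose them explicitly around the update call rather than leaving them as a side effect. The following is a minimal sketch under the assumption that val1 should end up as float and val2 as int; the keys variable and the two astype calls are illustrative additions, not part of the original answer, and the exact dtype that update itself leaves behind may differ between pandas versions.

```python
import pandas as pd

df = pd.DataFrame({'name': list('aabbcc'), 'id': [1, 2] * 3,
                   'val1': [0] * 6, 'val2': [0] * 6, 'val3': [0] * 6})
subdf = pd.DataFrame({'name': list('abc'), 'id': [1, 1, 2],
                      'val1': [0.3, 0.4, 0.7], 'val2': [4, 5, 4]})

keys = ['name', 'id']  # illustrative helper variable

# val1 is about to receive fractional values, so make it float up front
df = df.astype({'val1': 'float64'}).set_index(keys)
df.update(subdf.set_index(keys))

# val2 only holds whole numbers here, so casting it (back) to int is lossless
df = df.astype({'val2': 'int64'}).reset_index()
```

The cast back to int is only safe when every value in the column is a whole number and none is missing, which holds for this data.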

Answered by KuboAndTwoStrings

The sort function used in the second part of the answer above has been deprecated. For users of pandas 0.20+, the code to achieve the same effect is:

import pandas as pd

df1 = pd.DataFrame(columns=['A', 'B'])  # placeholder: you want to merge TO this
df2 = pd.DataFrame(columns=['A', 'B'])  # placeholder: you want to merge FROM this
df1 = df1.sort_values(by=['A', 'B'])
df2 = df2.sort_values(by=['A', 'B'])
# update matches rows by index label, not by sort order
df1.update(df2)

Refer to: Pandas Documentation