Source: Stack Overflow, http://stackoverflow.com/questions/24768657/
License: provided under CC BY-SA 4.0. You are free to use and share it, but you must attribute it to the original authors (not me) and keep the same license.

Replace column values based on another dataframe python pandas - better way?

Tags: python, pandas

Asked by user3314418

Note: for simplicity's sake, I'm using a toy example, because copy/pasting dataframes is difficult on Stack Overflow (please let me know if there's an easy way to do this).

Is there a way to merge the values from one dataframe onto another without getting the _X, _Y columns? I'd like the values in one column to replace all zero values in another column.

df1:

Name   Nonprofit   Business   Education
X      1           1          0
Y      0           1          0    <- Y and Z have zero values for Nonprofit and Educ
Z      0           0          0
Y      0           1          0

df2:

Name   Nonprofit   Education
Y      1           1    <- this df has the correct values.
Z      1           1



pd.merge(df1, df2, on='Name', how='outer')

Name   Nonprofit_X   Business   Education_X   Nonprofit_Y   Education_Y
Y      1             1          1             1             1
Y      1             1          1             1             1
X      1             1          0             nan           nan
Z      1             1          1             1             1

In a previous post, I tried combine_first and dropna(), but these don't do the job.

I want to replace zeros in df1 with the values in df2. Furthermore, I want all rows with the same name to be changed according to df2.

Name    Nonprofit     Business    Education
Y        1             1           1
Y        1             1           1 
X        1             1           0
Z        1             0           1

(To clarify: the value in the 'Business' column where Name = Z should be 0.)

My existing solution does the following: I subset df1 based on the names that exist in df2, and then replace those values with the correct value. However, I'd like a less hacky way to do this.

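# From the asker's real data: ORGS, Public, str_to_regex and searchnamesre
# are not defined in the toy example above.
pubunis_df = df2
sdf = df1

regex = str_to_regex(', '.join(pubunis_df.ORGS))

pubunis = searchnamesre(sdf, 'ORGS', regex)

# .ix has been removed from pandas; .loc does the same label-based assignment here
sdf.loc[pubunis.index, ['Education', 'Public']] = 1
searchnamesre(sdf, 'ORGS', regex)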

Accepted answer by EdChum

Use the boolean mask from isin to filter the df and assign the desired row values from the rhs df (here df is the question's df1 and df1 is the question's df2):

In [27]:

df.loc[df.Name.isin(df1.Name), ['Nonprofit', 'Education']] = df1[['Nonprofit', 'Education']]
df
Out[27]:
  Name  Nonprofit  Business  Education
0    X          1         1          0
1    Y          1         1          1
2    Z          1         0          1
3    Y          1         1          1

[4 rows x 4 columns]

Answer by KSD

For In [27], this is the correct version:

df.loc[df.Name.isin(df1.Name), ['Nonprofit', 'Education']] = df1[['Nonprofit', 'Education']].values

df
Out[27]:
  Name  Nonprofit  Business  Education
0    X          1         1          0
1    Y          1         1          1
2    Z          1         0          1
3    Y          1         1          1

[4 rows x 4 columns]

The above works only when every row in df1 also exists in df. In other words, df must be a superset of df1.

If df1 contains rows that have no match in df (in other words, df is not a superset of df1), use the following instead:

df.loc[df.Name.isin(df1.Name), ['Nonprofit', 'Education']] = \
    df1.loc[df1.Name.isin(df.Name), ['Nonprofit', 'Education']].values

Answer by Jeremy Z

Attention: in the latest version of pandas, neither of the answers above works anymore:

KSD's answer raises an error:

import pandas as pd

df1 = pd.DataFrame([["X", 1, 1, 0],
                    ["Y", 0, 1, 0],
                    ["Z", 0, 0, 0],
                    ["Y", 0, 1, 0]], columns=["Name", "Nonprofit", "Business", "Education"])

df2 = pd.DataFrame([["Y", 1, 1],
                    ["Z", 1, 1]], columns=["Name", "Nonprofit", "Education"])

df1.loc[df1.Name.isin(df2.Name), ['Nonprofit', 'Education']] = df2.loc[df2.Name.isin(df1.Name),['Nonprofit', 'Education']].values

df1.loc[df1.Name.isin(df2.Name), ['Nonprofit', 'Education']] = df2[['Nonprofit', 'Education']].values

Out[851]:
ValueError: shape mismatch: value array of shape (2,) could not be broadcast to indexing result of shape (3,)

and EdChum's answer gives the wrong result:

df1.loc[df1.Name.isin(df2.Name), ['Nonprofit', 'Education']] = df2[['Nonprofit', 'Education']]

df1
Out[852]: 
  Name  Nonprofit  Business  Education
0    X        1.0         1        0.0
1    Y        1.0         1        1.0
2    Z        NaN         0        NaN
3    Y        NaN         1        NaN

It works safely only if the values in the 'Name' column are unique and sorted the same way in both data frames, because the assignment aligns rows by index label rather than by Name.
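To see why the NaN values appear, it may help to compare the index labels on both sides of the assignment. The sketch below is not taken from any of the answers: it shows the label mismatch and then, as one possible alternative, looks values up by Name with Series.map. The names mask, lookup and fixed are illustrative, and the lookup assumes Name is unique in df2.

```python
import pandas as pd

df1 = pd.DataFrame([["X", 1, 1, 0],
                    ["Y", 0, 1, 0],
                    ["Z", 0, 0, 0],
                    ["Y", 0, 1, 0]], columns=["Name", "Nonprofit", "Business", "Education"])
df2 = pd.DataFrame([["Y", 1, 1],
                    ["Z", 1, 1]], columns=["Name", "Nonprofit", "Education"])

mask = df1["Name"].isin(df2["Name"])

# Rows selected in df1 carry index labels 1, 2, 3; df2 only has labels 0 and 1.
selected_labels = list(df1.index[mask])
rhs_labels = list(df2.index)

# Only the shared labels can receive a value when the right-hand side is a DataFrame.
shared = sorted(set(selected_labels) & set(rhs_labels))

# Alternative sketch: look values up by Name so row order no longer matters.
# Assumes Name is unique in df2 (map needs a unique lookup index).
lookup = df2.set_index("Name")
fixed = df1.copy()
for col in ["Nonprofit", "Education"]:
    # Names missing from df2 map to NaN and fall back to the original value
    fixed[col] = fixed["Name"].map(lookup[col]).fillna(fixed[col]).astype(int)

print(shared)
print(fixed)
```

Only one label is shared between the two sides, which matches the single Y row that was filled correctly in the output above.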

Here is my answer:

Way 1:

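# df2 supplies Nonprofit and Education; Business exists only in df1, so it gets no suffix
df1 = df1.merge(df2, on='Name', how='left')
# Prefer the df2 value (_y); fall back to the df1 value (_x) where there was no match
df1['Nonprofit_y'] = df1['Nonprofit_y'].fillna(df1['Nonprofit_x'])
df1['Education_y'] = df1['Education_y'].fillna(df1['Education_x'])
df1.drop(['Education_x', 'Nonprofit_x'], inplace=True, axis=1)
df1.rename(columns={'Education_y': 'Education', 'Nonprofit_y': 'Nonprofit'}, inplace=True)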
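The merge-then-fill idea in Way 1 can be wrapped so that the shared columns are discovered instead of typed out. This is only a sketch: overwrite_from is a hypothetical helper name, and it assumes the key column is unique in the right-hand frame (duplicates there would multiply rows in the left merge).

```python
import pandas as pd


def overwrite_from(left, right, key):
    """Hypothetical helper: overwrite columns of `left` with matching values from `right`.

    Assumes `key` values are unique in `right`.
    """
    shared = [c for c in right.columns if c != key and c in left.columns]
    # An empty suffix keeps the left-hand names; right-hand copies get "_new"
    merged = left.merge(right[[key] + shared], on=key, how="left", suffixes=("", "_new"))
    for col in shared:
        # Prefer the value from `right`; fall back to `left` where there was no match
        merged[col] = merged[col + "_new"].fillna(merged[col])
    return merged.drop(columns=[col + "_new" for col in shared])


df1 = pd.DataFrame([["X", 1, 1, 0],
                    ["Y", 0, 1, 0],
                    ["Z", 0, 0, 0],
                    ["Y", 0, 1, 0]], columns=["Name", "Nonprofit", "Business", "Education"])
df2 = pd.DataFrame([["Y", 1, 1],
                    ["Z", 1, 1]], columns=["Name", "Nonprofit", "Education"])

result = overwrite_from(df1, df2, "Name")
print(result)
```

Because the left-hand columns keep their names, no rename step is needed and the original column order of df1 is preserved. The filled columns may come back as floats, since unmatched rows pass through NaN.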

Way 2:

df1 = df1.set_index('Name')
df2 = df2.set_index('Name')
df1.update(df2)
df1.reset_index(inplace=True)

More on update: the columns used as the index do not need to have the same name in both data frames before calling update; you could use 'Name1' and 'Name2'. It also works when df2 contains extra rows that have no match in df1; those rows simply do not update df1. In other words, df2 does not need to be a superset of df1.

Example:

df1 = pd.DataFrame([["X",1,1,0],
              ["Y",0,1,0],
              ["Z",0,0,0],
              ["Y",0,1,0]],columns=["Name1","Nonprofit","Business", "Education"])    

df2 = pd.DataFrame([["Y",1,1],
              ["Z",1,1],
              ['U',1,3]],columns=["Name2","Nonprofit", "Education"])   

df1 = df1.set_index('Name1')
df2 = df2.set_index('Name2')


df1.update(df2)

Result:

      Nonprofit  Business  Education
Name1                                
X           1.0         1        0.0
Y           1.0         1        1.0
Z           1.0         0        1.0
Y           1.0         1        1.0
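Two properties of update are worth seeing in isolation: rows of df2 with no match in df1 are ignored, and a missing value (NaN) in df2 leaves the df1 value untouched. The sketch below uses made-up data with unique names in both frames to keep the behaviour easy to follow; the NaN for Z is an assumption added for illustration and is not part of the original example.

```python
import pandas as pd

df1 = pd.DataFrame({"Name": ["X", "Y", "Z"],
                    "Nonprofit": [1, 0, 0],
                    "Business": [1, 1, 0],
                    "Education": [0, 0, 0]}).set_index("Name")

# "U" has no match in df1; the Nonprofit value for "Z" is deliberately missing
df2 = pd.DataFrame({"Name": ["Y", "Z", "U"],
                    "Nonprofit": [1, None, 1],
                    "Education": [1, 1, 3]}).set_index("Name")

# update modifies df1 in place and returns None
df1.update(df2)

result = df1.reset_index()
print(result)
```

Here Z keeps its original Nonprofit value because df2 has nothing to offer for it, and no row for U is added.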

Answer by AVK

df2.set_index('Name').combine_first(df1.set_index('Name')).reset_index()
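As a rough illustration of what this one-liner does: combine_first keeps the caller's values wherever they are present and fills the rest from the argument, so calling it on df2 lets df2 win for every cell it has, including cells where df1 is nonzero. The sketch below assumes unique names in both frames to keep it simple; by_name is just an illustrative lookup, and the row and column order of the result may differ from df1.

```python
import pandas as pd

df1 = pd.DataFrame([["X", 1, 1, 0],
                    ["Y", 0, 1, 0],
                    ["Z", 0, 0, 0]], columns=["Name", "Nonprofit", "Business", "Education"])
df2 = pd.DataFrame([["Y", 1, 1],
                    ["Z", 1, 1]], columns=["Name", "Nonprofit", "Education"])

# df2 is the caller: its values take priority, gaps are filled from df1
result = df2.set_index("Name").combine_first(df1.set_index("Name")).reset_index()

# Look rows up by Name rather than relying on the order of the result
by_name = result.set_index("Name")
print(result)
```

Business is absent from df2, so it is taken entirely from df1, and the X row, which df2 does not contain, is carried over unchanged.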