Merging two Excel files by ID and combining columns with the same name (python, pandas)

Notice: this page reproduces a popular StackOverflow question and its answer under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/24001360/

Time: 2020-09-13 22:06:51  Source: igfitidea

Tags: python, excel, merge, pandas

Asked by ferrios25

I am new to Stack Overflow and to pandas for Python. I found part of my answer in the post "Looking to merge two Excel files by ID into one Excel file using Python 2.7".

However, I also want to combine the columns that have the same name in both Excel files. I thought the following post would have my answer, but I guess it's not titled correctly: "Merging Pandas DataFrames with the same column name".

Right now I have this code:

import pandas as pd

file1 = pd.read_excel("file1.xlsx")
file2 = pd.read_excel("file2.xlsx")

file3 = file1.merge(file2, on="ID", how="outer")

file3.to_excel("merged.xlsx")

file1.xlsx

ID,JanSales,FebSales,test
1,100,200,cars
2,200,500,
3,300,400,boats

file2.xlsx

ID,CreditScore,EMMAScore,test
2,good,Watson,planes
3,okay,Thompson,
4,not-so-good,NA,

What I get in merged.xlsx:

ID,JanSales,FebSales,test_x,CreditScore,EMMAScore,test_y
1,100,200,cars,NaN,NaN,
2,200,500,,good,Watson,planes
3,300,400,boats,okay,Thompson,
4,NaN,NaN,,not-so-good,NaN,

What I want in merged.xlsx:

ID,JanSales,FebSales,CreditScore,EMMAScore,test
1,100,200,NaN,NaN,cars
2,200,500,good,Watson,planes
3,300,400,okay,Thompson,boats
4,NaN,NaN,not-so-good,NaN,NaN

In my real data, there are 200+ columns that correspond to the "test" column in my example. I want the program to find these columns with the same names in both file1.xlsx and file2.xlsx and combine them in the merged file.

Answered by EdChum

OK, here is a more dynamic way. After merging, we assume that name clashes will occur and produce columns named 'column_name_x' and 'column_name_y'.

So first, work out the column names common to both frames and remove 'ID' from that list:

In [51]:

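# df1 and df2 are the DataFrames read from file1.xlsx and file2.xlsx
common_columns = list(set(df1.columns) & set(df2.columns))
common_columns.remove('ID')  # 'ID' is the merge key, not a clashing column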
common_columns
Out[51]:
['test']
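
As a hedged variation on the step above: a set intersection returns the shared names in arbitrary order, which may matter with 200+ columns. The sketch below keeps the column order of the first frame instead. The in-memory frames stand in for file1.xlsx and file2.xlsx (values copied from the question), and the `key` variable is an assumption made for illustration.

```python
import pandas as pd

# In-memory stand-ins for file1.xlsx and file2.xlsx (values from the question).
df1 = pd.DataFrame({
    "ID": [1, 2, 3],
    "JanSales": [100, 200, 300],
    "FebSales": [200, 500, 400],
    "test": ["cars", None, "boats"],
})
df2 = pd.DataFrame({
    "ID": [2, 3, 4],
    "CreditScore": ["good", "okay", "not-so-good"],
    "EMMAScore": ["Watson", "Thompson", None],
    "test": ["planes", None, None],
})

key = "ID"  # assumed name of the merge key

# Walk df1's columns in order and keep those that also exist in df2,
# skipping the merge key itself.
common_columns = [c for c in df1.columns if c in df2.columns and c != key]
print(common_columns)  # ['test']
```

The result is the same list as the set-based version for this data; only the ordering guarantee differs.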

Now we can iterate over this list to create each combined column, using Series.where to take the '_x' value when it is not null and the '_y' value otherwise.

In [59]:

for col in common_columns:
    df3[col] = df3[col+'_x'].where(df3[col+'_x'].notnull(), df3[col+'_y'])
df3
Out[59]:
   ID  JanSales  FebSales test_x  CreditScore EMMAScore  test_y    test
0   1       100       200   cars          NaN       NaN     NaN    cars
1   2       200       500    NaN         good    Watson  planes  planes
2   3       300       400  boats         okay  Thompson     NaN   boats
3   4       NaN       NaN    NaN  not-so-good       NaN     NaN     NaN

[4 rows x 8 columns]
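
To make the fallback rule concrete, here is a small self-contained sketch of the same `where` call on a hand-built frame (the frame is a made-up stand-in for the merged result, not the answerer's actual data). It also shows `Series.combine_first`, which should give the same values for two columns that share an index. Note that when both '_x' and '_y' hold a value, the '_x' value wins in either form.

```python
import pandas as pd

# Hand-built stand-in for the clashing columns of the merged frame.
df3 = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "test_x": ["cars", None, "boats", None],
    "test_y": [None, "planes", None, None],
})

# Form used in the answer: keep test_x where it is not null, else take test_y.
via_where = df3["test_x"].where(df3["test_x"].notnull(), df3["test_y"])

# Alternative spelling of the same rule: fill gaps in test_x from test_y.
via_combine_first = df3["test_x"].combine_first(df3["test_y"])

print(via_where)
print(via_combine_first)
```

Rows 0 to 2 end up with a value from one side or the other, and row 3 stays missing because neither column has a value there.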

Then, to finish off, drop all the extra suffixed columns:

In [68]:

clash_names = [elt + suffix for elt in common_columns for suffix in ('_x', '_y')]
clash_names
df3.drop(labels=clash_names, axis=1, inplace=True)
df3
Out[68]:
   ID  JanSales  FebSales  CreditScore EMMAScore    test
0   1       100       200          NaN       NaN    cars
1   2       200       500         good    Watson  planes
2   3       300       400         okay  Thompson   boats
3   4       NaN       NaN  not-so-good       NaN     NaN

[4 rows x 6 columns]

The snippet above comes from this post: "Prepend prefix to list elements with list comprehension".
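
Putting the three steps of the answer together, a minimal end-to-end sketch might look like the following. The function name `merge_and_coalesce` and its `key` parameter are hypothetical, chosen only for illustration, and the frames are built in memory rather than read with `pd.read_excel` so the example runs without any files. It assumes no existing column already ends in '_x' or '_y'.

```python
import pandas as pd

def merge_and_coalesce(df1, df2, key="ID"):
    """Outer-merge two frames on `key` and fold same-named columns into one.

    Hypothetical helper: where both frames hold a value, df1's value wins.
    """
    # Step 1: names shared by both frames, excluding the merge key.
    common = [c for c in df1.columns if c in df2.columns and c != key]
    # Step 2: merge; clashing names come back with the '_x' / '_y' suffixes.
    merged = df1.merge(df2, on=key, how="outer", suffixes=("_x", "_y"))
    for col in common:
        left, right = merged[col + "_x"], merged[col + "_y"]
        merged[col] = left.where(left.notnull(), right)
    # Step 3: drop the suffixed helper columns.
    clash_names = [col + suffix for col in common for suffix in ("_x", "_y")]
    return merged.drop(columns=clash_names)

# In-memory stand-ins for file1.xlsx and file2.xlsx (values from the question).
df1 = pd.DataFrame({
    "ID": [1, 2, 3],
    "JanSales": [100, 200, 300],
    "FebSales": [200, 500, 400],
    "test": ["cars", None, "boats"],
})
df2 = pd.DataFrame({
    "ID": [2, 3, 4],
    "CreditScore": ["good", "okay", "not-so-good"],
    "EMMAScore": ["Watson", "Thompson", None],
    "test": ["planes", None, None],
})

result = merge_and_coalesce(df1, df2)
print(result)

# Writing the result back to Excel, as in the question, would be:
# result.to_excel("merged.xlsx", index=False)
```

Returning a new frame rather than dropping columns in place keeps the inputs untouched, which makes the helper easier to reuse on the real 200+ column data.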