pandas 合并两个熊猫数据框会导致“重复”列

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/27313647/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-09-13 22:44:38  来源:igfitidea点击:

Merging two pandas dataframes results in "duplicate" columns

pythonpandas

提问by larslovlie

I'm trying to merge two dataframes which contain the same key column. Some of the other columns also have identical headers, although not an equal number of rows, and after merging these columns are "duplicated" with the original headers given a postscript _x, _y, etc.

我正在尝试合并包含相同关键列的两个数据帧。其他一些列也有相同的标题,尽管行数不同,合并后这些列与原始标题“复制”,给出后记 _x、_y 等。

Does anyone know how to get pandas to drop the duplicate columns in the example below?

有谁知道如何让Pandas删除下面示例中的重复列?

This is my python code:

这是我的python代码:

import pandas as pd

holding_df = pd.read_csv('holding.csv')
invest_df = pd.read_csv('invest.csv')

merge_df = pd.merge(holding_df, invest_df, on='key', how='left').fillna(0)
merge_df.to_csv('merged.csv', index=False)

And the CSV files contain this:

CSV 文件包含以下内容:

First rows of left-dataframe (holding_df)

左数据框的第一行(holding_df)

key, dept_name, res_name, year, need, holding
DeptA_ResA_2015, DeptA, ResA, 2015, 1, 1
DeptA_ResA_2016, DeptA, ResA, 2016, 1, 1
DeptA_ResA_2017, DeptA, ResA, 2017, 1, 1
...

Right-dataframe (invest_df)

右数据框 (invest_df)

key, dept_name, res_name, year, no_of_inv, inv_cost_wo_ice
DeptA_ResA_2015, DeptA, ResA, 2015, 1, 1000000
DeptA_ResB_2015, DeptA, ResB, 2015, 2, 6000000
DeptB_ResB_2015, DeptB, ResB, 2015, 1, 6000000
...

Merged result

合并结果

key, dept_name_x, res_name_x, year_x, need, holding, dept_name_y, res_name_y, year_y, no_of_inv, inv_cost_wo_ice
DeptA_ResA_2015, DeptA, ResA, 2015, 1, 1, DeptA, ResA, 2015.0, 1.0, 1000000.0
DeptA_ResA_2016, DeptA, ResA, 2016, 1, 1, 0, 0, 0.0, 0.0, 0.0
DeptA_ResA_2017, DeptA, ResA, 2017, 1, 1, 0, 0, 0.0, 0.0, 0.0
DeptA_ResA_2018, DeptA, ResA, 2018, 1, 1, 0, 0, 0.0, 0.0, 0.0
DeptA_ResA_2019, DeptA, ResA, 2019, 1, 1, 0, 0, 0.0, 0.0, 0.0
...

采纳答案by EdChum

The reason you have additional columns with suffixes '_x' and '_y' is because the columns you are merging do not have matching values so this clash produces additional columns. In that case you need to drop the additional '_y' columns and rename the '_x' columns:

您有带有后缀“_x”和“_y”的附加列的原因是因为您要合并的列没有匹配的值,因此这种冲突会产生附加列。在这种情况下,您需要删除额外的 '_y' 列并重命名 '_x' 列:

In [145]:
# define our drop function
def drop_y(df):
    # list comprehension of the cols that end with '_y'
    to_drop = [x for x in df if x.endswith('_y')]
    df.drop(to_drop, axis=1, inplace=True)

drop_y(merged)
merged
Out[145]:
               key  dept_name_x  res_name_x   year_x   need   holding  \
0  DeptA_ResA_2015        DeptA        ResA     2015      1         1   
1  DeptA_ResA_2016        DeptA        ResA     2016      1         1   
2  DeptA_ResA_2017        DeptA        ResA     2017      1         1   

    no_of_inv   inv_cost_wo_ice  
0           1           1000000  
1           0                 0  
2           0                 0  
In [146]:
# func to rename '_x' cols
def rename_x(df):
    for col in df:
        if col.endswith('_x'):
            df.rename(columns={col:col.rstrip('_x')}, inplace=True)
rename_x(merged)
merged
Out[146]:
               key  dept_name  res_name   year   need   holding   no_of_inv  \
0  DeptA_ResA_2015      DeptA      ResA   2015      1         1           1   
1  DeptA_ResA_2016      DeptA      ResA   2016      1         1           0   
2  DeptA_ResA_2017      DeptA      ResA   2017      1         1           0   

    inv_cost_wo_ice  
0           1000000  
1                 0  
2                 0 

EDITIf you added the common columns to your merge then it shouldn't produce the duplicated columns unless the matches on those columns do not match:

编辑如果您将公共列添加到合并中,则除非这些列上的匹配项不匹配,否则不应生成重复的列:

merge_df = pd.merge(holding_df, invest_df, on=['key', 'dept_name', 'res_name', 'year'], how='left').fillna(0)

回答by desmond

I have the same problem with duplicate columns after left joins even when the columns' data is identical. I did a query and found out that NaN values are considered different even if both columns are NaN in pandas 0.14. BUT once you upgrade to 0.15, this problem disappears, which explains why it later works for you, you probably upgraded.

即使列的数据相同,我在左连接后也遇到重复列的相同问题。我做了一个查询,发现即使 Pandas 0.14 中的两列都是 NaN,NaN 值也被认为是不同的。但是一旦你升级到 0.15,这个问题就会消失,这就解释了为什么它后来对你有用,你可能升级了。

回答by fenandosr

Not exactly the answer, but pd.mergeprovides an argument to help you decide which suffixesshould be added to your overlapping columns:

不完全是答案,但pd.merge提供了一个参数来帮助您决定应将哪些后缀添加到重叠列中:

merge_df = pd.merge(holding_df, invest_df, on='key', how='left', suffixes=('_holding', '_invest')).fillna(0)

More meaningful names could be helpful if you decide to keep both (or to check why the columns are kept).

如果您决定保留两者(或检查保留列的原因),更有意义的名称可能会有所帮助。

See documentationfor more reference.

请参阅文档以获取更多参考。