pandas 合并两个熊猫数据框会导致“重复”列
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/27313647/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
Merging two pandas dataframes results in "duplicate" columns
提问by larslovlie
I'm trying to merge two dataframes which contain the same key column. Some of the other columns also have identical headers, although not an equal number of rows, and after merging these columns are "duplicated" with the original headers given a postscript _x, _y, etc.
我正在尝试合并包含相同关键列的两个数据帧。其他一些列也有相同的标题,尽管行数不同,合并后这些列与原始标题“复制”,给出后记 _x、_y 等。
Does anyone know how to get pandas to drop the duplicate columns in the example below?
有谁知道如何让Pandas删除下面示例中的重复列?
This is my python code:
这是我的python代码:
import pandas as pd
holding_df = pd.read_csv('holding.csv')
invest_df = pd.read_csv('invest.csv')
merge_df = pd.merge(holding_df, invest_df, on='key', how='left').fillna(0)
merge_df.to_csv('merged.csv', index=False)
And the CSV files contain this:
CSV 文件包含以下内容:
First rows of left-dataframe (holding_df)
左数据框的第一行(holding_df)
key, dept_name, res_name, year, need, holding
DeptA_ResA_2015, DeptA, ResA, 2015, 1, 1
DeptA_ResA_2016, DeptA, ResA, 2016, 1, 1
DeptA_ResA_2017, DeptA, ResA, 2017, 1, 1
...
Right-dataframe (invest_df)
右数据框 (invest_df)
key, dept_name, res_name, year, no_of_inv, inv_cost_wo_ice
DeptA_ResA_2015, DeptA, ResA, 2015, 1, 1000000
DeptA_ResB_2015, DeptA, ResB, 2015, 2, 6000000
DeptB_ResB_2015, DeptB, ResB, 2015, 1, 6000000
...
Merged result
合并结果
key, dept_name_x, res_name_x, year_x, need, holding, dept_name_y, res_name_y, year_y, no_of_inv, inv_cost_wo_ice
DeptA_ResA_2015, DeptA, ResA, 2015, 1, 1, DeptA, ResA, 2015.0, 1.0, 1000000.0
DeptA_ResA_2016, DeptA, ResA, 2016, 1, 1, 0, 0, 0.0, 0.0, 0.0
DeptA_ResA_2017, DeptA, ResA, 2017, 1, 1, 0, 0, 0.0, 0.0, 0.0
DeptA_ResA_2018, DeptA, ResA, 2018, 1, 1, 0, 0, 0.0, 0.0, 0.0
DeptA_ResA_2019, DeptA, ResA, 2019, 1, 1, 0, 0, 0.0, 0.0, 0.0
...
采纳答案by EdChum
The reason you have additional columns with suffixes '_x' and '_y' is because the columns you are merging do not have matching values so this clash produces additional columns. In that case you need to drop the additional '_y' columns and rename the '_x' columns:
您有带有后缀“_x”和“_y”的附加列的原因是因为您要合并的列没有匹配的值,因此这种冲突会产生附加列。在这种情况下,您需要删除额外的 '_y' 列并重命名 '_x' 列:
In [145]:
# define our drop function
def drop_y(df):
# list comprehension of the cols that end with '_y'
to_drop = [x for x in df if x.endswith('_y')]
df.drop(to_drop, axis=1, inplace=True)
drop_y(merged)
merged
Out[145]:
key dept_name_x res_name_x year_x need holding \
0 DeptA_ResA_2015 DeptA ResA 2015 1 1
1 DeptA_ResA_2016 DeptA ResA 2016 1 1
2 DeptA_ResA_2017 DeptA ResA 2017 1 1
no_of_inv inv_cost_wo_ice
0 1 1000000
1 0 0
2 0 0
In [146]:
# func to rename '_x' cols
def rename_x(df):
for col in df:
if col.endswith('_x'):
df.rename(columns={col:col.rstrip('_x')}, inplace=True)
rename_x(merged)
merged
Out[146]:
key dept_name res_name year need holding no_of_inv \
0 DeptA_ResA_2015 DeptA ResA 2015 1 1 1
1 DeptA_ResA_2016 DeptA ResA 2016 1 1 0
2 DeptA_ResA_2017 DeptA ResA 2017 1 1 0
inv_cost_wo_ice
0 1000000
1 0
2 0
EDITIf you added the common columns to your merge then it shouldn't produce the duplicated columns unless the matches on those columns do not match:
编辑如果您将公共列添加到合并中,则除非这些列上的匹配项不匹配,否则不应生成重复的列:
merge_df = pd.merge(holding_df, invest_df, on=['key', 'dept_name', 'res_name', 'year'], how='left').fillna(0)
回答by desmond
I have the same problem with duplicate columns after left joins even when the columns' data is identical. I did a query and found out that NaN values are considered different even if both columns are NaN in pandas 0.14. BUT once you upgrade to 0.15, this problem disappears, which explains why it later works for you, you probably upgraded.
即使列的数据相同,我在左连接后也遇到重复列的相同问题。我做了一个查询,发现即使 Pandas 0.14 中的两列都是 NaN,NaN 值也被认为是不同的。但是一旦你升级到 0.15,这个问题就会消失,这就解释了为什么它后来对你有用,你可能升级了。
回答by fenandosr
Not exactly the answer, but pd.mergeprovides an argument to help you decide which suffixesshould be added to your overlapping columns:
不完全是答案,但pd.merge提供了一个参数来帮助您决定应将哪些后缀添加到重叠列中:
merge_df = pd.merge(holding_df, invest_df, on='key', how='left', suffixes=('_holding', '_invest')).fillna(0)
More meaningful names could be helpful if you decide to keep both (or to check why the columns are kept).
如果您决定保留两者(或检查保留列的原因),更有意义的名称可能会有所帮助。
See documentationfor more reference.
请参阅文档以获取更多参考。

