Python Pandas Merge - How to avoid duplicating columns
Notice: this page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep it under the same license, cite the original URL, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/19125091/
Asked by user1911092
I am attempting a merge between two data frames. Each data frame has two index levels (date, cusip). Some columns, for example currency and adj_date, appear in both data frames.
What is the best way to merge these by index without taking two copies of currency and adj_date?
Each data frame has 90 columns, so I am trying to avoid writing everything out by hand.
df:
                  currency  adj_date    data_col1  ...
date        cusip
2012-01-01  XSDP  USD       2012-01-03  0.45
...
df2:
                  currency  adj_date    data_col2  ...
date        cusip
2012-01-01  XSDP  USD       2012-01-03  0.45
...
If I do:
import pandas as pd
dfNew = pd.merge(df, df2, left_index=True, right_index=True, how='outer')
I get:
dfNew:
                  currency_x  adj_date_x  data_col2  ...  currency_y  adj_date_y
date        cusip
2012-01-01  XSDP  USD         2012-01-03  0.45            USD         2012-01-03
Thank you!
Accepted answer by EdChum
You can work out which columns exist only in df2 and use that to select a subset of df2's columns for the merge.
cols_to_use = df2.columns.difference(df.columns)
Then perform the merge (note that cols_to_use is an Index object, but it has a handy tolist() method).
dfNew = pd.merge(df, df2[cols_to_use], left_index=True, right_index=True, how='outer')
This will avoid any columns clashing in the merge.
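As a rough end-to-end illustration of the approach above, here is a minimal sketch. The sample values and the two-level index are made up to resemble the frames in the question, and the names idx and df_new are only illustrative.

```python
import pandas as pd

# Hypothetical sample data shaped like the frames in the question
idx = pd.MultiIndex.from_tuples(
    [(pd.Timestamp("2012-01-01"), "XSDP")], names=["date", "cusip"]
)
df = pd.DataFrame(
    {"currency": ["USD"], "adj_date": ["2012-01-03"], "data_col1": [0.45]},
    index=idx,
)
df2 = pd.DataFrame(
    {"currency": ["USD"], "adj_date": ["2012-01-03"], "data_col2": [0.45]},
    index=idx,
)

# Columns that exist in df2 but not in df
cols_to_use = df2.columns.difference(df.columns)

# Merge on the index, taking only the non-overlapping columns from df2
df_new = pd.merge(
    df, df2[cols_to_use], left_index=True, right_index=True, how="outer"
)
print(df_new.columns.tolist())
```

Note that difference() returns the column names sorted, so the columns taken from df2 may not keep their original order. This sketch also assumes the shared columns hold the same values in both frames, since only the copy from df is kept.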
Answer by rprog
Answer by JulienD
I'm new to Pandas, but I wanted to achieve the same thing: automatically avoiding column names with _x or _y and removing duplicate data. I finally did it by using this answer and this one from Stackoverflow.
sales.csv
city;state;units
Mendocino;CA;1
Denver;CO;4
Austin;TX;2
revenue.csv
branch_id;city;revenue;state_id
10;Austin;100;TX
20;Austin;83;TX
30;Austin;4;TX
47;Austin;200;TX
20;Denver;83;CO
30;Springfield;4;I
merge.py
import pandas

def drop_y(df):
    # list comprehension of the cols that end with '_y'
    to_drop = [x for x in df if x.endswith('_y')]
    df.drop(to_drop, axis=1, inplace=True)

sales = pandas.read_csv('data/sales.csv', delimiter=';')
revenue = pandas.read_csv('data/revenue.csv', delimiter=';')
result = pandas.merge(sales, revenue, how='inner', left_on=['state'], right_on=['state_id'], suffixes=('', '_y'))
drop_y(result)
result.to_csv('results/output.csv', index=True, index_label='id', sep=';')
When executing the merge command, I replace the _x suffix with an empty string, and then I can remove the columns ending with _y.
output.csv
id;city;state;units;branch_id;revenue;state_id
0;Denver;CO;4;20;83;CO
1;Austin;TX;2;10;100;TX
2;Austin;TX;2;20;83;TX
3;Austin;TX;2;30;4;TX
4;Austin;TX;2;47;200;TX
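For anyone who wants to try the suffix trick without creating files, here is a small self-contained sketch. It reads abbreviated copies of the two CSVs from in-memory strings (only two revenue rows are kept, which is my shortening), and it drops the _y columns without inplace=True, which is a stylistic choice of this sketch rather than part of the answer above.

```python
import io

import pandas as pd

# Abbreviated, in-memory stand-ins for sales.csv and revenue.csv
sales_csv = "city;state;units\nMendocino;CA;1\nDenver;CO;4\nAustin;TX;2\n"
revenue_csv = "branch_id;city;revenue;state_id\n10;Austin;100;TX\n20;Denver;83;CO\n"

sales = pd.read_csv(io.StringIO(sales_csv), sep=";")
revenue = pd.read_csv(io.StringIO(revenue_csv), sep=";")

# Left-hand columns keep their names; clashing right-hand columns get '_y'
merged = pd.merge(
    sales,
    revenue,
    how="inner",
    left_on=["state"],
    right_on=["state_id"],
    suffixes=("", "_y"),
)

# Drop every column that received the '_y' suffix
to_drop = [col for col in merged.columns if col.endswith("_y")]
result = merged.drop(columns=to_drop)
print(result.columns.tolist())
```

One caveat: this discards the right-hand copy of each clashing column without checking that it matches the left-hand one. Because the join key here is the state only, the city column from revenue could differ from the one in sales, so it is worth confirming the copies really are duplicates before dropping them.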