Python Pandas Merge - How to avoid duplicate columns

Question

Asked by user1911092

I am attempting a merge between two data frames. Each data frame has two index levels (date, cusip). Some columns appear in both data frames (currency and adj_date, for example).

What is the best way to merge these by index without taking two copies of currency and adj_date?

Each data frame has 90 columns, so I am trying to avoid writing everything out by hand.

df:                 currency  adj_date   data_col1 ...
date        cusip
2012-01-01  XSDP      USD      2012-01-03   0.45
...

df2:                currency  adj_date   data_col2 ...
date        cusip
2012-01-01  XSDP      USD      2012-01-03   0.45
...

If I do:

dfNew = merge(df, df2, left_index=True, right_index=True, how='outer')

I get

dfNew:              currency_x  adj_date_x   data_col2 ... currency_y adj_date_y
date        cusip
2012-01-01  XSDP      USD      2012-01-03   0.45             USD         2012-01-03

Thank you! ...

Answer 1

Accepted answer by EdChum

You can work out the columns that are only in one DataFrame and use this to select a subset of columns in the merge.

cols_to_use = df2.columns.difference(df.columns)

Then perform the merge (note that cols_to_use is an Index object, but it has a handy tolist() method).

dfNew = merge(df, df2[cols_to_use], left_index=True, right_index=True, how='outer')

This will avoid any columns clashing in the merge.
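
As a minimal, self-contained sketch of the approach above: the sample frames below are hypothetical stand-ins shaped like the ones in the question, and pd.merge is used in place of the bare merge call. Note that Index.difference returns its result sorted, so the extra columns from df2 may not keep their original order.

```python
import pandas as pd

# Hypothetical sample data shaped like the frames in the question.
idx = pd.MultiIndex.from_tuples(
    [(pd.Timestamp("2012-01-01"), "XSDP")], names=["date", "cusip"]
)
df = pd.DataFrame(
    {"currency": ["USD"], "adj_date": ["2012-01-03"], "data_col1": [0.45]},
    index=idx,
)
df2 = pd.DataFrame(
    {"currency": ["USD"], "adj_date": ["2012-01-03"], "data_col2": [0.45]},
    index=idx,
)

# Columns that exist only in df2 (an Index object, returned sorted).
cols_to_use = df2.columns.difference(df.columns)

# Merge on the shared (date, cusip) index, taking only the new columns from df2.
df_new = pd.merge(
    df, df2[cols_to_use], left_index=True, right_index=True, how="outer"
)

print(df_new.columns.tolist())
# ['currency', 'adj_date', 'data_col1', 'data_col2']
```

One thing to keep in mind with how='outer': rows that exist only in df2 would end up with NaN in currency and adj_date, because those columns are taken from df alone.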

Answer 2

Answer by rprog

I use the suffixes option in .merge():

dfNew = df.merge(df2, left_index=True, right_index=True,
                 how='outer', suffixes=('', '_y'))
dfNew.drop(dfNew.filter(regex='_y$').columns.tolist(),axis=1, inplace=True)

Thanks @ijoseph
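
A possible variation on the same idea, sketched with hypothetical sample data: if one of the frames already has a column whose real name ends in _y, the regex above would drop it too. Using a more distinctive suffix (the name _dup here is an arbitrary choice, not something from the answer) avoids that, and reassigning the result avoids inplace=True.

```python
import pandas as pd

# Hypothetical frames; df2 has a genuine column named "delta_y".
idx = pd.Index(["XSDP", "XSDQ"], name="cusip")
df = pd.DataFrame(
    {"currency": ["USD", "EUR"], "data_col1": [0.45, 0.50]}, index=idx
)
df2 = pd.DataFrame(
    {"currency": ["USD", "EUR"], "delta_y": [1.0, 2.0]}, index=idx
)

# Keep the left-hand names unchanged and tag right-hand duplicates with "_dup".
merged = df.merge(
    df2, left_index=True, right_index=True, how="outer", suffixes=("", "_dup")
)

# Drop only the tagged duplicates; "delta_y" survives.
merged = merged.drop(columns=merged.filter(regex="_dup$").columns)

print(merged.columns.tolist())
# ['currency', 'data_col1', 'delta_y']
```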

Answer 3

Answer by JulienD

I'm brand new to Pandas, but I wanted to achieve the same thing: automatically avoiding column names with _x or _y and removing duplicate data. I finally did it by using this answer and this one from Stack Overflow.

sales.csv

    city;state;units
    Mendocino;CA;1
    Denver;CO;4
    Austin;TX;2

revenue.csv

    branch_id;city;revenue;state_id
    10;Austin;100;TX
    20;Austin;83;TX
    30;Austin;4;TX
    47;Austin;200;TX
    20;Denver;83;CO
    30;Springfield;4;I

merge.py

import pandas

def drop_y(df):
    # list comprehension of the cols that end with '_y'
    to_drop = [x for x in df if x.endswith('_y')]
    df.drop(to_drop, axis=1, inplace=True)


sales = pandas.read_csv('data/sales.csv', delimiter=';')
revenue = pandas.read_csv('data/revenue.csv', delimiter=';')

result = pandas.merge(sales, revenue,  how='inner', left_on=['state'], right_on=['state_id'], suffixes=('', '_y'))
drop_y(result)
result.to_csv('results/output.csv', index=True, index_label='id', sep=';')

When executing the merge command I replace the _x suffix with an empty string, and then I can remove the columns ending with _y.

output.csv

    id;city;state;units;branch_id;revenue;state_id
    0;Denver;CO;4;20;83;CO
    1;Austin;TX;2;10;100;TX
    2;Austin;TX;2;20;83;TX
    3;Austin;TX;2;30;4;TX
    4;Austin;TX;2;47;200;TX
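
For anyone who wants to try this without creating the files, here is a rough self-contained sketch of the same steps. It reads the two CSV samples from in-memory strings via io.StringIO (an assumption for convenience; the script above reads from data/ and writes to results/) and drops the _y columns without inplace=True.

```python
import io

import pandas as pd

# The two sample files from above, held in memory instead of on disk.
sales_csv = """city;state;units
Mendocino;CA;1
Denver;CO;4
Austin;TX;2
"""
revenue_csv = """branch_id;city;revenue;state_id
10;Austin;100;TX
20;Austin;83;TX
30;Austin;4;TX
47;Austin;200;TX
20;Denver;83;CO
30;Springfield;4;I
"""

sales = pd.read_csv(io.StringIO(sales_csv), sep=";")
revenue = pd.read_csv(io.StringIO(revenue_csv), sep=";")

# Left-hand columns keep their names; overlapping right-hand ones get "_y".
result = pd.merge(
    sales, revenue, how="inner",
    left_on="state", right_on="state_id", suffixes=("", "_y"),
)

# Drop the right-hand duplicates (here only "city_y").
result = result.drop(columns=[c for c in result.columns if c.endswith("_y")])

# Print the CSV text instead of writing results/output.csv.
print(result.to_csv(sep=";", index_label="id"))
```

Because the join keys have different names (state and state_id), both are kept in the result; only the overlapping city column from revenue.csv is removed.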
