Duplicated rows when merging dataframes in python

Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must likewise follow CC BY-SA, cite the original address, and attribute it to the original authors (not me): StackOverflow

Original question: http://stackoverflow.com/questions/39019591/
Asked by Roberto Bertinetti
I am currently merging 2 dataframes with an inner join, but after merging, I see all the rows are duplicated even when the columns I did the merge upon contain the same values. In detail:
import pandas as pd

list_1 = pd.read_csv('list_1.csv')
list_2 = pd.read_csv('list_2.csv')
merged_list = pd.merge(list_1, list_2, on=['email_address'], how='inner')
with the following input and results:
list_1:
email_address, name, surname
[email protected], john, smith
[email protected], john, smith
[email protected], elvis, presley
list_2:
email_address, street, city
[email protected], street1, NY
[email protected], street1, NY
[email protected], street2, LA
merged_list:
email_address, name, surname, street, city
[email protected], john, smith, street1, NY
[email protected], john, smith, street1, NY
[email protected], john, smith, street1, NY
[email protected], john, smith, street1, NY
[email protected], elvis, presley, street2, LA
[email protected], elvis, presley, street2, LA
My question is, shouldn't it be like this?
merged_list (how I would like it to be :D):
email_address, name, surname, street, city
[email protected], john, smith, street1, NY
[email protected], john, smith, street1, NY
[email protected], elvis, presley, street2, LA
How can I make it so that it becomes like this? Thanks a lot for your help!
Answered by piRSquared
list_2_nodups = list_2.drop_duplicates()
pd.merge(list_1, list_2_nodups, on=['email_address'])
The duplicate rows are expected. Each john smith in list_1 matches with each john smith in list_2, so the merge produces the Cartesian product of the matching rows. I had to drop the duplicates in one of the lists. I chose list_2.
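
For completeness, here is a minimal, self-contained sketch of the idea. The inline data and email addresses are hypothetical stand-ins (the addresses in the question are redacted), and it assumes each list looks like the tables shown above. It also shows drop_duplicates(subset='email_address') as a stricter variant that keeps one row per join key even when the other columns differ:

import pandas as pd

# Hypothetical stand-in data mirroring the tables in the question
list_1 = pd.DataFrame({
    'email_address': ['john@smith.com', 'john@smith.com', 'elvis@presley.com'],
    'name': ['john', 'john', 'elvis'],
    'surname': ['smith', 'smith', 'presley'],
})
list_2 = pd.DataFrame({
    'email_address': ['john@smith.com', 'john@smith.com', 'elvis@presley.com'],
    'street': ['street1', 'street1', 'street2'],
    'city': ['NY', 'NY', 'LA'],
})

# Naive merge: the 2 john smith rows in list_1 each match the 2 rows
# in list_2, yielding 2 x 2 = 4 duplicated rows for that key
naive = pd.merge(list_1, list_2, on=['email_address'])

# Dropping full-row duplicates from list_2 first avoids the blow-up
deduped = pd.merge(list_1, list_2.drop_duplicates(), on=['email_address'])

# If list_2 could contain rows with the same email but different
# street/city, deduplicate on the join key alone instead
strict = pd.merge(list_1,
                  list_2.drop_duplicates(subset='email_address'),
                  on=['email_address'])

With this data, deduped has two john smith rows and one elvis presley row, which matches the output the asker wanted.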