原文地址: http://stackoverflow.com/questions/42444078/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me):
Stack Overflow
Pandas merge creates unwanted duplicate entries
Asked by Nic Scozzaro
I'm new to Pandas and I want to merge two datasets that have similar columns. Each column has some values that do not appear in the other, in addition to many identical values. Each column also contains some duplicates that I'd like to keep. My desired output is shown below. Adding how='inner' or how='outer' does not yield the desired result.
import pandas as pd
dict1 = {'A':[2,2,3,4,5]}
dict2 = {'A':[2,2,3,4,5]}
df1 = pd.DataFrame(dict1)
df2 = pd.DataFrame(dict2)
print(pd.merge(df1,df2))
output:
A
0 2
1 2
2 2
3 2
4 3
5 4
6 5
desired/expected output:
A
0 2
1 2
2 3
3 4
4 5
Please let me know how/if I can achieve the desired output using merge, thank you!
EDIT: To clarify why I'm confused about this behavior: if I simply add another column, the merge doesn't produce four 2's but only two 2's, so I would expect my first example to produce two 2's as well. Why does the behavior seem to change? What is pandas doing?
import pandas as pd
dict1 = {'A':[2,2,3,4,5],
'B':['red','orange','yellow','green','blue'],
}
dict2 = {'A':[2,2,3,4,5],
'B':['red','orange','yellow','green','blue'],
}
df1 = pd.DataFrame(dict1)
df2 = pd.DataFrame(dict2)
print(pd.merge(df1,df2))
output:
A B
0 2 red
1 2 orange
2 3 yellow
3 4 green
4 5 blue
However, based on the first example I would expect:
A B
0 2 red
1 2 orange
2 2 red
3 2 orange
4 3 yellow
5 4 green
6 5 blue
Accepted answer by Rene
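import pandas as pd
dict1 = {'A':[2,2,3,4,5]}
dict2 = {'A':[2,2,3,4,5]}
# reset_index() copies each row's position into an 'index' column
df1 = pd.DataFrame(dict1).reset_index()
df2 = pd.DataFrame(dict2).reset_index()
# merging on 'A' alone leaves the positions as index_x and index_y
df = df1.merge(df2, on = 'A')
# keep only the pairs that came from the same row position
df = df.loc[df.index_x == df.index_y, ['A']].reset_index(drop=True)
print(df)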
Output:
A
0 2
1 2
2 3
3 4
4 5
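The index comparison in this answer works here because the repeated values sit at the same row positions in both frames. As a sketch of an alternative that is not part of the original answer, one could number each repeated value with groupby(...).cumcount() and merge on that counter as well. The column name `occurrence` is a hypothetical helper introduced only for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'A': [2, 2, 3, 4, 5]})
df2 = pd.DataFrame({'A': [2, 2, 3, 4, 5]})

# Number repeats within each value of 'A': the first 2 gets 0, the second 2 gets 1.
# 'occurrence' is an illustrative helper column, not something from the original answers.
df1['occurrence'] = df1.groupby('A').cumcount()
df2['occurrence'] = df2.groupby('A').cumcount()

# Merging on both columns pairs the n-th occurrence in df1 with the n-th in df2,
# so each duplicate is matched exactly once.
result = df1.merge(df2, on=['A', 'occurrence']).drop(columns='occurrence')
print(result)
```

Under these assumptions the two 2's are kept but not multiplied, regardless of where they appear in each frame.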
Answer by AsheKetchum
import pandas as pd
dict1 = {'A':[2,2,3,4,5]}
dict2 = {'A':[2,2,3,4,5]}
df1 = pd.DataFrame(dict1)
df1['index'] = range(len(df1))  # explicit row-position column
df2 = pd.DataFrame(dict2)
df2['index'] = range(len(df2))
# merge on both 'A' and 'index', then drop the helper column
result = df1.merge(df2).drop(columns='index')
print(result)
The idea is to merge on the matching index values as well as the matching 'A' column values.
Without the index column, merge pairs every row of df1 with every row of df2 that has the same key, so the first 2 in df1 was matched to both the first and second 2 in df2, and the second 2 in df1 was matched to both of them as well, giving four rows.
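To make the row counts concrete, here is a minimal sketch (the names `expected_rows`, `merged_on_a`, and `merged_on_ab` are illustrative, not from the original answers). It relies on two behaviors of merge: an inner merge emits, for each key, the product of that key's counts in the two frames, and omitting `on` makes pandas join on every column the frames share. The second point would explain why the two-column example in the question returns only five rows.

```python
import pandas as pd

df1 = pd.DataFrame({'A': [2, 2, 3, 4, 5],
                    'B': ['red', 'orange', 'yellow', 'green', 'blue']})
df2 = df1.copy()

# Per key: (count in df1) * (count in df2). The value 2 appears twice in each frame.
expected_rows = int((df1['A'].value_counts() * df2['A'].value_counts()).sum())

merged_on_a = pd.merge(df1, df2, on='A')  # key is 'A' only
merged_on_ab = pd.merge(df1, df2)         # no `on`: key is every shared column ('A' and 'B')

print(expected_rows, len(merged_on_a), len(merged_on_ab))
```

With 'B' in the key, (2, 'red') and (2, 'orange') are distinct keys, so each row finds exactly one partner.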
If you try this, you will see what I am talking about.
dict1 = {'A':[2,2,3,4,5]}
dict2 = {'A':[2,2,3,4,5]}
df1 = pd.DataFrame(dict1)
df1['index'] = [i for i in range(len(df1))]
df2 = pd.DataFrame(dict2)
df2['index'] = [i for i in range(len(df2))]
df1.merge(df2, on = 'A')
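As a rough illustration of what that merge returns, the sketch below collects the index pairs produced for the value 2 and then keeps only the rows whose two index values agree. The names `pairs_for_2` and `aligned` are hypothetical, and the `_x`/`_y` suffixes are the pandas defaults for overlapping non-key columns.

```python
import pandas as pd

df1 = pd.DataFrame({'A': [2, 2, 3, 4, 5]})
df1['index'] = range(len(df1))
df2 = pd.DataFrame({'A': [2, 2, 3, 4, 5]})
df2['index'] = range(len(df2))

# 'index' is not a key here, so it comes back twice as index_x and index_y.
merged = df1.merge(df2, on='A')

# Every (df1 row, df2 row) combination for the key 2, sorted for a stable display.
twos = merged[merged['A'] == 2]
pairs_for_2 = sorted((int(x), int(y)) for x, y in zip(twos['index_x'], twos['index_y']))
print(pairs_for_2)

# Keeping only pairs from the same row position removes the cross matches.
aligned = merged[merged['index_x'] == merged['index_y']]
print(aligned['A'].tolist())
```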
Answer by Qehu
Did you try df.drop_duplicates()?
import pandas as pd
dict1 = {'A':[2,2,3,4,5]}
dict2 = {'A':[2,2,3,4,5]}
df1 = pd.DataFrame(dict1)
df2 = pd.DataFrame(dict2)
df=pd.merge(df1,df2)
df_new=df.drop_duplicates()
print(df)
print(df_new)
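This removes the repeated rows, but note that it also collapses the two 2's into a single row (leaving 2, 3, 4, 5), so it does not keep the duplicates the question asks for.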