pandas: merged (inner join) data frame has more rows than the original data frames

Question

Asked by Edamame

I am using Python 3.4 in a Jupyter Notebook and trying to merge two data frames as shown below:

df_A.shape
(204479, 2)

df_B.shape
(178, 3)

new_df = pd.merge(df_A, df_B,  how='inner', on='my_icon_number')
new_df.shape
(266788, 4)

I expected new_df to have fewer rows than df_A, since the merge is an inner join. Why does new_df actually have more rows than df_A?

Here is what I actually want:

My df_A looks like:

 id           my_icon_number
-----------------------------
 A1             123             
 B1             234
 C1             123
 D1             235
 E1             235
 F1             400

and my df_B looks like:

my_icon_number    color      size
-------------------------------------
  123              blue      small
  234              red       large 
  235              yellow    medium

Then I want new_df to be:

 id           my_icon_number     color       size
--------------------------------------------------
 A1             123              blue        small
 B1             234              red         large
 C1             123              blue        small
 D1             235              yellow      medium
 E1             235              yellow      medium

I don't really want to remove duplicates of my_icon_number in df_A. Any idea what I missed here?

Answer 1

Accepted answer by piRSquared

Because you have duplicates of the merge column in both data sets, you'll get k * m rows for each merge column value, where k is the number of rows with that value in data set 1 and m is the number of rows with that value in data set 2.
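The k * m rule can be checked before merging. The following is a minimal sketch, not part of the original answer: the helper name `expected_inner_rows` and the two small frames are hypothetical, and it assumes a single key column with no missing values.

```python
import pandas as pd

def expected_inner_rows(left, right, key):
    """Predict the row count of an inner merge on a single key column."""
    k = left[key].value_counts()   # rows per key value in the left frame
    m = right[key].value_counts()  # rows per key value in the right frame
    # Multiplication aligns on the key value; keys present on only one side
    # become NaN and are dropped, mirroring what an inner join discards.
    return int(k.mul(m).dropna().sum())

# Hypothetical data: 123 appears twice on both sides, 400 only on the left.
left = pd.DataFrame({"my_icon_number": [123, 123, 234, 400],
                     "id": ["A1", "C1", "B1", "F1"]})
right = pd.DataFrame({"my_icon_number": [123, 123, 234],
                      "color": ["blue", "blue", "red"]})

predicted = expected_inner_rows(left, right, "my_icon_number")
actual = len(pd.merge(left, right, how="inner", on="my_icon_number"))
print(predicted, actual)
```

Here the key 123 contributes 2 * 2 rows and 234 contributes 1 * 1, so the merged frame has 5 rows even though the left frame has only 4.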

Try drop_duplicates:

dfa = df_A.drop_duplicates(subset=['my_icon_number'])
dfb = df_B.drop_duplicates(subset=['my_icon_number'])

new_df = pd.merge(dfa, dfb, how='inner', on='my_icon_number')
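Since the question asks to keep the repeated my_icon_number values in df_A, one possible variation is to deduplicate only df_B. This is a sketch under assumptions rather than the answerer's method: the frames reuse the sample values from the question, the duplicated 123 row in df_B is hypothetical (the question does not show which rows repeat), and the `validate` argument requires a pandas version that supports it.

```python
import pandas as pd

df_A = pd.DataFrame({
    "id": ["A1", "B1", "C1", "D1", "E1", "F1"],
    "my_icon_number": [123, 234, 123, 235, 235, 400],
})
# Hypothetical: the repeated 123 row stands in for whatever duplicates the real df_B holds.
df_B = pd.DataFrame({
    "my_icon_number": [123, 123, 234, 235],
    "color": ["blue", "blue", "red", "yellow"],
    "size": ["small", "small", "large", "medium"],
})

# Keep one row per key in the lookup table only; df_A is left untouched.
lookup = df_B.drop_duplicates(subset=["my_icon_number"])

# validate="many_to_one" makes pandas raise if the right-hand keys are not unique.
new_df = pd.merge(df_A, lookup, how="inner", on="my_icon_number",
                  validate="many_to_one")
print(new_df)
```

With a unique key on the right side, every matching row of df_A appears exactly once, so the result cannot have more rows than df_A. Note that drop_duplicates keeps the first row per key by default, which only matters if the duplicated rows in df_B differ in their other columns.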

Example

In this example, the only value in common is 4, but it appears 3 times in each data set. That means I should get 9 total rows in the resulting merge, one for every combination.

import pandas as pd

df_A = pd.DataFrame(dict(my_icon_number=[1, 2, 3, 4, 4, 4], other_column1=range(6)))
df_B = pd.DataFrame(dict(my_icon_number=[4, 4, 4, 5, 6, 7], other_column2=range(6)))

pd.merge(df_A, df_B, how='inner', on='my_icon_number')

   my_icon_number  other_column1  other_column2
0               4              3              0
1               4              3              1
2               4              3              2
3               4              4              0
4               4              4              1
5               4              4              2
6               4              5              0
7               4              5              1
8               4              5              2
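To catch this kind of row multiplication early, the merge can be asked to check key uniqueness itself. The sketch below is an addition, not part of the original answer; it reuses the example frames and assumes a pandas version that provides the `validate` argument and `pd.errors.MergeError`.

```python
import pandas as pd

df_A = pd.DataFrame(dict(my_icon_number=[1, 2, 3, 4, 4, 4], other_column1=range(6)))
df_B = pd.DataFrame(dict(my_icon_number=[4, 4, 4, 5, 6, 7], other_column2=range(6)))

try:
    # one_to_one requires the key to be unique in both frames.
    pd.merge(df_A, df_B, how="inner", on="my_icon_number", validate="one_to_one")
    raised = False
except pd.errors.MergeError as exc:
    raised = True
    print("merge rejected:", exc)
```

Because 4 is repeated in both frames, the check fails and no 9-row result is produced silently.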
