Pandas populate new dataframe column based on matching columns in another dataframe
Source: http://stackoverflow.com/questions/39816671/
License: CC BY-SA 4.0. You are free to use and share this content, but you must attribute it to the original authors on Stack Overflow.
Asked by user3471881
I have a df which contains my main data, which has one million rows. My main data also has 30 columns. Now I want to add another column to my df called category. The category is a column in df2, which contains around 700 rows and two other columns that will match with two columns in df.
I begin by setting an index in df2 and df that will match between the frames; however, some of the index values in df2 don't exist in df.
The remaining columns in df2 are called AUTHOR_NAME and CATEGORY.
The relevant column in df is called AUTHOR_NAME.
Some of the AUTHOR_NAME values in df don't exist in df2, and vice versa.
The instruction I want is: when index in df matches index in df2 and title in df matches title in df2, add category to df; else add NaN in category.
Example data:
df2

       AUTHOR_NAME  CATEGORY
Index
Pub1   author1      main
Pub2   author1      main
Pub3   author1      main
Pub1   author2      sub
Pub3   author2      sub
Pub2   author4      sub

df

       AUTHOR_NAME  ...n amount of other columns
Index
Pub1   author1
Pub2   author1
Pub1   author2
Pub1   author3
Pub2   author4

expected_result

       AUTHOR_NAME  CATEGORY  ...n amount of other columns
Index
Pub1   author1      main
Pub2   author1      main
Pub1   author2      sub
Pub1   author3      NaN
Pub2   author4      sub
If I use df2.merge(df, left_index=True, right_index=True, how='left', on=['AUTHOR_NAME']) my df becomes three times bigger than it is supposed to be.
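One plausible reason for the blow-up, sketched on the example data from the question: when the frames are matched on the index alone, every repeated index label in one frame pairs with every matching label in the other. Matching on both the index and AUTHOR_NAME keeps one row per row of df. Moving Index into a regular column with reset_index and guarding with validate="many_to_one" are choices made for this illustration, not something stated in the question.

```python
import pandas as pd

# Example frames from the question, with "Index" as the index name.
df2 = pd.DataFrame(
    {
        "AUTHOR_NAME": ["author1", "author1", "author1", "author2", "author2", "author4"],
        "CATEGORY": ["main", "main", "main", "sub", "sub", "sub"],
    },
    index=pd.Index(["Pub1", "Pub2", "Pub3", "Pub1", "Pub3", "Pub2"], name="Index"),
)
df = pd.DataFrame(
    {"AUTHOR_NAME": ["author1", "author1", "author2", "author3", "author4"]},
    index=pd.Index(["Pub1", "Pub2", "Pub1", "Pub1", "Pub2"], name="Index"),
)

# Index-only match: each "Pub1" row in df pairs with every "Pub1" row in df2,
# so the result has more rows than df.
index_only = df.merge(df2, left_index=True, right_index=True, how="left")

# Match on both keys. validate="many_to_one" raises if a key pair repeats in df2,
# which is the situation that would multiply rows.
both_keys = (
    df.reset_index()
    .merge(df2.reset_index(), on=["Index", "AUTHOR_NAME"], how="left", validate="many_to_one")
    .set_index("Index")
)

print(len(df), len(index_only), len(both_keys))
```

On this data the index-only merge returns 10 rows for a 5-row df, while the two-key merge returns 5 rows with NaN for the unmatched author3.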
So I thought maybe merging was the wrong way to go about this. What I am really trying to do is use df2 as a lookup table and then return type values to df depending on whether certain conditions are met.
def calculate_category(df2, d):
    category_row = df2[(df2["Index"] == d["Index"]) & (df2["AUTHOR_NAME"] == d["AUTHOR_NAME"])]
    return str(category_row['CATEGORY'].iat[0])

df.apply(lambda d: calculate_category(df2, d), axis=1)
However, this throws me an error:
IndexError: ('index out of bounds', u'occurred at index 7614')
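The IndexError most likely appears because the boolean filter matched no row for some record in df (for example, an author missing from df2), so .iat[0] is called on an empty Series. Below is a minimal sketch of a guarded variant, assuming Index is an ordinary column in both frames as in the snippet above; the small frames are made up for illustration. A row-wise apply will be slow on a million rows, so this is meant to explain the error rather than to recommend the approach.

```python
import numpy as np
import pandas as pd

# Tiny made-up frames; "Index" is a regular column here.
df2 = pd.DataFrame({
    "Index": ["Pub1", "Pub2"],
    "AUTHOR_NAME": ["author1", "author4"],
    "CATEGORY": ["main", "sub"],
})
df = pd.DataFrame({
    "Index": ["Pub1", "Pub1", "Pub2"],
    "AUTHOR_NAME": ["author1", "author3", "author4"],
})

def calculate_category(df2, d):
    # Rows of df2 whose Index and AUTHOR_NAME both equal those of the current row d.
    match = df2[(df2["Index"] == d["Index"]) & (df2["AUTHOR_NAME"] == d["AUTHOR_NAME"])]
    if match.empty:
        # No match: return NaN instead of indexing into an empty Series.
        return np.nan
    return str(match["CATEGORY"].iat[0])

df["CATEGORY"] = df.apply(lambda d: calculate_category(df2, d), axis=1)
print(df)
```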
Answered by piRSquared
Consider the following dataframes df and df2:
import pandas as pd

df = pd.DataFrame(dict(
    AUTHOR_NAME=list('AAABBCCCCDEEFGG'),
    title=list('zyxwvutsrqponml')
))
df2 = pd.DataFrame(dict(
    AUTHOR_NAME=list('AABCCEGG'),
    title=list('zwvtrpml'),
    CATEGORY=list('11223344')
))
Option 1: merge
df.merge(df2, how='left')
Option 2: join
cols = ['AUTHOR_NAME', 'title']
df.join(df2.set_index(cols), on=cols)
Both options yield the same result.
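As a quick check, the sketch below rebuilds the two frames from this answer, runs both options and compares the CATEGORY column they produce. The variable names merged and joined are introduced here for illustration only.

```python
import pandas as pd

df = pd.DataFrame(dict(
    AUTHOR_NAME=list('AAABBCCCCDEEFGG'),
    title=list('zyxwvutsrqponml')
))
df2 = pd.DataFrame(dict(
    AUTHOR_NAME=list('AABCCEGG'),
    title=list('zwvtrpml'),
    CATEGORY=list('11223344')
))

# Option 1: merge on the columns the two frames share (AUTHOR_NAME and title).
merged = df.merge(df2, how='left')

# Option 2: join df's key columns against df2 indexed by the same keys.
cols = ['AUTHOR_NAME', 'title']
joined = df.join(df2.set_index(cols), on=cols)

# Rows of df without a matching (AUTHOR_NAME, title) pair in df2 get NaN.
print(merged)
print(merged['CATEGORY'].fillna('-').tolist() == joined['CATEGORY'].fillna('-').tolist())
```

Both results keep the 15 rows of df; the 7 rows whose pair also occurs in df2 receive a CATEGORY and the rest are NaN.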
Answered by Nickil Maveli
APPROACH 1:
You could use concat instead and drop the duplicated values present in the Index and AUTHOR_NAME columns combined. After that, use isin to check membership:
df_concat = pd.concat([df2, df]).reset_index().drop_duplicates(['Index', 'AUTHOR_NAME'])
df_concat.set_index('Index', inplace=True)
df_concat[df_concat.index.isin(df.index)]
Note: The column Index is assumed to be set as the index of both DataFrames.
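For illustration, here is Approach 1 applied to the example frames from the question, with Index set as the index name of both frames as the note assumes. One caveat worth keeping in mind: the final isin filter looks at the index label only, so a (label, author) pair that exists only in df2 would survive if its label also occurs in df. That case does not arise in this sample.

```python
import pandas as pd

df2 = pd.DataFrame(
    {
        "AUTHOR_NAME": ["author1", "author1", "author1", "author2", "author2", "author4"],
        "CATEGORY": ["main", "main", "main", "sub", "sub", "sub"],
    },
    index=pd.Index(["Pub1", "Pub2", "Pub3", "Pub1", "Pub3", "Pub2"], name="Index"),
)
df = pd.DataFrame(
    {"AUTHOR_NAME": ["author1", "author1", "author2", "author3", "author4"]},
    index=pd.Index(["Pub1", "Pub2", "Pub1", "Pub1", "Pub2"], name="Index"),
)

# Stack df2 on top of df; for a repeated (Index, AUTHOR_NAME) pair the first
# occurrence is kept, which is the df2 row carrying CATEGORY.
df_concat = pd.concat([df2, df]).reset_index().drop_duplicates(["Index", "AUTHOR_NAME"])
df_concat = df_concat.set_index("Index")

# Keep only rows whose index label also appears in df.
result = df_concat[df_concat.index.isin(df.index)]
print(result)
```

Note that the row order follows the concatenation (df2 rows first), not the original order of df.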
APPROACH 2:
Use join after setting the index columns correctly, as shown:
df2.set_index(['Index', 'AUTHOR_NAME'], inplace=True)
df.set_index(['Index', 'AUTHOR_NAME'], inplace=True)
df.join(df2).reset_index()
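The snippet above needs Index to be a regular column when set_index is called. If Index is already the index, as in the question's example, a variant that appends AUTHOR_NAME as a second index level might look like the sketch below. The views column is a made-up stand-in for the "n other columns" of df.

```python
import pandas as pd

df2 = pd.DataFrame(
    {
        "AUTHOR_NAME": ["author1", "author1", "author1", "author2", "author2", "author4"],
        "CATEGORY": ["main", "main", "main", "sub", "sub", "sub"],
    },
    index=pd.Index(["Pub1", "Pub2", "Pub3", "Pub1", "Pub3", "Pub2"], name="Index"),
)
df = pd.DataFrame(
    {
        "AUTHOR_NAME": ["author1", "author1", "author2", "author3", "author4"],
        "views": [10, 20, 30, 40, 50],  # hypothetical extra column
    },
    index=pd.Index(["Pub1", "Pub2", "Pub1", "Pub1", "Pub2"], name="Index"),
)

# append=True keeps the existing Index level and adds AUTHOR_NAME as a second level.
left = df.set_index("AUTHOR_NAME", append=True)
right = df2.set_index("AUTHOR_NAME", append=True)

# join defaults to how="left": every row of df is kept, and CATEGORY is NaN
# where df2 has no matching (Index, AUTHOR_NAME) pair.
expected_result = left.join(right).reset_index(level="AUTHOR_NAME")
print(expected_result)
```

Unlike the inplace=True calls above, this version leaves df and df2 unmodified.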
Answered by kiltannen
While the other answers here give very good and elegant solutions to the question, I have found a resource that both answers this question elegantly and gives a clear, straightforward set of examples on how to join/merge dataframes, effectively teaching LEFT, RIGHT, INNER and OUTER joins.
Join And Merge Pandas Dataframe
I honestly feel that anyone else looking into this topic will also want to examine his examples.
Answered by Bhagabat Behera
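You may try the following. It merges both datasets using the specified columns as keys.
# 'CATEGORY' exists only in df2, so it cannot serve as the merge key; the keys shared by both frames are Index and AUTHOR_NAME.
expected_result = pd.merge(df.reset_index(), df2.reset_index(), on=['Index', 'AUTHOR_NAME'], how='left')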
Answered by NikoTumi
Try:
df = df.combine_first(df2)